Multimodal Bone Fracture Detection: Combining ResNet-50 Transfer Learning with Synthetic Radiological Captions
This study presents a multimodal pipeline for bone fracture detection that integrates a fine-tuned ResNet-50 CNN with synthetic radiological captions generated by MedGemma-4B, a medical vision-language model. The framework incorporates Grad-CAM for visual interpretability and a Gradio-based interface for clinical use. Using a public dataset of 420 radiographs (289 non-fractured and 131 fractured) from Mendeley Data, the images were resized to 224 × 224 pixels and augmented with random flips, rotations, and contrast adjustments. The ResNet-50 backbone was adapted for single-channel input and trained using selective fine-tuning. MedGemma-4B, run with 4-bit quantization, generated structured radiological findings. Textual features (262-dimensional TF-IDF vectors) were concatenated with 2,048-dimensional image embeddings to form a 2,310-dimensional multimodal vector for the final classification. While the unimodal ResNet-50 achieved a validation accuracy of 69.05% (F1 = 0.58), the multimodal fusion reached 95.24% accuracy, with perfect recall (1.0) and an F1-score of 0.9286, meaning that no fractured case in the validation set was missed. This demonstrates that augmenting deep visual features with AI-generated descriptions substantially improves diagnostic performance on small, imbalanced datasets, helping to bridge the gap between research and clinical utility.
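
The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_features`, the constants `IMAGE_DIM` and `TEXT_DIM`, and the random arrays standing in for real ResNet-50 embeddings and TF-IDF vectors are all assumptions made for illustration. Only the dimensions (2,048 + 262 = 2,310) come from the abstract.

```python
import numpy as np

IMAGE_DIM = 2048  # image embedding size stated in the abstract
TEXT_DIM = 262    # TF-IDF vector size stated in the abstract


def fuse_features(image_emb: np.ndarray, text_tfidf: np.ndarray) -> np.ndarray:
    """Concatenate per-sample image and text features along the feature axis.

    Hypothetical helper: both inputs are 2-D arrays with one row per radiograph.
    """
    if image_emb.shape[0] != text_tfidf.shape[0]:
        raise ValueError("image and text features must describe the same samples")
    return np.concatenate([image_emb, text_tfidf], axis=1)


# Stand-in data (assumption): random values replace real embeddings and
# real TF-IDF weights so that the sketch runs without a model or captions.
rng = np.random.default_rng(0)
n_samples = 4
image_emb = rng.normal(size=(n_samples, IMAGE_DIM)).astype(np.float32)
text_tfidf = rng.random(size=(n_samples, TEXT_DIM)).astype(np.float32)

fused = fuse_features(image_emb, text_tfidf)
print(fused.shape)  # (4, 2310)
```

Each row of `fused` would then be passed to the final classifier. The abstract does not specify that classifier, so it is left out of the sketch.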
