Multimodal Bone Fracture Detection: Combining ResNet-50 Transfer Learning With Synthetic Radiological Captions

This study presents a multimodal pipeline for bone fracture detection by integrating a fine-tuned ResNet-50 CNN with synthetic radiological captions generated by MedGemma-4B, a medical vision-language model. The framework incorporates Grad-CAM for visual interpretability and a Gradio-based interface for clinical use. Using a public dataset of 420 radiographs (289 non-fractured and 131 fractured) from Mendeley Data, the images were pre-processed to 224 × 224 pixels and enhanced through random flips, rotations, and contrast adjustments. The ResNet-50 backbone was adapted for single-channel input and trained using selective fine-tuning. MedGemma-4B, which was optimized with 4-bit quantization, generated structured radiological findings. Textual features (262-dimensional TF-IDF vectors) were concatenated with 2,048-dimensional image embeddings to form a 2,310-dimensional multimodal vector for the final classification. While the unimodal ResNet-50 achieved a validation accuracy of 69.05% (F1 = 0.58), the multimodal fusion reached 95.24% accuracy, with perfect recall (1.0) and an F1-score of 0.9286, effectively eliminating false negatives. This demonstrates that augmenting deep visual features with AI-generated descriptions significantly improves diagnostic performance on small, imbalanced datasets, bridging the gap between research and clinical utility.
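The feature-fusion step described above can be illustrated with a minimal sketch. It assumes the fusion is a plain per-sample concatenation along the feature axis, as the reported dimensions (2,048 + 262 = 2,310) suggest. The function name `fuse_features`, the batch size, and the random placeholder data are hypothetical and are not taken from the study's code.

```python
import numpy as np

IMAGE_DIM = 2048  # ResNet-50 embedding size reported in the abstract
TEXT_DIM = 262    # TF-IDF vector size reported in the abstract


def fuse_features(image_emb: np.ndarray, text_vec: np.ndarray) -> np.ndarray:
    """Concatenate per-sample image and text features along the feature axis."""
    if image_emb.shape[0] != text_vec.shape[0]:
        raise ValueError("image and text batches must contain the same number of samples")
    return np.concatenate([image_emb, text_vec], axis=1)


# Placeholder data (assumption): random values stand in for real
# ResNet-50 embeddings and TF-IDF rows of the generated captions.
rng = np.random.default_rng(0)
batch_size = 4  # hypothetical batch size
image_emb = rng.normal(size=(batch_size, IMAGE_DIM))
text_vec = rng.random(size=(batch_size, TEXT_DIM))

fused = fuse_features(image_emb, text_vec)
print(fused.shape)  # (4, 2310)
```

The fused matrix would then be passed to whatever final classifier is used. Because the two feature blocks may differ in scale, normalizing them before concatenation is a common precaution, though the abstract does not state whether this was done.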

Carlos Florez
Universidad de Cordoba
Colombia

Sarah Córdoba
Universidad de Cordoba
Colombia

Jorge Gomez
Universidad de Cordoba
Colombia

Daniel Salas
Universidad de Cordoba
Colombia

Oswaldo Velez
Universidad de Cordoba
Colombia