TY - GEN
T1 - A Reality Check of Vision-Language Pre-training in Radiology
T2 - 29th International Conference on Information Processing in Medical Imaging, IPMI 2025
AU - Silva-Rodríguez, Julio
AU - Dolz, Jose
AU - Ben Ayed, Ismail
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Vision language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision. Code and weights are available: https://github.com/jusiro/DLILP.
KW - Radiology
KW - Transfer learning
KW - Vision-language pre-training
UR - https://www.scopus.com/pages/publications/105013623264
DO - 10.1007/978-3-031-96625-5_20
M3 - Contribution to conference proceedings
AN - SCOPUS:105013623264
SN - 9783031966248
T3 - Lecture Notes in Computer Science
SP - 294
EP - 309
BT - Information Processing in Medical Imaging - 29th International Conference, IPMI 2025, Proceedings
A2 - Oguz, Ipek
A2 - Zhang, Shaoting
A2 - Metaxas, Dimitris N.
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 25 May 2025 through 30 May 2025
ER -