Stanford Study Says Higher ImageNet Performance Does Not Improve Medical Image Interpretation
In the paper CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation, Stanford University researchers examine the assumption that boosting deep learning model performance on ImageNet translates to better performance on medical imaging tasks. AI guru Andrew Ng participated in the study, which, perhaps surprisingly, concludes that this is not the case.
In 2017, Ng and his Stanford team introduced the groundbreaking deep learning (DL) model CheXNet, which detects pneumonia from chest X-rays with accuracy exceeding that of practicing radiologists. Four years later, most DL approaches for chest X-ray interpretation rely on models pretrained on the popular large-scale image database ImageNet. These de facto transfer learning approaches assume that ImageNet-pretrained weights lead to better model performance, and that architectures that score higher on ImageNet will also perform better on chest X-ray interpretation.
Architecture improvements on ImageNet, however, may not translate into gains on medical imaging tasks. Why? The researchers note that older, established architectures outperform newer architectures generated through neural architecture search on ImageNet when evaluated on the large CheXpert radiograph dataset, and that “this finding suggested that search may have overfit to ImageNet to the detriment of medical task performance, and ImageNet may not be an appropriate benchmark for selecting architecture for medical imaging tasks.”
The researchers compared the transfer performance and parameter efficiency of 16 popular convolutional architectures on five tasks from CheXpert, a dataset comprising 224,316 chest X-rays of 65,240 patients. The architectures (DenseNet 121/169/201, ResNet 18/34/50/101, Inception V3/V4, MNASNet, EfficientNet B0/B1/B2/B3, and MobileNet V2/V3) were evaluated using the area under the receiver operating characteristic curve (AUROC).
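To make the evaluation setup concrete, here is a minimal sketch (not the authors' released code) of how an ImageNet-pretrained torchvision model can be given a multi-label head and scored per task with AUROC from scikit-learn. The choice of DenseNet-121, the dummy data, and the 3-channel input convention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.metrics import roc_auc_score

NUM_TASKS = 5  # the five CheXpert competition tasks

# Load an ImageNet-pretrained DenseNet-121 and swap its 1000-class head
# for a 5-output layer (one sigmoid output per pathology).
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, NUM_TASKS)
model.eval()

# Dummy batch standing in for preprocessed chest X-rays; grayscale images
# are commonly replicated to 3 channels to match ImageNet conventions.
images = torch.randn(8, 3, 224, 224)
labels = torch.cat([torch.zeros(4, NUM_TASKS), torch.ones(4, NUM_TASKS)])

with torch.no_grad():
    probs = torch.sigmoid(model(images))

# Per-task AUROC, averaged across the tasks.
aucs = [roc_auc_score(labels[:, t], probs[:, t]) for t in range(NUM_TASKS)]
print("mean AUROC:", sum(aucs) / len(aucs))
```

In practice the new head would of course be fine-tuned on CheXpert labels before evaluation; the sketch only shows the metric and the model surgery.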
Paper co-author and Stanford University PhD student Pranav Rajpurkar listed four key findings:
- Architecture improvements on ImageNet do not lead to improvements on chest X-ray interpretation
- Surprisingly, model size matters less than model family when models aren't pretrained
- ImageNet pretraining helps, especially for smaller models
- Many layers can be discarded to reduce a model's size without a performance drop (see the sketch below)
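The last finding refers to truncating a network's later feature blocks. The following is a rough, hypothetical sketch of that idea for a torchvision DenseNet-121; the exact blocks dropped, the pooled classifier head, and the 5-task output are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

full = models.densenet121(pretrained=True)

# DenseNet-121's feature extractor is a sequence of named blocks; dropping the
# final dense block (plus the transition layer before it and the last norm)
# removes a large share of the parameters. The kept features end after
# denseblock3 with 1024 output channels.
kept = nn.Sequential(*list(full.features.children())[:-3])

truncated = nn.Sequential(
    kept,
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(1024, 5),  # 5 CheXpert tasks; 1024 channels come out of denseblock3
)

x = torch.randn(2, 3, 224, 224)
print("output shape:", truncated(x).shape)  # torch.Size([2, 5])
print("full params:     ", sum(p.numel() for p in full.parameters()))
print("truncated params:", sum(p.numel() for p in truncated.parameters()))
```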
The researchers suggest that differences in task and data attributes could explain why ImageNet performance showed no correlation with CheXpert performance. Chest X-ray interpretation differs from natural image classification because decisions often depend on abnormalities confined to a small number of pixels, and the task has far fewer classes than natural image classification datasets. The data attributes also differ: chest X-rays are grayscale and tend to share similar spatial structures across images.
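Because chest X-rays are single-channel, one common workaround (an illustration of general practice, not something prescribed by the paper) is to fold the pretrained first-layer RGB filters into a single input channel by averaging them, so the rest of the ImageNet weights can still be reused on grayscale input.

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.densenet121(pretrained=True)

# Replace the 3-channel stem convolution with a 1-channel one whose weights
# are the average of the pretrained RGB filters.
old_conv = model.features.conv0  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
new_conv = nn.Conv2d(1, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)
with torch.no_grad():
    new_conv.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
model.features.conv0 = new_conv

xray = torch.randn(1, 1, 224, 224)   # single-channel dummy X-ray
print(model(xray).shape)             # torch.Size([1, 1000]) until the head is replaced
```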
The Stanford study re-examines standard assumptions and improves understanding of the transfer performance and parameter efficiency of ImageNet DL models for chest X-ray interpretation. The researchers believe this is the first study of its kind, and hope it will encourage further exploration of the relationship between ImageNet architectures and downstream medical task performance.
The paper CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation is available on arXiv.
Reporter: Fangyu Cai | Editor: Michael Sarazen