Meta-Analysis of Deep Learning in Medical Imaging Highlights Need for Standardized Reporting and Guidelines

Artificial intelligence (AI) and its subfield, deep learning, open new avenues for descriptive, predictive, and prescriptive analysis, making possible insights that would otherwise be unattainable through manual analysis. Deep learning algorithms, such as convolutional neural networks (CNNs), differ from typical machine learning in that, during training, they learn sophisticated representations directly from raw data to improve pattern recognition, rather than relying on prior human engineering and domain expertise to structure the data and design feature extractors.
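
The contrast with hand-engineered feature extractors can be illustrated by the convolution operation at the heart of a CNN: instead of a human choosing the filter (for example, an edge detector), the network learns the filter weights from the raw pixels. As a minimal, illustrative sketch (not from the reviewed studies), the operation itself looks like this:

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation — the core operation a CNN layer applies.

    In a CNN, the kernel weights are learned from raw data during training,
    rather than being hand-designed by a domain expert.
    """
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1          # output height ("valid" padding)
    ow = len(image[0]) - kw + 1       # output width
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out
```

With a hand-crafted kernel such as `[[1, -1]]` this behaves as a vertical-edge detector; in a trained CNN, many such kernels are optimized automatically against the training labels.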

Hence, the purpose of this meta-analysis, published in NPJ Digital Medicine, is to quantify the diagnostic accuracy of deep learning (DL) across specialty-specific radiologic imaging modalities for identifying or classifying disease, and to assess differences in methodology and reporting in DL-based radiological diagnosis, i.e., the most common pitfalls pervasive throughout the field.

The authors aimed to assess the diagnostic accuracy of DL algorithms for detecting pathology in medical imaging. They searched Medline and Embase for studies published up to January 2020, identifying 11,921 records, of which 503 were included in the systematic review. The meta-analysis comprised 82 studies in ophthalmology, 82 in breast disease, and 115 in respiratory disease; a further 224 studies from other specialties were reviewed qualitatively. Only peer-reviewed studies reporting the diagnostic accuracy of DL algorithms for identifying pathology on medical imaging were included. The primary outcome was diagnostic accuracy; the secondary outcomes were study design and quality of reporting. Estimates were pooled using random-effects meta-analysis.
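
Random-effects pooling combines study-level estimates while allowing the true effect to vary between studies. As an illustrative sketch (the paper does not publish its code; the DerSimonian-Laird estimator used here is the most common implementation of this model), the pooling step can be written as:

```python
import math

def dersimonian_laird(estimates, variances):
    """Pool study-level estimates with a DerSimonian-Laird random-effects model.

    estimates: per-study effect estimates (e.g., logit-transformed accuracies)
    variances: per-study within-study variances
    Returns (pooled_estimate, pooled_standard_error, tau_squared).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]                       # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q statistic measures between-study heterogeneity
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                     # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

A nonzero tau-squared widens the confidence interval of the pooled estimate, which is exactly how the substantial between-study heterogeneity reported in this review propagates into uncertainty about the summary accuracy.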

In ophthalmology, the area under the curve (AUC) ranged from 0.933 to 1 for diagnosing diabetic retinopathy, age-related macular degeneration, and glaucoma using retinal fundus photographs and optical coherence tomography. In respiratory imaging, AUCs ranged from 0.864 to 0.937 for diagnosing lung nodules or lung cancer on chest X-rays or computed tomography (CT) scans. For breast imaging, AUCs ranged from 0.868 to 0.909 for diagnosing breast cancer using mammograms, ultrasound, magnetic resonance imaging (MRI), and digital breast tomosynthesis. There was significant heterogeneity across studies, with considerable variation in methodology, terminology, and outcome measures; this variation may lead to an overestimation of the diagnostic accuracy of DL algorithms in medical imaging. There is therefore an urgent need for AI-specific extensions of the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) guidelines, particularly the Standards for Reporting of Diagnostic Accuracy Studies (STARD), to address key issues in this field.
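
The AUC figures above summarize how well an algorithm ranks diseased cases above non-diseased ones: an AUC of 1 means perfect separation, 0.5 means chance. As a rough illustration (not the paper's code), AUC can be computed from raw scores and labels via its rank-based, Mann-Whitney formulation:

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 1 for diseased, 0 for non-diseased; scores: model outputs.
    AUC equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))
```

Because AUC is threshold-free, studies can report it without fixing an operating point, which is one reason it is so prevalent in this literature despite the variation in other outcome measures.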

First, the authors are of the opinion that numerous studies suffer from methodological deficiencies or poor reporting and are therefore not reliable sources for estimating diagnostic accuracy. Hence, the pooled estimates of diagnostic performance in the meta-analysis are highly uncertain and likely overestimate true accuracy.

Second, the authors did not conduct a quality assessment of the transparency of reporting in this review, because the current guideline, STARD 2015, was not designed for DL studies and is not fully applicable to the specifics and nuances of DL research.

Third, given how data are typically reported in DL studies, it was not possible to perform classical statistical comparisons of diagnostic accuracy across diseases or imaging modalities. Furthermore, since the study was conducted as an overview of the literature for each specialty, splitting the imaging modalities into subsets for inter-subset comparisons, which would allow heterogeneity and variance to be partitioned, was beyond the scope of this review.

For the quality of DL research to flourish in the future, the authors believe that implementing the following recommendations is a necessary starting point:

  1. Availability of large, open-source, diverse anonymized datasets with annotations.
  2. Collaboration with academic centers to utilize their expertise in pragmatic trial design and methodology.
  3. Creation of AI-specific reporting standards.

References:

1) Hill BG, Koback FL, Schilling PL. The risk of shortcutting in deep learning algorithms for medical imaging research. Sci Rep. 2024;14:29224. doi:10.1038/s41598-024-79838-6

2) Aggarwal R, Sounderajah V, Martin G, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med. 2021;4(1):65. doi:10.1038/s41746-021-00438-z
