Publications

Explainable Disease Classification: Exploring Grad-CAM Analysis of CNNs and ViTs

  • Authority: Journal of Advances in Information Technology
  • Category: Journal Publication

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are playing an increasingly crucial role in early diagnosis and treatment across medical fields. As these AI models are integrated into clinical practice, the need for explainable AI tools, like Gradient-weighted Class Activation Mapping (Grad-CAM), becomes paramount to building clinician trust and ensuring the reliability of AI-driven diagnoses. However, a gap exists in the literature regarding comprehensive, quantitative, and qualitative comparisons of CNN and ViT performance across diverse medical imaging tasks, particularly those involving variations in object scale. This study compares CNN-based and ViT-based models for two medical imaging tasks: diabetic retinopathy detection from fundus images (small objects) and pneumonia detection from chest X-rays (large objects). We evaluate popular CNN architectures (ResNet, EfficientNet, VGG, Inception) and ViT models (ViT-Base, ViT-Large, ViT-Huge), using both quantitative metrics and expert qualitative assessments. We also analyze Grad-CAM’s effectiveness for visualizing regions of interest in these models. Our results show that ViT-Large outperforms other models on X-rays, while EfficientNet excels on fundus images. However, Grad-CAM struggles to highlight small regions of interest, particularly in diabetic retinopathy, revealing a limitation in current explainable AI methods. This work underscores the need to optimize explainability tools and contributes to a better understanding of CNN and ViT strengths in medical imaging.
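
For readers unfamiliar with Grad-CAM, the sketch below shows the general idea: the feature maps of a late convolutional layer are weighted by the spatially pooled gradients of the target class score, summed, and passed through a ReLU to produce a class-discriminative heatmap. This is a minimal illustration using PyTorch hooks on a stock ResNet-50, not the code or model configuration used in the paper; the layer choice and helper names are assumptions.

```python
# Minimal Grad-CAM sketch (assumption: PyTorch + torchvision ResNet-50,
# not the authors' pipeline). Hooks capture the last conv block's
# activations and gradients; the heatmap is ReLU(sum_k alpha_k * A_k).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()

layer = model.layer4[-1]  # last convolutional block of ResNet-50 (assumed target layer)
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

def grad_cam(image, class_idx=None):
    """image: (1, 3, H, W) normalized tensor; returns an (H, W) heatmap in [0, 1]."""
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    A = activations["feat"]                        # (1, K, h, w) feature maps
    dY = gradients["feat"]                         # (1, K, h, w) gradients of class score
    weights = dY.mean(dim=(2, 3), keepdim=True)    # alpha_k: global-average-pooled gradients
    cam = F.relu((weights * A).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]
```

Because the heatmap is produced at the resolution of the chosen feature layer and then upsampled, small lesions (such as microaneurysms in fundus images) can be smeared or missed entirely, which is consistent with the limitation the abstract describes.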