A Comparative Study of Large Vision-language Models for Retinal Image Based Alzheimer’s Disease Identification
Vincent Wang1, Yiyi Sun1, Oana Dumitrascu2
1Brophy College Prep, 2Mayo Clinic
Objective:

To evaluate the translational potential of large vision-language models (VLMs) for retinal imaging–based Alzheimer’s disease (AD) screening.

Background:

Retinal imaging offers a non-invasive window into neurodegeneration and is increasingly studied as a biomarker for AD. At the same time, VLMs have shown broad reasoning capabilities across general tasks, raising interest in their applicability to medical imaging. Whether such models can provide clinically meaningful insight for AD detection from retinal fundus photographs remains unknown.
Design/Methods:

We evaluated and compared the performance of five large VLMs in detecting AD from retinal color fundus photographs (CFPs): LLaMA (Large Language Model Meta AI), LLaVA (Large Language and Vision Assistant), LLaVA-Med (Large Language and Vision Assistant for Biomedicine), Qwen (LLMs developed by Alibaba Cloud), and the domain-specific model RetinalGPT. The Mayo Clinic AD dataset comprised 283 CFPs from 102 AD subjects and 258 CFPs from 129 cognitively normal controls. All models were executed locally on secured servers. RetinalGPT was fine-tuned on retinal imaging data; the others were general-purpose or trained on biomedical corpora. Model outputs were categorized as positive, negative, or uncertain. Evaluation metrics included uncertainty ratio (UR), classification accuracy (ACC), F1 score, and reasoning quality score (RS).
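A minimal sketch of how the reported metrics could be computed from categorized model outputs. The abstract does not specify how uncertain responses enter ACC and F1, so this sketch assumes they are counted for UR but excluded from ACC/F1, with "positive" (AD) as the positive class; the function name and handling of uncertain cases are illustrative assumptions, not the authors' implementation.

```python
def evaluate(predictions, labels):
    """predictions: 'positive' / 'negative' / 'uncertain' per image;
    labels: 'positive' / 'negative' ground truth per image."""
    n = len(predictions)
    # Uncertainty ratio: fraction of non-committal model outputs.
    ur = sum(p == "uncertain" for p in predictions) / n

    # Assumption: only decisive predictions contribute to ACC and F1.
    pairs = [(p, y) for p, y in zip(predictions, labels) if p != "uncertain"]
    tp = sum(p == "positive" and y == "positive" for p, y in pairs)
    fp = sum(p == "positive" and y == "negative" for p, y in pairs)
    fn = sum(p == "negative" and y == "positive" for p, y in pairs)
    tn = sum(p == "negative" and y == "negative" for p, y in pairs)

    acc = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"UR": ur, "ACC": acc, "F1": f1}
```

An alternative convention would score uncertain outputs as errors in ACC, which lowers accuracy as UR rises; the abstract's numbers are compatible with either choice.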

Results:

LLaMA exhibited limited diagnostic relevance (UR=74%, RS=1.08±0.49). LLaVA (UR=99%, RS=1.99±1.00) and LLaVA-Med (UR=99%, RS=2.75±0.70) demonstrated stronger reasoning but lacked specificity for AD features. Qwen employed deeper reasoning strategies, yielding modest accuracy (UR=23.6%, ACC=53.3%, F1=0.35, RS=1.62±1.30) despite no medical tuning. In contrast, RetinalGPT outperformed the other models in classification relevance (UR=85.4%, ACC=55.7%, F1=0.61, RS=2.59±1.03), showing improved specificity and more concise decision-making.

Conclusions:

Domain-specific adaptation substantially improved diagnostic alignment. RetinalGPT achieved higher accuracy and more medically relevant reasoning than general-purpose VLMs, underscoring the necessity of targeted fine-tuning. These findings highlight both the promise and the limitations of deploying foundation models in sensitive clinical applications such as AD screening.

10.1212/WNL.0000000000212817
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.