To evaluate the translational potential of large vision-language models (VLMs) for retinal imaging–based Alzheimer’s disease (AD) screening.
We evaluated and compared five large VLMs for detecting AD from retinal color fundus photographs (CFPs): LLaMA (Large Language Model Meta AI), LLaVA (Large Language and Vision Assistant), LLaVA-Med (Large Language and Vision Assistant for Biomedicine), Qwen (Alibaba Cloud's family of large models), and RetinalGPT, a domain-specific model. The Mayo Clinic AD dataset comprises 283 CFPs from 102 AD subjects and 258 CFPs from 129 cognitively normal controls. All models were run locally on secured servers. RetinalGPT was fine-tuned on retinal imaging data; the others were general-purpose or trained on biomedical corpora. Model outputs were categorized as positive, negative, or uncertain, and performance was assessed using the uncertainty ratio (UR), classification accuracy (ACC), F1 score, and a reasoning quality score (RS).
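For illustration, the sketch below shows one plausible way UR, ACC, and F1 could be computed from categorized model outputs; it is not the authors' code, the function and variable names are hypothetical, and the study's exact conventions (e.g., whether uncertain cases are excluded from ACC and F1, and how RS is scored) may differ.

```python
# Minimal sketch (assumed conventions, not the study's implementation) of
# computing UR, ACC, and F1 from per-image outputs labeled
# "positive" / "negative" / "uncertain", with AD as the positive class.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(outputs, labels):
    """outputs: list of 'positive'/'negative'/'uncertain'; labels: list of 'AD'/'control'."""
    n = len(outputs)
    ur = sum(o == "uncertain" for o in outputs) / n  # uncertainty ratio (UR)

    # One possible convention: compute ACC and F1 only over decided cases.
    y_pred = [1 if o == "positive" else 0 for o in outputs if o != "uncertain"]
    y_true = [1 if l == "AD" else 0 for o, l in zip(outputs, labels) if o != "uncertain"]

    acc = accuracy_score(y_true, y_pred) if y_true else float("nan")
    f1 = f1_score(y_true, y_pred, zero_division=0) if y_true else float("nan")
    return {"UR": ur, "ACC": acc, "F1": f1}
```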
LLaMA exhibited limited diagnostic relevance (UR=74%, RS=1.08±0.49). LLaVA (UR=99%, RS=1.99±1.00) and LLaVA-Med (UR=99%, RS=2.75±0.70) demonstrated stronger reasoning but lacked specificity for AD-related features. Qwen employed deeper reasoning strategies and achieved modest accuracy despite having no medical fine-tuning (UR=23.6%, ACC=53.3%, F1=0.35, RS=1.62±1.30). In contrast, RetinalGPT outperformed the other models in classification relevance (UR=85.4%, ACC=55.7%, F1=0.61, RS=2.59±1.03), showing greater specificity and more concise decision-making.
Domain-specific adaptation substantially improved diagnostic alignment: RetinalGPT achieved higher accuracy and more medically relevant reasoning than the general-purpose VLMs, underscoring the necessity of targeted fine-tuning. These findings highlight both the promise and the limitations of deploying foundation models for sensitive clinical applications such as AD screening.