A Comparative Analysis of Next-Generation Large Language Models in Neurological Diagnostics
Aditi Agarwal1, Anshum Patel2, Ipshita Garg3
1Bharati Vidyapeeth University Medical College, 2Mayo Clinic, Florida, 3University of Texas at Tyler
Objective:
To evaluate and compare the diagnostic accuracy and confidence calibration of two next-generation Large Language Models (LLMs), Gemini 2.5 Pro and ChatGPT-5, on a diverse set of complex neurological case vignettes.
Background:
The application of LLMs in clinical diagnostics is rapidly evolving. However, rigorous comparative evaluation of leading models like Gemini 2.5 Pro and ChatGPT-5 on complex neurological cases with multimodal data remains limited. Such benchmarks are critical to understanding their potential role and limitations in clinical decision support.
Design/Methods:
A dataset of 51 neurological case vignettes, sourced from academic materials and encompassing a wide range of diagnoses, was used. Each case included patient history, examination findings, and test results. Both LLMs were given the same standardized prompt for each case and asked to provide a final diagnosis. Diagnostic accuracy was determined by comparison with a gold-standard diagnosis, and McNemar's test was used to compare paired accuracy rates. Model confidence scores (0–10) were analyzed with t-tests to compare calibration between correct and incorrect diagnoses.
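For transparency, a minimal Python sketch of this statistical workflow follows. The per-case outcome and confidence arrays are placeholders (the study data are not reproduced here), and the choice of scipy/statsmodels, an exact McNemar test, and Welch's t-test are assumptions about tooling rather than statements of the software actually used.

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.contingency_tables import mcnemar

# Placeholder paired outcomes (1 = correct, 0 = incorrect), one entry per case.
rng = np.random.default_rng(0)
gpt5_correct = rng.integers(0, 2, size=50)
gemini_correct = rng.integers(0, 2, size=50)

# 2x2 paired-agreement table; McNemar's test uses the discordant cells.
table = np.zeros((2, 2), dtype=int)
for g, m in zip(gpt5_correct, gemini_correct):
    table[g, m] += 1
print("McNemar p =", mcnemar(table, exact=True).pvalue)

# Placeholder 0-10 confidence scores; Welch's t-test compares calibration
# between correctly and incorrectly diagnosed cases for one model.
confidence = rng.uniform(7, 10, size=50)
correct_conf = confidence[gpt5_correct == 1]
incorrect_conf = confidence[gpt5_correct == 0]
print("calibration p =", ttest_ind(correct_conf, incorrect_conf, equal_var=False).pvalue)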
Results:
ChatGPT-5 was correct in 45/50 cases (90.0%; 95% CI 78.6–95.7%) and Gemini 2.5 Pro in 44/50 (88.0%; 95% CI 76.2–94.4%). McNemar's test showed no statistically significant difference in accuracy (p=0.739). The models did, however, differ significantly in confidence calibration: ChatGPT-5 was better calibrated, with mean confidence significantly higher for correct than for incorrect diagnoses (9.42 vs. 7.80, p=0.04), whereas Gemini 2.5 Pro's confidence was uniformly high and did not differ between correct (9.66) and incorrect (9.67) diagnoses. Qualitative analysis identified instances of critical, high-confidence errors by both models.
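As a reproducibility note, the reported confidence intervals match Wilson score intervals for the stated proportions (the interval method is not named in the abstract, so this is an inference). They can be checked with statsmodels:

from statsmodels.stats.proportion import proportion_confint

# Wilson score intervals for the two accuracy proportions reported above.
for correct, n in [(45, 50), (44, 50)]:
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"{correct}/{n}: {100 * lo:.1f}-{100 * hi:.1f}%")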
Conclusions:
Both models achieved high diagnostic accuracy with no significant paired difference, although ChatGPT-5 showed better confidence calibration. These findings highlight their potential as supportive tools in neurological education and clinical decision-making. Future studies should focus on real-world clinical validation and the integration of these tools into safe, clinician-led workflows.
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.