Performance of Successive Generative Pre-trained Transformer (GPT) Models in Medical Cases and Board-style Questions
Anshum Patel1, Het Contractor1, Hayden Heninger1, Sai Krishna Vallamchetla1, Li Pengze1, Cui Tao1, Joseph Cheung1
1Mayo Clinic, Florida
Objective:
To benchmark the performance trajectory of successive GPT models in sleep medicine, assessing diagnostic capacity on clinical vignettes and domain knowledge on board-style multiple-choice questions (MCQs), and to test the hypothesis that performance gains are plateauing.
Background:
Large language models (LLMs) show rapid advances in clinical reasoning, yet their trajectory in specialized domains remains incompletely defined.
Design/Methods:
We conducted a comparative evaluation of six OpenAI models—GPT‑3.5 Turbo, GPT‑4‑Turbo, GPT‑4o, GPT‑4.1, GPT‑o3, and GPT‑5. Performance was benchmarked on two datasets: 78 case vignettes from the American Academy of Sleep Medicine (AASM) Case Book and 897 board‑style MCQs. Standardized single‑best‑answer prompts were used, runs were independent, and default decoding parameters were applied. Pairwise comparisons used McNemar’s exact tests with Holm–Bonferroni correction (two‑sided α=0.05). Models were accessed via API from July to September 2025; the AASM Case Book and subscription board‑review question banks were chosen to minimize training‑data contamination.
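For reproducibility, the pairwise analysis above can be sketched as follows. This is a minimal Python illustration, not the authors' code: it assumes per-item correctness (0/1) for each model on the same questions is available in a dictionary named scores, and the function and variable names are hypothetical.

    # Sketch of pairwise McNemar's exact tests with Holm-Bonferroni correction,
    # assuming scores = {"model_name": [0/1 per item, same item order for all models]}.
    from itertools import combinations
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar
    from statsmodels.stats.multitest import multipletests

    def pairwise_mcnemar_holm(scores, alpha=0.05):
        pairs, pvals = [], []
        for a, b in combinations(scores, 2):
            xa, xb = np.asarray(scores[a]), np.asarray(scores[b])
            # 2x2 paired-outcome table: rows = model a correct/incorrect,
            # columns = model b correct/incorrect on the same questions.
            table = [[np.sum((xa == 1) & (xb == 1)), np.sum((xa == 1) & (xb == 0))],
                     [np.sum((xa == 0) & (xb == 1)), np.sum((xa == 0) & (xb == 0))]]
            pvals.append(mcnemar(table, exact=True).pvalue)  # exact test on discordant pairs
            pairs.append((a, b))
        # Holm-Bonferroni adjustment across all model pairs, two-sided alpha = 0.05.
        reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="holm")
        return {pair: (p, r) for pair, p, r in zip(pairs, p_adj, reject)}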
Results:
Diagnostic accuracy rose with model generation: 74.4% (58/78) for GPT‑3.5 Turbo; 73.1% for GPT‑4‑Turbo; 78.2% for GPT‑4o; 89.7% for GPT‑4.1; 93.6% for GPT‑o3; and 91.0% for GPT‑5. MCQ accuracy increased from 56.9% (510/897) with GPT‑3.5 Turbo to 93.0% (834/897) with GPT‑5 (GPT‑o3, 92.4%; GPT‑4.1, 85.4%). Advanced models significantly outperformed earlier iterations on both tasks after adjustment (P<0.05); on MCQs, GPT‑5 and GPT‑o3 were statistically indistinguishable. By disorder subgroup, GPT‑o3 and GPT‑4.1 achieved 100% accuracy for insomnia and other sleep disorders, whereas GPT‑5 attained 100% accuracy for circadian rhythm and sleep‑related movement disorders. Gains were smaller between the most recent generations, suggesting decelerating improvement.
Conclusions:
Successive generations of GPT models demonstrate significant and progressive improvements in both diagnostic reasoning and knowledge recall in sleep medicine. The observed performance plateau among state-of-the-art models suggests that while LLMs show promise as clinical decision support tools, future progress toward clinical-grade reliability may necessitate a strategic shift from generalist training to domain-specific fine-tuning with curated medical data.
10.1212/WNL.0000000000217415
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.