Large Language Model Performance in Neurology Board Questions
Liqi Shu1, Daniel Mandel2, Oliver Y. Tang3, Zhaowei Jiang1, Eric Goldstein1, Ali Mahta1
1Brown University, 2University of Miami, 3University of Pittsburgh Medical Center
Objective:
To evaluate the performance of a large language model (LLM) on neurology board exam questions and compare it with that of human test-takers.
Background:

Artificial intelligence (AI) has transformed various industries, with notable impacts in areas such as autonomous vehicles and image recognition. Despite AI's prevalence, its capability in medical clinical decision-making remains understudied. ChatGPT, an LLM developed by OpenAI, has shown promise in the medical field, even passing exams such as the USMLE. However, its performance relative to human test-takers, particularly on the neurology board exam, remains underexplored.

Design/Methods:
This study evaluated the performance of the Generative Pre-trained Transformer version 4 (GPT-4) language model on the NeuroReady®: Board Prep question bank, which is considered representative of the American Board of Psychiatry and Neurology board exam. Four hundred questions were entered into separate GPT-4 chat sessions to avoid memory retention across questions, and the model's accuracy was assessed across question characteristics using appropriate statistical tests, with statistical significance set at a P-value below 0.05.
Results:

With an accuracy rate of 75.0% (N=400, 95% Confidence Interval (CI): 70.5-79.2%), GPT-4 outperformed the average test-taker score of 69% and the passing score of 70%. The model's accuracy was not associated with question length (Odds Ratio (OR) = 0.999 per one-word increase, 95% CI: 0.993-1.005, P=0.693) but was lower for questions involving images (61.1% versus 78.0%, P=0.003) and those requiring higher-order thinking (71.7% versus 81.0%, P=0.040). The model's accuracy was positively correlated with test-taker performance on each question (OR = 1.56 per 10-percentage-point increase in test-taker accuracy, 95% CI: 1.37-1.78, P<0.001). GPT-4 excelled in specific neurology subsections, such as neuromuscular disorders, pharmacology, and cognitive and behavioral disorders.
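For context, the headline accuracy and its confidence interval can be reproduced from the reported figures. The sketch below assumes 300/400 correct responses (the count implied by 75.0%, not stated explicitly in the abstract) and uses a normal-approximation interval; the reported CI of 70.5-79.2% suggests an exact binomial (Clopper-Pearson) method was used, so the bounds here differ slightly.

```python
import math

# Reproduce the abstract's headline statistic: 75.0% accuracy on N=400.
# Assumption: 300 correct answers (implied by 75.0% of 400).
correct = 300
n = 400
p_hat = correct / n

# Normal-approximation (Wald) 95% confidence interval for a proportion.
z = 1.96  # two-sided 95% critical value of the standard normal
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - z * se, p_hat + z * se

print(f"accuracy = {p_hat:.1%}, 95% CI: {low:.1%}-{high:.1%}")
```

An exact interval (e.g., `scipy.stats.binomtest(300, 400).proportion_ci()`) would match the reported 70.5-79.2% more closely than this approximation.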

Conclusions:

While AI has immense potential to assist medical education and clinical decision-making, rigorous verification, validation, and physician supervision are necessary to ensure its accuracy and reliability in the complex field of neurology.

10.1212/WNL.0000000000204763