Artificial intelligence (AI) has revolutionized industries ranging from autonomous vehicles to image recognition. Despite AI's growing prevalence, its capability in clinical decision-making remains understudied. ChatGPT, a large language model (LLM) developed by OpenAI, has shown promise in medicine, even passing examinations such as the USMLE. However, its performance relative to human test-takers, particularly on the neurology board examination, remains largely unexplored.
With an accuracy of 75.0% (N=400, 95% Confidence Interval (CI): 70.5-79.2%), GPT-4 outperformed both the average test-taker score of 69% and the passing score of 70%. The model's accuracy was not associated with question length (Odds Ratio (OR) = 0.999 per one-word increase, 95% CI: 0.993-1.005, P=0.693) but was lower for questions involving images (61.1% versus 78.0%, P=0.003) and for those requiring higher-order thinking (71.7% versus 81.0%, P=0.040). The model's accuracy was positively correlated with test-taker performance on each question (OR = 1.56 per 10% increase in test-taker accuracy, 95% CI: 1.37-1.78, P<0.001). GPT-4 excelled in specific neurology subsections, such as neuromuscular disorders, pharmacology, and cognitive and behavioral disorders.
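As a quick illustration of the headline statistic, the sketch below reproduces the reported 95% CI for GPT-4's overall accuracy (300 of 400 questions correct). The use of an exact Clopper-Pearson interval is an assumption for this example, since the interval method is not specified here, but it recovers the reported bounds.

```python
# Minimal sketch: binomial 95% CI for 300/400 correct answers,
# assuming an exact Clopper-Pearson interval (an assumption; the
# study's actual interval method is not stated in this abstract).
from scipy.stats import beta

n_correct, n_total = 300, 400  # 75.0% accuracy on N=400 questions
alpha = 0.05

# Clopper-Pearson exact bounds via quantiles of the beta distribution
lower = beta.ppf(alpha / 2, n_correct, n_total - n_correct + 1)
upper = beta.ppf(1 - alpha / 2, n_correct + 1, n_total - n_correct)

print(f"{n_correct / n_total:.1%} (95% CI: {lower:.1%}-{upper:.1%})")
# -> 75.0% (95% CI: 70.5%-79.2%), matching the reported interval
```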
While AI has immense potential to support medical education and clinical decision-making, rigorous verification, validation, and physician supervision are necessary to ensure its accuracy and reliability in the complex field of neurology.