Large Language Models Provide Appropriate and Timely Narrative Feedback as Compared to Experts for Neurology Case-based Learning
Carolyn Qian1, Christina Gao2, Haelynn Gim1, Sang-O Park1, Kelly Hou2, Edward Kong1, Benjamin Cook2, Jasmin Le2, Brandon Stretton2, John Maddison3, Liam McCoy4, Luke Collins2, Andrew Vanlint2, Rudy Goh2, Ashley Paul5, Haatem Reda1, Tamara Kaplan1, Hannah Fruitman6, Sasha Severin6, Atikul Miah6, Aye Thant7, Rani Priyanka Vasireddy8, Doris Kung9, Adam Karp6, Galina Gheihman1, Stephen Bacchi1
1Harvard Medical School, 2Adelaide Medical School, 3Lyell McEwin Hospital, 4Massachusetts Institute of Technology, 5Johns Hopkins University, 6New York Medical College, 7Washington University in St. Louis, School of Medicine, 8The University of Texas at Tyler Health Science Center, 9Baylor College of Medicine
Objective:

We sought to validate feedback generated by an AI-enabled case-based learning (CBL) platform compared to feedback from expert faculty scorers.

Background:

CBL is integrated into many medical curricula. Large language models (LLMs) may provide a means to augment standard CBL pedagogical approaches with patient-like interactions. Further, LLMs have the potential both to simulate patient encounters and to offer real-time feedback and coaching.

Design/Methods:

Four student investigators each completed five LLM-based interactive cases, generating twenty cases for feedback. In each case, learners assumed the role of the evaluating clinician, eliciting a history, performing a physical examination, and ordering diagnostic testing. Feedback was provided by an LLM and by two human faculty experts. The feedback was evaluated using several metrics, including word length, reference to provided key clinical points, and two previously validated measures of feedback quality: the QuAL and EFeCT scores. These scores were completed by two neurologists.

Results:

LLM and human expert feedback were similar in word length and number of sentences. In the feedback provided, the LLM commented on 20/20 (100%) of the key learning points, compared with 39/60 (65%) for the human experts. Faculty evaluation of feedback on history-taking skills (HPI) demonstrated higher QuAL and EFeCT scores for the LLM than for the human experts (P < 0.001). For feedback related to the assessment and plan (A&P), no differences were found between LLM and human feedback according to faculty scorers.

Conclusions:

LLMs can provide feedback to learners on case-based interactions in a manner comparable to human experts. The difference in expert ratings between LLM and human feedback for HPI skills, but not for A&P skills, may reflect the more nuanced grading the latter requires. A hybrid framework combining LLM-generated feedback with faculty input may offer high-quality, equitably accessible, and timely feedback for students.

10.1212/WNL.0000000000216667
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.