Large Language Models Provide Appropriate and Timely Narrative Feedback as Compared to Experts for Neurology Case-based Learning
Carolyn Qian1, Christina Gao2, Haelynn Gim1, Sang-O Park1, Kelly Hou2, Edward Kong1, Benjamin Cook2, Jasmin Le2, Brandon Stretton2, John Maddison3, Liam McCoy4, Luke Collins2, Andrew Vanlint2, Rudy Goh2, Ashley Paul5, Haatem Reda1, Tamara Kaplan1, Hannah Fruitman6, Sasha Severin6, Atikul Miah6, Aye Thant7, Rani Priyanka Vasireddy8, Doris Kung9, Adam Karp6, Galina Gheihman1, Stephen Bacchi1
1Harvard Medical School, 2Adelaide Medical School, 3Lyell McEwin Hospital, 4Massachusetts Institute of Technology, 5Johns Hopkins University, 6New York Medical College, 7Washington University in St. Louis, School of Medicine, 8The University of Texas at Tyler Health Science Center, 9Baylor College of Medicine
Objective:

We sought to validate feedback generated by an AI-enabled case-based learning (CBL) platform compared to feedback from expert faculty scorers.

Background:

CBL is integrated into many medical curricula. Large language models (LLMs) may provide a means to augment standard CBL pedagogical approaches with patient-like interactions. Further, LLMs have the potential both to simulate patient encounters and to offer real-time feedback and coaching.

Design/Methods:

Four student investigators each completed five LLM-based interactive cases, generating twenty cases for feedback. In each case, learners assumed the role of the evaluating clinician, eliciting a history, performing a physical examination, and ordering diagnostic testing. Feedback was provided by an LLM and by two human faculty experts. The feedback was evaluated using several metrics, including word length, reference to provided key clinical points, and two previously validated measures of feedback quality: the QuAL and EFeCT scores. These scores were completed by two neurologists.

Results:

LLM and human expert feedback were similar in word length and number of sentences. In the feedback provided, the LLM commented on 20/20 (100%) of the key learning points, compared with 39/60 (65%) for the human experts. Faculty evaluation of feedback on history-taking skills (HPI) demonstrated higher QuAL and EFeCT scores for the LLM than for the human experts (P < 0.001). For feedback related to the assessment and plan (A&P), no differences were found between LLM and human feedback according to faculty scorers.

Conclusions:

LLMs can provide feedback to learners on case-based interactions in a manner comparable to human experts. The difference in expert ratings between LLM and human feedback for HPI skills, but not for A&P skills, may reflect the more nuanced grading the latter requires. A hybrid framework combining LLM-generated feedback with faculty input may offer high-quality, equitably accessible, and timely feedback for students.

10.1212/WNL.0000000000216667
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.