We sought to validate feedback generated by an AI-enabled case-based learning (CBL) platform against feedback from expert faculty scorers.
CBL is integrated into many medical curricula. Large language models (LLMs) may provide a means to augment standard CBL pedagogical approaches with patient-like interactions, with the potential both to simulate patient encounters and to offer real-time feedback and coaching.
Four student investigators each completed five LLM-based interactive cases, generating twenty cases for feedback. In each case, learners assumed the role of the evaluating clinician, eliciting a history, performing a physical exam, and ordering diagnostic testing. Feedback was provided by an LLM and by two human faculty experts. The feedback was evaluated using several metrics, including word length, reference to provided key clinical points, and two previously validated measures of feedback quality: the QuAL and EFeCT scores. These scores were completed by two neurologists.
LLM and human expert feedback were similar in word length and number of sentences. In the feedback provided, the LLM commented on 20/20 (100%) of the key learning points, compared to 39/60 (65%) for the human experts. Faculty evaluation of feedback on history taking skills (HPI) demonstrated higher QuAL and EFeCT scores for the LLM than for the human experts (P < 0.001). For feedback on the assessment and plan (A&P), faculty scorers found no differences between LLM and human feedback.
LLMs can provide feedback to learners on case-based interactions in a manner comparable to human experts. The difference in expert ratings between LLM and human feedback for HPI skills, but not for A&P skills, may reflect the more nuanced grading the latter requires. A hybrid framework combining LLM-generated feedback with faculty input may offer high-quality, equitably accessible, and timely feedback for students.