Using Large Language Models to Identify Candidates for Pediatric Epilepsy Surgery
Sarah Chowdhury1, Nuran Golbasi1, Alexander Zhao1, Carson Gundlach1, Ashwin Mahesh1, Natasha Basma2, Zachary Grinspan2
1MD Program, 2Pediatrics and Population Health Sciences, Weill Cornell Medicine
Objective:
To evaluate the ability of two large language models to identify pediatric candidates for epilepsy surgery and to recommend palliative or definitive procedures.
Background:
Only one-third of eligible children receive epilepsy surgery, with low rates among underserved populations. Surgical options include definitive procedures that target seizure freedom (resection, ablation, disconnection) and palliative procedures that aim to reduce seizure frequency and/or severity. Large language models (LLMs) may be able to extract information from clinical notes and recommend surgical treatments.
Design/Methods:
We conducted a retrospective observational cohort study. We compiled surgical eligibility criteria via literature review and rapid qualitative analysis with input from 7 pediatric epileptologists. Notes and vignettes of children with refractory epilepsy were manually classified as "surgical" or "not surgical," and "surgical" cases were further classified as "definitive" or "palliative." The PaLM 2 Bison (Google, Mountain View, CA) and GPT-4 (OpenAI, San Francisco, CA) LLMs were prompted using zero-shot and few-shot approaches. Performance was evaluated with sensitivity, specificity, and positive (PPV) and negative (NPV) predictive values.
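For readers less familiar with the reported metrics, the short Python sketch below shows how sensitivity, specificity, PPV, and NPV are computed from true/false positive and negative counts. It is illustrative only; the function names and example counts are hypothetical and are not drawn from the study data.

# Illustrative sketch only (not the study's code): the evaluation metrics
# reported in this abstract, computed from binary classification counts.
# All names and example counts below are hypothetical.

def sensitivity(tp, fn):
    # True positive rate: surgical candidates correctly identified as surgical.
    return tp / (tp + fn)

def specificity(tn, fp):
    # True negative rate: non-surgical cases correctly identified as not surgical.
    return tn / (tn + fp)

def ppv(tp, fp):
    # Positive predictive value: proportion of "surgical" predictions that are correct.
    return tp / (tp + fp)

def npv(tn, fn):
    # Negative predictive value: proportion of "not surgical" predictions that are correct.
    return tn / (tn + fn)

# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, tn, fn = 12, 2, 7, 3
print(f"sensitivity={sensitivity(tp, fn):.1%}, specificity={specificity(tn, fp):.1%}, "
      f"PPV={ppv(tp, fp):.1%}, NPV={npv(tn, fn):.1%}")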
Results:
Literature review and interviews indicated that seizures refractory to 2 or more anti-seizure medications would make a child eligible for epilepsy surgery. Factors favoring definitive surgery included concordant imaging, neuropsychological testing, and semiology; preservation of eloquent areas; and certain genetic mutations. Across 24 cases, both LLMs identified surgical candidates with >90% sensitivity under all prompting approaches. The few-shot approach had the best overall performance, with 88.9% specificity for both LLMs; PPV of 93.3% for Bison and 93.8% for GPT-4; and NPV of 88.9% for Bison and 100% for GPT-4. Neither LLM effectively distinguished definitive from palliative candidates. Bison often memorized prompts and presented fabricated data ("hallucinations"), reducing specificity.
Conclusions:
Preliminary data suggest that LLMs can identify candidates for epilepsy surgery with >90% sensitivity. Performance was highest with the few-shot approach. Both LLMs were less effective at distinguishing candidacy for definitive versus palliative procedures and were affected by hallucination and memorization.
10.1212/WNL.0000000000205207