AI LLM Decision Support in Epilepsy Surgery: A Real-world Retrospective Concordance Study Comparing GPT-5 to Expert Case Conference Consensus
Yixin Jia1, Dorris Luong2, Shahnawaz Karim3, Ning Zhong2
1Barnard College, 2Neurology, Kaiser Permanente Sacramento Medical Center, 3Neurology, Kaiser Permanente South Sacramento Medical Center
Objective:
To evaluate the concordance between artificial intelligence (AI) large language model (LLM) recommendations and multidisciplinary epilepsy surgery case conference (EPCC) consensus, and to explore the potential role of LLMs in supporting epilepsy surgery decisions in resource-constrained community settings.
Background:
Multidisciplinary EPCCs remain the gold standard for presurgical decision-making in drug-resistant epilepsy (DRE). LLMs, capable of integrating multimodal data, may serve as adjunct decision-support tools by synthesizing complex, unstructured clinical information. To date, no study has systematically compared LLM-generated recommendations with EPCC consensus or examined how concordance varies by model type or user expertise.
Design/Methods:
Standardized case vignettes—including clinical, EEG, imaging, and neuropsychological data—from patients with DRE evaluated at our center were submitted to multiple open-access LLMs. Model outputs were compared with EPCC consensus across four domains: primary recommendation, lateralization, invasive monitoring targets, and ancillary testing. Concordance was scored, and Cohen’s κ was calculated for inter-model and inter-user comparisons.
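The κ statistic used for the inter-model and inter-user comparisons can be sketched as follows. This is a minimal, stdlib-only illustration of Cohen's κ for two raters assigning categorical recommendations; the labels ("resection", "sEEG", "VNS") and the rater vectors are hypothetical, not data from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from each rater's marginals.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of cases where the two raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the marginal label frequencies of each rater
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical primary recommendations from an expert panel and a model
expert = ["resection", "sEEG", "resection", "VNS", "sEEG", "resection"]
model  = ["resection", "sEEG", "VNS", "VNS", "resection", "resection"]
print(round(cohens_kappa(expert, model), 2))  # → 0.48
```

In practice a library implementation (e.g. scikit-learn's `cohen_kappa_score`) would typically be used; the hand-rolled version above only makes the chance-correction explicit.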
Results:
Initial pilot testing (10 cases) compared general-purpose models (GPT-5, Gemini, Meta AI) with a domain-specific model (OpenEvidence). OpenEvidence achieved higher concordance (κ = 0.42 ± 0.22) than the general-purpose models. The focused GPT-5 analysis (30 vignettes) included exploratory inter-user comparisons between epilepsy expert-guided and novice-guided use. GPT-5 achieved identical overall concordance with EPCC consensus (56.7%) for both users, with moderate agreement between expert- and novice-guided GPT-5 outputs (κ = 0.54 ± 0.17, p < 0.01). However, both configurations demonstrated reduced confidence in lateralization, invasive-monitoring targeting, and ancillary test recommendations.
Conclusions:
Open-access LLMs can partially replicate multidisciplinary EPCC decision-making, achieving modest concordance across key domains. While not a substitute for expert consensus, LLMs may provide complementary decision support, particularly in community or resource-limited settings. The minimal inter-user differences observed with GPT-5 suggest practical potential for standardized use without extensive pre-training. Larger prospective validation is needed to determine clinical utility and safety.
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.