Diagnostic Accuracy of Multimodal LLMs in Differentiating Epileptic from Non-epileptic Events in Smartphone Recorded Videos
Anshum Patel1, Sai Krishna Vallamchetla1, Adrian Safa1, Caroline Tatit1, Alicia Kissinger-Knox1, Mark Roberts2, Olivia Bestic1, William Tatum1, Brin Freund1
1Neurology Department, Mayo Clinic, Florida, 2Department of Information Technology, Mayo Clinic, Rochester, MN
Objective:

We evaluated the accuracy of successive multimodal LLMs in differentiating epileptic from functional events in smartphone videos without clinical context.

Background:

Differentiating epileptic from functional seizures is a clinical challenge. While smartphone videos can aid diagnosis, they often require expert review, causing delays. Multimodal large language models (LLMs) may offer a solution, but their diagnostic performance is unstudied.
Design/Methods:

In this prospective diagnostic study at a single tertiary care epilepsy center, we collected 24 smartphone videos from 15 adult patients undergoing evaluation for paroxysmal events. Each video was independently analyzed by four successive multimodal LLMs (Gemini 1.5 Pro, 2.0 Flash, 2.5 Flash, and 2.5 Pro) using a standardized prompt, without access to clinical information. The primary outcome was diagnostic accuracy compared with a gold-standard diagnosis established by video-electroencephalography monitoring. Secondary outcomes included standard diagnostic metrics and an analysis of model-reported confidence scores.

Results:

Of the 24 events, 19 (79.2%) were epileptic and 5 (20.8%) were functional. Diagnostic accuracy improved with successive models: Gemini 1.5 Pro (33.3%), Gemini 2.0 Flash (25.0%), and both Gemini 2.5 Flash and Pro (54.2%). The accuracy of Gemini 2.5 Pro was significantly higher than that of Gemini 1.5 Pro (p=0.01) and Gemini 2.0 Flash (p=0.003). Performance was influenced by video features; for instance, diagnosis was more accurate when videos focused on the upper body/face compared to a whole-body view for Gemini 2.5 Flash (90.0% vs. 28.6%, p=0.004) and Gemini 2.5 Pro (80.0% vs. 35.7%, p=0.04). All models reported high confidence (median score, 8.0-9.0), but these scores were poorly calibrated and did not correlate with diagnostic correctness.

Conclusions:

Successive LLMs show improved yet modest accuracy for seizure classification from video alone, highlighting the need for domain-specific fine-tuning and rigorous validation before clinical implementation.

10.1212/WNL.0000000000215770
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.