Diagnostic Accuracy of Multimodal LLMs in Differentiating Epileptic from Non-epileptic Events in Smartphone Recorded Videos
Anshum Patel1, Sai Krishna Vallamchetla1, Adrian Safa1, Caroline Tatit1, Alicia Kissinger-Knox1, Mark Roberts2, Olivia Bestic1, William Tatum1, Brin Freund1
1Neurology Department, Mayo Clinic, Florida, 2Department of Information Technology, Mayo Clinic, Rochester, MN
Objective:

We evaluated the accuracy of successive multimodal LLMs in differentiating epileptic from functional events in smartphone videos without clinical context.

Background:

Differentiating epileptic from functional seizures is a clinical challenge. While smartphone videos can aid diagnosis, they often require expert review, causing delays. Multimodal large language models (LLMs) may offer a solution, but their diagnostic performance is unstudied.
Design/Methods:

In this prospective diagnostic study at a single tertiary care epilepsy center, we collected 24 smartphone videos from 15 adult patients undergoing evaluation for paroxysmal events. Each video was independently analyzed by four successive multimodal LLMs (Gemini 1.5 Pro, 2.0 Flash, 2.5 Flash, and 2.5 Pro) using a standardized prompt, without access to clinical information. The primary outcome was diagnostic accuracy compared with a gold-standard diagnosis established by video-electroencephalography monitoring. Secondary outcomes included standard diagnostic metrics and an analysis of model-reported confidence scores.

Results:

Of the 24 events, 19 (79.2%) were epileptic and 5 (20.8%) were functional. Diagnostic accuracy improved with successive models: Gemini 1.5 Pro (33.3%), Gemini 2.0 Flash (25.0%), and both Gemini 2.5 Flash and Pro (54.2%). The accuracy of Gemini 2.5 Pro was significantly higher than that of Gemini 1.5 Pro (p=0.01) and Gemini 2.0 Flash (p=0.003). Performance was influenced by video features; for instance, diagnosis was more accurate when videos focused on the upper body/face compared to a whole-body view for Gemini 2.5 Flash (90.0% vs. 28.6%, p=0.004) and Gemini 2.5 Pro (80.0% vs. 35.7%, p=0.04). All models reported high confidence (median score, 8.0-9.0), but these scores were poorly calibrated and did not correlate with diagnostic correctness.

Conclusions:

Successive LLMs show improved yet modest accuracy for seizure classification from video alone, highlighting the need for domain-specific fine-tuning and rigorous validation before clinical implementation.

10.1212/WNL.0000000000215770
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.