Benchmarking Large Language Models for Neurological Imaging Interpretation Using a Multiple Sclerosis Lesion Segmentation Dataset
Isheeta Gupta1, Vishrut Thaker2
1Washington University in St. Louis, 2Emory University
Objective:

To comprehensively benchmark the ability of general-purpose LLMs to interpret structured MRI lesion data from a standardized multiple sclerosis (MS) lesion segmentation dataset (MSLesSeg), including lesion description and clinical interpretation.

Background:

While some models (e.g., GPT-4) have exhibited impressive performance on clinical reasoning tasks, the ability of LLMs to interpret complex lesion data derived from neuroimaging has not yet been comprehensively investigated. MSLesSeg is a standardized, expert-annotated MS lesion segmentation dataset that includes lesion metadata for 75 cases of MS, and it could serve as an ideal basis for benchmarking how current LLMs handle structured neurological imaging information.

Design/Methods:

Three widely used general-purpose LLMs (GPT, Claude, and Gemini) will be evaluated using standardized text prompts generated from MSLesSeg. Each case will include structured lesion data: volume, count, and anatomical distribution across periventricular, juxtacortical, infratentorial, and spinal regions. Models will be tasked with (1) classifying lesion patterns as typical or atypical for MS and (2) generating structured radiology-style lesion descriptions. Evaluation will include accuracy and F1 scores for the classification task, as well as hallucination/error rate analysis. Intra-model consistency across repeated prompts will also be examined.
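
As an illustration only, the prompt construction and the classification metrics described above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the study's actual pipeline: the field names (lesion_count, volume_ml, region_counts), the prompt wording, the choice of "atypical" as the positive class, and the definition of consistency as agreement with the most frequent answer are all hypothetical.

```python
from collections import Counter


def build_prompt(case: dict) -> str:
    """Render one case's structured lesion data as a text prompt.

    The keys used here are hypothetical placeholders, not MSLesSeg field names.
    """
    regions = ", ".join(f"{name}: {n}" for name, n in case["region_counts"].items())
    return (
        f"Total lesion count: {case['lesion_count']}. "
        f"Total lesion volume: {case['volume_ml']} mL. "
        f"Lesions by region: {regions}. "
        "Is this lesion pattern typical or atypical for MS? Answer with one word."
    )


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of cases where the model label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: list, y_pred: list, positive: str = "atypical") -> float:
    """Binary F1 for the class named by `positive` (an assumed choice)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def consistency(repeated_answers: list) -> float:
    """Share of repeated runs on one case that agree with the most frequent answer."""
    counts = Counter(repeated_answers)
    return counts.most_common(1)[0][1] / len(repeated_answers)
```

In this sketch, consistency would be computed per case from the answers a single model gives to the same prompt across repeated runs, then averaged over cases; other agreement measures would serve equally well.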


Results:

We anticipate differences in performance across models, with higher accuracy for typical MS lesion patterns than for atypical or complex cases. Hallucination rates are expected to be nontrivial, particularly for infratentorial lesions. The analysis, which is ongoing, will provide comparative benchmarking data on model reliability, consistency, and interpretability. Results will be presented at the Annual Meeting.

Conclusions:
This study will establish one of the first benchmarks for evaluating general-purpose LLMs in the context of structured neurological imaging data. By leveraging a publicly available, expert-annotated MS lesion segmentation dataset, this work aims to provide actionable insights into the capabilities and limitations of current LLMs in clinical neuroimaging interpretation.
10.1212/WNL.0000000000217170
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.