Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison
Chia-Chun Chiang1, Alejandro Lozano2, Keiko Ihara1, Ping-Hao Yang1, Carrie Robertson1, Jennifer Stern1, Allan Purdy3, Hsiangkuo Yuan4, Pengfei Zhang5, Yulia Orlova1, Olga Fermo1, Jennifer Hranilovich6, Fred Cohen7, Todd Schwedt1, Serena Yeung-Levy2
1Mayo Clinic, 2Stanford University, 3Dalhousie University, 4Jefferson Headache Center, 5Beth Israel Deaconess Medical Center, 6Children's Hospital Colorado, 7Icahn School of Medicine at Mount Sinai
Objective:

To develop and evaluate retrieval-augmented large language models (LLMs) for summarizing headache medicine literature, compare their outputs with expert-written summaries, and identify key factors experts value in high-quality summaries.

Background:

Summarizing medical literature is essential to guide clinical decision-making. While LLMs show promise for literature summarization, detailed expert evaluations of such summaries in specific domains remain limited.

Design/Methods:

We constructed a retrieval-augmented generation (RAG) framework using three state-of-the-art LLMs (Sonnet, GPT-4o, and Llama-3.1) for medical literature summarization. A headache specialist created 13 questions: three for prompt optimization and ten for evaluation. Ten headache specialists from different institutions each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 with standardized rubrics provided. They also ranked the summaries by preference and indicated which they believed were written by an expert versus an LLM.
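The RAG workflow described above can be illustrated with a minimal sketch: retrieve the most relevant documents for a clinical question, then assemble them with the question into a summarization prompt for an LLM. All function names, the toy word-overlap retriever, and the example corpus below are hypothetical stand-ins; the study's actual retriever, corpus, and model calls are not specified in this abstract.

```python
# Hypothetical sketch of a RAG summarization pipeline.
# A real system would use an embedding-based retriever over a medical
# literature corpus and send the prompt to an LLM (e.g., Sonnet,
# GPT-4o, or Llama-3.1); here retrieval is a toy word-overlap ranking.

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank documents by word overlap with the question (stand-in
    for a real embedding-based retriever) and return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(question: str, docs: list[str]) -> str:
    """Assemble the retrieved context and the clinical question into a
    single summarization prompt."""
    context = "\n\n".join(f"[{i+1}] {d}" for i, d in enumerate(docs))
    return (
        "Summarize the evidence below to answer the clinical question.\n\n"
        f"Question: {question}\n\nEvidence:\n{context}"
    )

# Toy example corpus (illustrative sentences, not study data).
corpus = [
    "CGRP monoclonal antibodies reduce monthly migraine days.",
    "Triptans are contraindicated in some cardiovascular disease.",
    "Lifestyle modification supports migraine prevention.",
]
question = "Which preventive treatments reduce migraine days?"
docs = retrieve(question, corpus, k=2)
prompt = build_prompt(question, docs)
```

Prompt optimization, as in the study's three development questions, would iterate on the instruction text in `build_prompt` before fixing it for the evaluation set.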

Results:

Fifty evaluations from ten headache specialists were analyzed. Overall, expert-written summaries scored the highest on all evaluation metrics and were the most preferred responses, followed by Sonnet, GPT-4o, and Llama-3.1. Pairwise expert-versus-AI comparisons showed that experts outperformed GPT-4o and Llama on most metrics, with no significant difference compared with Sonnet. Experts correctly distinguished expert- from AI-written summaries 64% (32/50) of the time. Beyond the metrics used, experts valued the quality of references, the inclusion of key clinical details such as medication dosage, synthesis across sources, and the incorporation of clinical nuance and experience.

Conclusions:

Our study, in which headache specialists evaluated LLM- and expert-written literature summaries, showed that expert-written summaries were preferred, though experts found it challenging to distinguish human- from AI-generated ones. We also identified key expert-valued features, beyond the standard evaluation metrics, that can guide future refinement of both human and AI literature summarization pipelines.

10.1212/WNL.0000000000215849
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.