To develop and evaluate retrieval-augmented large language models (LLMs) for summarizing headache medicine literature, compare their outputs with expert-written summaries, and identify key factors experts value in high-quality summaries.
Summarizing medical literature is essential to guide clinical decision-making. While LLMs show promise for literature summarization, detailed expert evaluations of such summaries in specific domains remain limited.
We constructed a retrieval-augmented generation (RAG) framework using three state-of-the-art LLMs (Sonnet, GPT-4o, and Llama-3.1) for medical literature summarization. A headache specialist created 13 questions: three for prompt optimization and ten for evaluation. Ten headache specialists from different institutions each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama-3.1). The experts, blinded to authorship, critically evaluated the summaries on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated which they believed was written by an expert versus an LLM.
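For illustration, a minimal sketch of one retrieval-augmented summarization step is shown below, assuming a simple TF-IDF retriever over a local corpus of article abstracts; the study's actual retrieval method, corpus, prompt wording, and model clients are not described here, and `query_llm` is a hypothetical placeholder for a call to Sonnet, GPT-4o, or Llama-3.1.

```python
# Minimal RAG summarization sketch (illustrative assumptions, not the study's pipeline):
# retrieve the most relevant abstracts for a clinical question, assemble a grounded
# prompt, and hand it to an LLM client.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve(question: str, abstracts: list[str], k: int = 5) -> list[str]:
    """Return the k abstracts most similar to the question (TF-IDF cosine similarity)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(abstracts + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [abstracts[i] for i in top]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a summarization prompt grounded only in the retrieved sources."""
    sources = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(context))
    return (
        "Using only the sources below, write a concise, clinically useful summary "
        f"answering the question.\n\nQuestion: {question}\n\nSources:\n{sources}"
    )


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to one of the evaluated LLMs."""
    raise NotImplementedError("swap in a Sonnet / GPT-4o / Llama-3.1 client here")


def summarize(question: str, abstracts: list[str]) -> str:
    """One retrieval-augmented summarization pass for a single clinical question."""
    context = retrieve(question, abstracts)
    return query_llm(build_prompt(question, context))
```

In a setup like this, the same retrieved context and prompt template would be reused across models so that score differences reflect the models rather than the retrieval step.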
Fifty evaluations from ten headache specialists were analyzed. Overall, expert-written summaries scored highest on all evaluation metrics and were the most preferred responses, followed by Sonnet, GPT-4o, and Llama-3.1. Pairwise expert vs AI comparisons showed that experts outperformed GPT-4o and Llama-3.1 on most metrics, with no significant difference relative to Sonnet. Experts correctly distinguished expert- from AI-written summaries 64% (32/50) of the time. Beyond the metrics used, experts valued the quality of references, the inclusion of key clinical details such as medication dosage, synthesis across sources, and the incorporation of clinical nuance and experience.
Our study, in which headache specialists evaluated LLM- and expert-written literature summaries, showed that expert-written summaries were preferred, although experts found it challenging to distinguish human- from AI-generated ones. We also identified key expert-valued features, beyond standard evaluation metrics, that can guide future refinement of human and AI literature summarization pipelines.