COPE: Chain-of-thought Outcome Prediction Engine for Open-source Large Language Model-Based Stroke Outcome Prediction from Clinical Notes
Yongkai Liu1, Helena Feng1, Bin Jiang1, Max Wintermark3, David Liebeskind4, Michael Moseley1, Maarten Lansberg2, Gregory Albers2, Jeremy Heit1, Greg Zaharchuk1
1Department of Radiology, 2Department of Neurology, Stanford University, 3Department of Neuroradiology, University of Texas MD Anderson Cancer Center, 4Neurovascular Imaging Research Core at UCLA
Objective:
To evaluate a Chain-of-thought Outcome Prediction Engine (COPE)—a reasoning-enhanced LLM framework—for predicting acute ischemic stroke (AIS) outcomes, and to compare its performance with traditional ML models, Clinical BERT, and GPT-4.1.
Background:
Predicting patient outcomes is central to personalized care. While clinical notes such as discharge summaries contain rich context, their unstructured format limits use in traditional models. Advances in large language models (LLMs) offer new ways to harness this data.
Design/Methods:
We analyzed 464 AIS patients from Stanford University Hospital (2010–2023) with discharge summaries and 90-day modified Rankin Scale (mRS, 0–6) outcomes. COPE uses a two-step Chain-of-Thought (CoT) framework with sequential LLaMA-3–8B models: the first generates a clinical reasoning rationale from the discharge summary, and the second predicts the mRS score from that rationale. Performance was compared with Clinical BERT, a support vector machine trained on structured clinical variables (Clinical SVM), and GPT-4.1. Ablation studies assessed the impact of the reasoning component and of individual discharge-summary sections. Model performance was evaluated by mean absolute error (MAE), percentage within 1 mRS point (±1 ACC), and exact accuracy (ACC).
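The two-step pipeline can be sketched as follows. This is a minimal illustration, not the study code: `query_llm` is a hypothetical stand-in for a locally hosted LLaMA-3–8B call (e.g., via Hugging Face transformers), stubbed here with canned responses so the control flow runs end to end, and the prompt wording and output format are assumptions.

```python
# Minimal sketch of a two-step chain-of-thought prediction pipeline.
# `query_llm` is a hypothetical stand-in for a local LLaMA-3-8B call;
# it is stubbed with fixed responses so the example is self-contained.
import re

def query_llm(prompt: str) -> str:
    # Stub: a real deployment would invoke a locally hosted LLM here.
    if "Reason step by step" in prompt:
        return ("Discharged home on aspirin with mild residual weakness, "
                "suggesting slight disability at 90 days.")
    return "Predicted mRS: 2"

def predict_mrs(discharge_summary: str) -> int:
    # Step 1: a reasoning model produces a chain-of-thought rationale.
    rationale = query_llm(
        "Reason step by step about this patient's likely 90-day outcome.\n"
        f"Discharge summary:\n{discharge_summary}"
    )
    # Step 2: a second model maps the rationale to a 90-day mRS score (0-6).
    answer = query_llm(
        "Given the reasoning below, output 'Predicted mRS: <0-6>'.\n"
        f"Reasoning:\n{rationale}"
    )
    match = re.search(r"Predicted mRS:\s*([0-6])", answer)
    if match is None:
        raise ValueError(f"Unparseable model output: {answer!r}")
    return int(match.group(1))

print(predict_mrs("Patient discharged home; mild left-sided weakness."))
```

Keeping the reasoning and scoring steps in separate model calls makes the rationale an explicit, inspectable intermediate output rather than hidden state.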
Results:
COPE achieved an MAE of 1.00 (95% CI: 0.91–1.08), ±1 ACC of 75% (71–79%), and exact ACC of 33% (29–38%), matching GPT-4.1 [MAE: 1.00 (0.91–1.08), ±1 ACC: 78% (74–82%), ACC: 32% (27–36%); p = 1.00, 0.17, 0.62] and outperforming Clinical BERT [MAE: 1.28 (1.17–1.38), ±1 ACC: 62% (58–67%), ACC: 28% (24–32%); p < 0.001, p < 0.001, p = 0.05] and Clinical SVM [MAE: 1.28 (1.18–1.38), ±1 ACC: 61% (56–66%), ACC: 27% (23–31%); p < 0.001, p < 0.001, p = 0.03]. It also surpassed its non-reasoning variant [MAE: 1.28 (1.19–1.38), ±1 ACC: 64% (60–69%), ACC: 23% (19–28%)]. Text ablation showed the largest performance drop when the Medications and Discharge & Follow-up Summary sections were removed.
Conclusions:
COPE, a reasoning-enhanced dual-LLM framework, matched GPT-4.1 and outperformed traditional models, offering an accurate, interpretable, privacy-preserving approach to stroke outcome prediction from unstructured text.
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.