Large Language Models to Infer Depression in Patients with Neurological Conditions: Applications and Limitations
Maya Julian-Kwong1, Shane Poole1, Kyra Henderson1, Jaeleene Wijangco1, Nikki Sisodia2, Chu-Yueh Guo3, Riley Bove1
1University of California, San Francisco, 2University of California San Francisco, 3UCSF Medical Center
Objective:
To develop a large language model (LLM) prompt capable of inferring a patient’s depression category from a single clinical note written by their multiple sclerosis (MS) neurologist.

Background:
LLMs are proving invaluable in organizing electronic health records (EHRs) by facilitating the extraction of discrete variables from unstructured data. In the care of individuals with chronic neurological conditions such as MS, the timely recognition and treatment of depression represents a vexing problem, which could be supported by LLM applications.
Design/Methods:
This single-center retrospective study compared three depression indicators from single routine neurology visits (n=278 adults with MS): (1) a patient-reported outcome [PRO: Hospital Anxiety and Depression Scale (HADS-D) or Patient Health Questionnaire-9 (PHQ-9)], categorized by severity; (2) a manual annotation of the neurologist’s impression (depression: present, absent, or no mention); and (3) an LLM prompt, developed using an institutionally secure version of ChatGPT-4 and refined over five successive rounds, to detect depression in the neurologist’s note.
Results:
Overall, the LLM prompt detected depression in 60.4% of notes (168/278). Compared with the neurologist’s impression, accuracy was 84.4% (sensitivity 97.3%; specificity 68.3%; positive predictive value 79.3%; negative predictive value 95.3%). The major discrepancy arose from the prompt’s additional inference of prior/treated depression (based on associated symptoms, history, and medications) even without explicit mention by the neurologist. When the prompt and the neurologist’s impression disagreed, the prompt tended to match the PRO (61.9%). The neurologist’s impression of depression was “absent” or “no mention” in 59.1% (91/154) of visits with PROs indicating depression.
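The performance metrics above follow the standard confusion-matrix definitions. A minimal sketch of those definitions (the counts used here are purely hypothetical for illustration; the abstract reports only percentages, not the underlying 2x2 counts):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn = true positives, false positives, true negatives,
    false negatives of the classifier against the reference standard
    (here, an LLM prompt judged against the neurologist's impression).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,        # overall agreement
        "sensitivity": tp / (tp + fn),        # recall among true cases
        "specificity": tn / (tn + fp),        # recall among non-cases
        "ppv": tp / (tp + fp),                # positive predictive value
        "npv": tn / (tn + fn),                # negative predictive value
    }

# Hypothetical example counts (not from the study):
m = classification_metrics(tp=90, fp=10, tn=80, fn=5)
```
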
Conclusions:
The LLM depression prompt was highly sensitive to a neurologist’s documentation of depression in unstructured, under-utilized clinical notes; it additionally inferred prior/treated depression from other note components. Potential applications include quality improvement initiatives aiming to improve depression care at the cohort level. Discrepancies observed between neurologists’ notes and patient self-reports underscore the importance of systematic assessment and documentation.
10.1212/WNL.0000000000212379
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.