A Machine Learning Approach for Identifying People with Neuroinfectious Diseases in Electronic Health Records
Arjun Singh1, Marta Fernandes1, Haoqi Sun3, Carson Quinn1, George Harrold4, Rebecca Gillani1, Sarah Turbett2, Sudeshna Das1, M. Westover3, Shibani Mukerji1
1Department of Neurology, 2Department of Medicine, Massachusetts General Hospital, 3Department of Neurology, Beth Israel Deaconess Medical Center, 4Department of Neurology, Brigham and Women's Hospital
Objective:

To develop an Automated Electronic Health Record (EHR) Phenotyping (AEP) model using machine learning (ML) to enhance neuroinfectious disease (NID) cohort selection.

Background:

Identifying NID cases using manual chart review or billing codes is time-consuming and suboptimal. EHR-based ML models for NID identification have yet to be explored.

Design/Methods:

Clinical notes from patients who underwent a lumbar puncture were obtained from the EHR of an academic hospital network; half of the patients had NID-related ICD-9/10 codes. Six physicians with NID expertise each manually reviewed 500 charts to generate the ground truth, and charts with uncertain diagnoses were discarded. Regular expressions were developed to match NID keywords, and the extracted text was converted into bag-of-words representations using unigrams through trigrams (1-3 n-grams). Notes were randomly split into training (80%) and hold-out testing (20%) sets. Feature selection was performed using a variance threshold of 0.075. An extreme gradient boosting (XGBoost) model classified NID cases, and performance was assessed on the testing set using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC); 95% confidence intervals (CI) were obtained using bootstrapping.
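
The abstract does not include code; the sketch below is a minimal illustration of the described pipeline (regex keyword extraction, 1-3 n-gram bag-of-words, variance-threshold feature selection, an 80/20 split, and XGBoost) in Python with scikit-learn and xgboost. The notes, labels, and keyword pattern are toy placeholders, not the study's regular expressions or data.

```python
# Illustrative sketch of the described pipeline: regex keyword extraction,
# 1-3 n-gram bag-of-words, variance-threshold feature selection, and XGBoost.
# The notes, labels, and keyword list below are toy placeholders, not study data.
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

notes = [
    "CSF profile consistent with viral meningitis; HSV PCR pending.",
    "Lumbar puncture performed for headache workup; CSF unremarkable.",
    "Concern for encephalitis given altered mental status and fever.",
    "Routine follow-up note with no infectious symptoms reported.",
    "RPR reactive and neurosyphilis suspected; CSF VDRL sent.",
    "Elective LP for idiopathic intracranial hypertension.",
    "Bacterial meningitis treated with ceftriaxone and vancomycin.",
    "Post-procedure note after uncomplicated lumbar puncture.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = confirmed NID, 0 = no NID (toy labels)

# Hypothetical keyword pattern standing in for the study's regular expressions.
nid_pattern = re.compile(
    r"[^.]*\b(meningitis|encephalitis|csf|viral|neurosyphilis)\b[^.]*\.",
    flags=re.IGNORECASE,
)

def extract_keyword_text(note: str) -> str:
    """Keep only sentences that contain an NID keyword."""
    return " ".join(m.group(0) for m in nid_pattern.finditer(note))

extracted = [extract_keyword_text(n) for n in notes]

# Bag-of-words with unigrams through trigrams (1-3 n-grams).
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(extracted)

# Feature selection with a variance threshold of 0.075, as in the abstract.
selector = VarianceThreshold(threshold=0.075)
X_sel = selector.fit_transform(X)

# 80% training / 20% hold-out testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, labels, test_size=0.2, stratify=labels, random_state=0
)

# Extreme gradient boosting classifier for NID case identification.
model = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
model.fit(X_train, y_train)
test_scores = model.predict_proba(X_test)[:, 1]
```
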
Results:

This cohort included 2,469 patients with 2,956 notes from January 2010 to September 2023. The mean (standard deviation) age was 58.2 (19.6) years; 55% were women, 77% were White, and 83% were non-Hispanic. A total of 15.9% (466/2,469) of patients were confirmed NID cases, and only 31.0% (452/1,460) of notes with NID-related ICD codes were deemed positive for an NID. Of the initial 289,377 features, 612 were selected; the most important features were “meningitis,” “encephalitis,” “CSF,” “viral,” and “neurosyphilis.” The XGBoost model classified NID cases with an AUROC of 0.95 (95% CI: 0.92-0.97) and an AUPRC of 0.80 (95% CI: 0.71-0.88).
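
Confidence intervals like those reported above can be obtained with a percentile bootstrap over the hold-out test set; the sketch below continues the pipeline sketch in Design/Methods, assuming its `y_test` labels and `test_scores` predicted probabilities. The 2,000-resample count is an illustrative choice, not taken from the study.

```python
# Illustrative percentile bootstrap of AUROC and AUPRC on the hold-out test set.
# `y_test` and `test_scores` are assumed to come from the pipeline sketch above.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def bootstrap_ci(y_true, y_score, metric, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for a ranking metric."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        # Skip resamples containing only one class, where the metric is undefined.
        if len(np.unique(y_true[idx])) < 2:
            continue
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_score), (lo, hi)

auroc, auroc_ci = bootstrap_ci(y_test, test_scores, roc_auc_score)
auprc, auprc_ci = bootstrap_ci(y_test, test_scores, average_precision_score)
print(f"AUROC {auroc:.2f} (95% CI {auroc_ci[0]:.2f}-{auroc_ci[1]:.2f})")
print(f"AUPRC {auprc:.2f} (95% CI {auprc_ci[0]:.2f}-{auprc_ci[1]:.2f})")
```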

Conclusions:

Our ML-driven AEP model accurately identifies NID cases from clinical notes, enhancing the efficiency of NID research and cohort generation. Future studies incorporating multiple EHR systems are needed to assess generalizability.

10.1212/WNL.0000000000204782