Using Natural Language Processing to identify patients with Multiple Sclerosis
Dhruv Jimulia1, Chiraag Lala1, Richard Nicholas2, Rod Middleton3
1Computing, Imperial College London, 2Imperial College Healthcare Trust, 3Swansea University
Objective:
We used free-text comments from social media sources and the UK MS Register (UKMSR) to assess whether people with MS can be identified using natural language processing (NLP).
Background:
Multiple sclerosis (MS) has a wide range of impacts, so symptoms are often dismissed in the earliest disease phase, potentially delaying diagnosis. Predictive classifiers built with natural language processing (NLP) are a potential aid to the early identification of symptoms across a range of conditions. These classifiers take descriptions of patient experience, in text format, as input. Identifying people with MS symptoms earlier could enable them to be directed to the correct resources to confirm a diagnosis.
Design/Methods:
Classifiers were trained on social media text combined with data from the UKMSR, a database of people clinically diagnosed with MS that collects free-text entries about their individual experiences. Two control groups were identified: generic social media users and people with self-reported diagnoses of diabetes on social media. Words related to the diagnosis and relevant treatments were removed so that classifiers could not rely on them. We explored three ways to represent text as word vectors and trained seven different machine learning models, including logistic regression, k-nearest neighbors, Naïve Bayes and deep learning approaches.
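As an illustration only, the pipeline described above (removing diagnosis-related words, representing text as word counts, training a classifier) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the term list `LEAKY_TERMS`, the function names, and the four toy comments are all hypothetical, and a plain bag-of-words multinomial Naive Bayes stands in for the three text representations and seven models that the abstract does not specify in detail.

```python
import math
import re
from collections import Counter

# Hypothetical list of diagnosis and treatment terms to remove;
# the actual list used in the study is not given in the abstract.
LEAKY_TERMS = {"ms", "multiple", "sclerosis", "diabetes", "diabetic", "insulin"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop leaky terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in LEAKY_TERMS]


def train_naive_bayes(texts, labels):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    word_counts = {}
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return {"word_counts": word_counts, "class_counts": class_counts, "vocab": vocab}


def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy comments, for illustration only (not study data).
train_texts = [
    "My legs feel numb and I get tired after my MS relapse",
    "Blurred vision and tingling in my hands again today",
    "Watched the football match with friends last night",
    "Great weather for a picnic in the park today",
]
train_labels = ["ms", "ms", "control", "control"]

model = train_naive_bayes(train_texts, train_labels)
prediction = predict(model, "tingling and numb legs this morning")
print(prediction)
```

Removing the leaky terms before counting means the classifier has to rely on how experiences are described rather than on the name of the condition or its treatments.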
Results:
We gathered 6021 comments from 3037 patients with MS, 6021 comments from 3037 diabetes patients and 6021 comments from 3037 generic social media users. Across the seven models, the maximum accuracy was 94.3% when classifying experiences of MS patients against generic social media users and 92.4% against diabetes patients.
Conclusions:
The results suggest that the NLP-based methods used here can identify people with MS with high accuracy. Further work on more datasets and conditions is required to confirm the utility of these findings, but they have the potential to help accelerate the diagnosis of MS.