Comparison of Machine Learning Architectures for Acoustic Analysis of Vowel Articulation Among Parkinson’s Disease Patients
Kotaro Tsutsumi1, Peter Chang2, Sanaz Attaripour1
1Department of Neurology, 2Department of Radiological Sciences, University of California, Irvine
Objective:
Our aim was to develop and compare machine learning (ML) algorithms for the identification of Parkinson’s disease (PD) patients via acoustic analysis of vowel articulation.
Background:
Vocal impairments are common among PD patients and are characterized by hypokinetic dysarthria, hypophonia, monotony, and hoarseness. We hypothesize that these acoustic biomarkers can be leveraged by ML models to differentiate the voices of PD patients from those of controls.
Design/Methods:
Two public datasets of PD speech recordings, NeuroVoz and the Italian Parkinson’s Voice/Speech dataset, were used. Vowel articulation tasks were pooled across datasets. Each recording was truncated to 4 seconds, resampled to 16 kHz, and augmented with time and pitch transformations. Both raw WAV files and log-mel spectrograms (LMS) were used as model inputs to compare relative performance. Three modeling approaches were evaluated: ResNet18, a convolutional neural network (CNN)-based architecture; HuBERT, a mixed CNN-Transformer architecture; and the Audio Spectrogram Transformer (AST). Raw WAV files served as input to the HuBERT model, while LMS-transformed data served as input to the CNN and AST models. Performance was evaluated by aggregating metrics over five-fold cross-validation.
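For illustration, a minimal sketch of the preprocessing described above (truncate to 4 seconds, resample to 16 kHz, convert to a log-mel spectrogram) using torchaudio. This is not the study code; the mel-spectrogram parameters (n_fft, hop_length, n_mels) and the zero-padding of short clips are assumptions not stated in the abstract.

```python
# Illustrative preprocessing sketch (not the authors' pipeline).
import torch
import torchaudio

TARGET_SR = 16_000    # target sample rate (from Methods)
CLIP_SECONDS = 4      # recordings truncated to 4 seconds (from Methods)

def wav_to_log_mel(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)             # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)    # collapse to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    n_samples = TARGET_SR * CLIP_SECONDS
    if waveform.shape[1] > n_samples:                # truncate to 4 s
        waveform = waveform[:, :n_samples]
    else:                                            # zero-pad shorter clips (assumption)
        waveform = torch.nn.functional.pad(waveform, (0, n_samples - waveform.shape[1]))
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=TARGET_SR, n_fft=400, hop_length=160, n_mels=64  # assumed values
    )(waveform)
    return torch.log(mel + 1e-6)                     # log-mel spectrogram (LMS)
```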
Results:
A total of 150 patients (44% female) with a mean age of 67.1±9.87 years were included; 74 (49.3%) had a diagnosis of PD. The HuBERT model demonstrated an average area under the receiver operating characteristic curve (AUC-ROC) of 0.733±0.0429 and accuracy of 0.621±0.0713. The CNN model demonstrated an AUC-ROC of 0.839±0.0501 and accuracy of 0.750±0.0454. The AST demonstrated an AUC-ROC of 0.820±0.0548 and accuracy of 0.752±0.0622.
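As a sketch of how such per-fold metrics could be aggregated into the mean ± SD values reported above, the snippet below uses scikit-learn; the fold-level labels and scores are placeholders, not study data, and the 0.5 decision threshold is an assumption.

```python
# Illustrative aggregation of per-fold metrics (mean ± SD); not the study code.
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def summarize_folds(fold_results):
    """fold_results: list of (y_true, y_score) pairs, one per cross-validation fold."""
    aucs = [roc_auc_score(y, s) for y, s in fold_results]
    accs = [accuracy_score(y, (np.asarray(s) >= 0.5).astype(int)) for y, s in fold_results]
    return {
        "auc_mean": float(np.mean(aucs)), "auc_sd": float(np.std(aucs)),
        "acc_mean": float(np.mean(accs)), "acc_sd": float(np.std(accs)),
    }
```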
Conclusions:
ML algorithms are capable of accurately distinguishing PD patients from controls based on vowel articulation. In our cohort, LMS-transformed data yielded better performance than raw WAV files, and CNN-based models yielded better performance than Transformer-based models. Future work will include larger samples and explore methods to enhance the explainability of these models.
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.