To evaluate the accuracy of artificial intelligence (AI; GPT-4) in localizing lesions from clinical presentation.
Large language models such as Generative Pre-trained Transformers (GPTs), including the current-generation GPT-4, have shown potential for healthcare applications.
The history and neurological physical examination (H&P) from 47 published cases were processed by GPT-4 to localize stroke lesions. The prompt was constructed using Zero-Shot Chain-of-Thought and Text Classification prompting. GPT-4's outputs from three separate trials per case were compared with the imaging-based lesion localization. Accuracy, specificity, sensitivity, precision, and F1-score were calculated for three measures: 'single or multiple lesions', 'side', and 'brain region'.
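The five metrics named above can be derived from one-vs-rest confusion counts for each class label. The following is a minimal sketch of that calculation, not the study's actual analysis code: the function name, the 'left'/'right' labels, the toy predictions, and the choice of macro-averaging across classes are all assumptions made for illustration.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def one_vs_rest_metrics(y_true, y_pred, labels):
    """Compute per-class and macro-averaged metrics.

    Each label is treated in turn as the positive class and all other
    labels as negative. Macro-averaging is an assumption of this sketch;
    the source does not state how metrics were aggregated.
    """
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        tn = sum(1 for t, p in pairs if t != label and p != label)
        per_class[label] = {
            "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
            "sensitivity": safe_div(tp, tp + fn),
            "specificity": safe_div(tn, tn + fp),
            "precision": safe_div(tp, tp + fp),
            "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        }

    metric_names = ["accuracy", "sensitivity", "specificity", "precision", "f1"]
    macro = {
        name: sum(scores[name] for scores in per_class.values()) / len(labels)
        for name in metric_names
    }
    return per_class, macro


# Hypothetical toy data: imaging-based labels versus model outputs.
imaging_labels = ["left", "left", "right", "right", "left", "right"]
model_outputs = ["left", "right", "right", "right", "left", "left"]

per_class, macro = one_vs_rest_metrics(
    imaging_labels, model_outputs, labels=["left", "right"]
)
print(per_class)
print(macro)
```

Under this sketch, each of the three trials per case would be entered as a separate prediction paired with the same imaging-based label, which is again an assumption about how repeated trials are pooled.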
GPT-4 successfully processed raw text from the H&P to generate correctly formatted neuroanatomical localizations along with clinical reasoning for individual neurological findings. Accuracy, specificity, sensitivity, precision, and F1-score were all 0.72 for 'single or multiple lesions'; 0.82, 0.87, 0.72, 0.73, and 0.73, respectively, for 'side'; and 0.94, 0.96, 0.82, 0.81, and 0.82, respectively, for 'brain region'. Within 'brain region', F1-scores for the individual class labels cerebral hemisphere, brainstem, cerebellum, cervical spinal cord, and thoracic spinal cord were 0.90, 0.83, 0.38, 0.74, and 0.75, respectively. Qualitative error analysis showed that 'brain region' errors were primarily due to confounding symptoms (35%), while 'side' errors arose largely from insufficient case description (45%) and inaccurate anatomical knowledge (18%).
GPT-4 can interpret H&P to localize acute stroke lesions. As a potential clinical tool, it could play a role in the diagnosis and localization of movement disorders and other disorders that cannot easily be identified by imaging. Further work will allow us to fine-tune GPT-4 and improve its functionality by using structured input data rather than verbatim case reports.