Evaluating the Accuracy of Artificial Intelligence (AI) Medical Scribe Software Utilization in Telestroke
Lana Prieur1, Mark McDonald2, Theresa Sevilis3
1University of North Carolina at Chapel Hill School of Medicine, 2TeleSpecialists, 3Telespecialists, LLC
Objective:
To measure the performance of an artificial intelligence (AI) medical scribe in telestroke encounters using the Medical Concept Word Error Rate (MC-WER).
Background:
AI scribes are increasingly used to reduce physicians' documentation burden in the outpatient setting. However, there is no consensus on the best metrics for evaluating the accuracy and risks of AI scribes. The current study evaluates the Medical Concept Word Error Rate (MC-WER) as a measure of the performance of an AI scribe in automating documentation of physician-patient encounters in the hospital setting.
Design/Methods:
Transcripts from 25 telestroke patient encounters were de-identified and then processed by an OpenAI large language model (LLM). The History of Present Illness (HPI) sections generated by the AI scribe were compared with those written by the provider. Medical Concept Word Omission Rates and Medical Concept Word Addition Rates were calculated. Additionally, the likely diagnoses based on review of the AI scribe-generated and provider-generated HPIs were compared, with the likely diagnosis from the provider HPI serving as the ground truth. AI scribe HPIs that led to an incorrect diagnosis were then compared with AI scribe HPIs that led to the correct diagnosis.
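The abstract does not give the formulas for the omission and addition rates, so the following is only an illustrative sketch under stated assumptions: medical concepts have already been extracted from each HPI as short phrases, the provider HPI is the reference, and both rates are normalized by the number of reference concepts (a WER-style convention). The function name `concept_rates` and the example concepts are hypothetical and are not taken from the study.

```python
def concept_rates(provider_concepts, scribe_concepts):
    """Return (omission_rate, addition_rate) for one encounter.

    Assumption: both rates are divided by the number of unique concepts
    in the provider (reference) HPI. The study may define them differently.
    """
    # Normalize case and whitespace so trivially different strings match.
    reference = {c.strip().lower() for c in provider_concepts}
    hypothesis = {c.strip().lower() for c in scribe_concepts}
    if not reference:
        raise ValueError("provider HPI must contain at least one concept")

    omitted = reference - hypothesis  # in provider HPI, missing from AI HPI
    added = hypothesis - reference    # in AI HPI, absent from provider HPI
    return len(omitted) / len(reference), len(added) / len(reference)


# Hypothetical concepts for a single encounter (not study data).
provider = ["left arm weakness", "slurred speech",
            "last known well 0800", "atrial fibrillation"]
scribe = ["left arm weakness", "slurred speech", "headache"]

omission, addition = concept_rates(provider, scribe)
print(f"omission rate = {omission:.2f}, addition rate = {addition:.2f}")
```

Exact set matching is the simplest possible comparison; in practice, synonyms and paraphrases would likely need clinical review or a terminology mapping before concepts are compared.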
Results:
Five of the 25 AI scribe-generated HPIs (20%) led to a different diagnosis than the provider-generated HPI for the corresponding case. AI scribe-generated HPIs that resulted in an incorrect diagnosis had a significantly higher Medical Concept Omission Rate than HPIs that led to the correct diagnosis (p=0.0036). There was no difference in the Medical Concept Addition Rate between groups.
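The abstract does not name the statistical test behind the reported p-value. As one possible way to compare omission rates between two small groups, the sketch below runs an exact two-sided permutation test on the difference in group means. The choice of test, the function name `exact_permutation_p`, and the rate values are all assumptions for illustration; the numbers are invented and are not study data.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Enumerates every way of assigning len(group_a) of the pooled values
    to group A and counts assignments whose absolute mean difference is
    at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(pooled[:n_a]))
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        # Small tolerance guards against floating-point ties.
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / count


# Invented omission rates for illustration only (not study data).
incorrect_dx = [0.6, 0.5, 0.7]
correct_dx = [0.1, 0.2, 0.3, 0.2]

p_value = exact_permutation_p(incorrect_dx, correct_dx)
print(f"p = {p_value:.4f}")
```

Full enumeration is practical at the study's scale: splitting 25 encounters into groups of 5 and 20 gives 53,130 assignments.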
Conclusions:
This study demonstrates the feasibility of using an AI scribe for acute telestroke encounters and provides a framework for assessing the accuracy of the tool. The Medical Concept Omission Rate may be an important metric when evaluating the accuracy and risks of AI scribes. Further research is needed to validate these preliminary data and to develop the ideal AI scribe for telestroke encounters.
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.