Before You Judge Someone's Work: Accuracy and Limitations of AI-generated Text Detectors in Academic Neurology
Sai Krishna Vallamchetla1
1Neurology, Mayo Clinic, Florida, USA
Objective:
To review the performance and limitations of artificial intelligence (AI) text detectors in the context of academic neurology.
Background:
The release of large language models like ChatGPT in late 2022 led to unprecedented use of AI in academic writing. In neurology education and research, this surge raised concerns about undisclosed AI-generated content in residency personal statements and manuscripts. Medical editors and organizations have called for tools to detect AI-written text as a means to uphold academic integrity and authenticity.
Design/Methods:
A literature search was conducted focusing on studies and reports (post-2022) about AI-generated text detection. Data were extracted on the prevalence of AI-generated content in academic submissions and on AI-detector tool performance. Evidence of detector accuracy (true/false positives and negatives) and examples of evasion or misclassification were compiled, with particular attention to the medical context.
Results:
AI-generated content in academic writing has increased notably since 2022. For example, one analysis found that 2023 conference abstracts had roughly twice the odds of containing AI-generated text compared to 2021. Detection tool accuracy varies widely. Some detectors achieved high specificity (~90%) but moderate sensitivity (~65%), translating to frequent false negatives (AI-written text missed) and occasional false positives (human text flagged). Tools relying on linguistic "perplexity" were easily thwarted by paraphrasing or slight stylistic changes. Notably, detectors have erroneously flagged authentic writing; GPTZero, for instance, labeled the U.S. Constitution as AI-generated. Bias against non-native English writing was also evident: in one study, over half of essays by non-native English writers were misclassified as AI-generated. Such vulnerabilities underscore that no current detector is foolproof in isolation.
Conclusions:
AI text detectors can assist in identifying possible AI-generated content in neurology academia, but their limitations are significant. They should be viewed as support tools rather than definitive arbiters of authenticity. Given the risks of false accusations and missed AI content, human judgment and contextual evaluation remain essential.
Disclaimer: Abstracts were not reviewed by Neurology® and do not reflect the views of Neurology® editors or staff.