Inter-Rater Reliability of EEG-Based Encephalopathy Grading
Anika Zahoor1, Jayme Banks1, Ryan Tesh1, Christine Eckhardt1, Haoqi Sun1, Adam Greenblatt2, Aline Herlopian3, Ioannis Karakis4, Roohi Katyal5, Chetan Nayak2, Marcus Ng6, Jonathan Williams2, Irfan Sheikh1, Fábio Nascimento2, Michael Westover1
1Neurology, Massachusetts General Hospital, 2Neurology, Washington University in St. Louis, 3Neurology, Yale University, 4Neurology, Emory University, 5Neurology, Ochsner LSU Health Shreveport, 6Neurology, University of Manitoba
Objective:

To measure inter-rater reliability of experts in assessing encephalopathy severity using the VE-CAM-S grading system.

Background:

The VE-CAM-S (Visual EEG Confusion Assessment Method – Severity) scale quantifies encephalopathy severity based on EEG features. However, inter-rater reliability of experts using the scale has yet to be assessed.

Design/Methods:

We created an online test with thirty-two 15-second EEG samples. For each sample, raters indicated the presence or absence of each of 29 EEG features, 11 of which are used in the VE-CAM-S. The gold standard was based on the consensus of 3 authors (IS, FN, MBW). Ten experts from 6 institutions participated. We quantified performance by the average Spearman correlation of VE-CAM-S scores with the gold standard, and by average sensitivity and specificity. We also performed a qualitative analysis to identify the feature-recognition errors that most affected VE-CAM-S scores.
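The performance metrics above can be sketched in code. This is an illustrative sketch only, not the authors' analysis pipeline; the helper functions and the binary feature encoding are assumptions for demonstration.

```python
# Sketch: scoring one rater's feature calls against a gold standard.
# Spearman rho is the Pearson correlation of the ranks; sensitivity and
# specificity are computed per feature from binary present/absent calls.

def rank(values):
    """Return 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation between two score vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def sensitivity_specificity(rater, gold):
    """Sensitivity/specificity of one rater's binary calls vs. gold standard."""
    tp = sum(r and g for r, g in zip(rater, gold))
    tn = sum((not r) and (not g) for r, g in zip(rater, gold))
    fp = sum(r and (not g) for r, g in zip(rater, gold))
    fn = sum((not r) and g for r, g in zip(rater, gold))
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```

In practice a library routine such as `scipy.stats.spearmanr` would typically be used; the pure-Python version here just makes the computation explicit.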

Results:

The average [95% CI] correlation between VE-CAM-S scores and the gold standard was 0.73 [0.59-0.86]. Specificity was very high (>90%) for all features except generalized delta (77%). Sensitivity was high (>70%) for all features except brief generalized attenuations (69%), generalized periodic discharges (67%), generalized theta (63%), BIRDs (57%), generalized alpha (57%), extreme delta brushes (EDB; 50%), and generalized beta (50%). Probable reasons for errors included the subtlety of some findings; confusion between similar findings (e.g., generalized beta vs. myogenic artifact, burst suppression vs. brief generalized attenuations); and failure to correctly recognize BIRDs (mislabeled as focal IEDs) and EDB (mislabeled as GRDA). The largest errors occurred when experts missed or falsely identified features that carry higher weight in the VE-CAM-S scoring rubric.

Conclusions:

Expert agreement in VE-CAM-S scoring is high. Error analysis identified several ways to improve future versions, including breaking high-stakes features into smaller components; creating a “cheat sheet” of scored examples so that scorers can choose the closest match; and designing teaching materials to help scorers recognize subtle variants of high-stakes patterns.

10.1212/WNL.0000000000202773