To meet the growing demand for accurate EEG-based epileptic seizure detection, we propose EEG-VL, a novel vision–language framework that treats EEG signals as visual patterns and integrates them with large language models.
A pretrained EfficientNet encoder is used to extract abstract visual features from EEG representations, which are embedded into structured prompts and processed by the Qwen language model. This design synergistically combines the spatial modeling capabilities of convolutional networks with the semantic reasoning strengths of large language models. We conduct experiments on two widely used benchmark datasets for seizure detection, the TUH EEG Seizure Corpus (TUSZ) and the CHB-MIT Scalp EEG Database. To address the class imbalance commonly present in seizure datasets, we adopt a logit adjustment strategy based on label distribution priors.
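As an illustration, logit adjustment with label distribution priors can be sketched as below. This is a minimal, hedged example of the general technique (adding a scaled log-prior term to the per-class logits so the skewed label distribution is compensated); the temperature `tau` and the prior values are illustrative assumptions, not the paper's actual configuration.

```python
import math

def adjust_logits(logits, class_priors, tau=1.0):
    """Logit adjustment: shift each class logit by tau * log(prior).

    Applied inside the training loss, this makes the cross-entropy
    account for class imbalance; the rare class receives the larger
    (more negative) shift, so the model must produce a correspondingly
    stronger raw logit to predict it. tau = 1.0 recovers the standard
    balanced form.
    """
    return [z + tau * math.log(p) for z, p in zip(logits, class_priors)]

# Illustrative binary case: ~5% seizure prevalence (hypothetical numbers)
priors = [0.95, 0.05]   # [background, seizure] label frequencies
raw = [1.2, 0.8]        # model logits for one EEG window
adjusted = adjust_logits(raw, priors, tau=1.0)
```

Here the seizure logit is shifted down by roughly `log(0.05) ≈ -3.0` while the background logit barely moves, which is the intended imbalance correction.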
We compare against several established EEG networks, including EEGNet, EEG-TCNet, ShallowCNN-Tran, EEGNet-Tran, and EEG2ViT. Extensive experiments on the TUSZ and CHB-MIT datasets demonstrate that EEG-VL achieves state-of-the-art performance. On TUSZ, our model attains an AUROC of 0.9466 and an AUPRC of 0.7599, surpassing the previous best results by 0.80% and 8.19%, respectively. On CHB-MIT, EEG-VL achieves an average AUROC of 0.91 and an average AUPRC of 0.47, exceeding the previous best performance by 1% and 14.6%, respectively.
Our study underscores the potential of the proposed vision–language paradigm for robust, scalable, and clinically applicable EEG-based seizure detection. By integrating a pretrained visual encoder with a large language model, EEG-VL delivers strong performance across multiple public datasets, showing promise for clinical adoption.