TY - JOUR
T1 - Comparative evaluation of artificial intelligence chatbots in answering electroencephalography-related questions
AU - Proença, Soraia
AU - Soares, Joana Isabel
AU - Parra, Joana
AU - Maia, Gisela
AU - Silva, Sílvia
AU - Leite, Juliana
AU - Beniczky, Sándor
AU - Jesus-Ribeiro, Joana
N1 - Publisher Copyright:
© 2025 International League Against Epilepsy.
PY - 2025/12/16
Y1 - 2025/12/16
N2 - Objective: As large language models (LLMs) become more accessible, they may be used to explain challenging EEG concepts to nonspecialists. This study aimed to compare the accuracy, completeness, and readability of EEG-related responses from three LLM-based chatbots and to assess inter-rater agreement. Methods: One hundred questions, covering 10 EEG categories, were entered into ChatGPT, Copilot, and Gemini. Six raters from the clinical neurophysiology field (two physicians, two teachers, and two technicians) evaluated the responses. Accuracy was rated on a 6-point scale, completeness on a 3-point scale, and readability was assessed using the Automated Readability Index (ARI). We used a repeated-measures ANOVA for group differences in accuracy and readability, the intraclass correlation coefficient (ICC) for inter-rater reliability, and a two-way ANOVA, with chatbot and raters as factors, for completeness. Results: Total accuracy was significantly higher for ChatGPT (mean ± SD 4.54 ± 0.05) than for Copilot (mean ± SD 4.11 ± 0.08) and Gemini (mean ± SD 4.16 ± 0.13) (p < .001). ChatGPT's lowest performance was in normal variants and patterns of uncertain significance (mean ± SD 3.10 ± 0.14), whereas Copilot and Gemini performed lowest in ictal EEG patterns (mean ± SD 2.93 ± 0.11 and 3.37 ± 0.24, respectively). Although inter-rater agreement for accuracy was excellent among physicians (ICC = .969) and teachers (ICC = .926), it was poor for technicians in several EEG categories. ChatGPT achieved significantly higher completeness scores than Copilot (p < .001) and Gemini (p = .01). ChatGPT's text (ARI mean ± SD 17.41 ± 2.38) was less readable than that of Copilot (ARI mean ± SD 11.14 ± 2.60) (p < .001) and Gemini (ARI mean ± SD 14.16 ± 3.33). Significance: The chatbots achieved relatively high accuracy, but not without flaws, emphasizing that the information they provide requires verification. ChatGPT outperformed the other chatbots in accuracy and completeness, though at the expense of readability. The lower inter-rater agreement among technicians may reflect a gap in standardized training or practical experience, potentially affecting the consistency of EEG-related content assessment.
AB - Objective: As large language models (LLMs) become more accessible, they may be used to explain challenging EEG concepts to nonspecialists. This study aimed to compare the accuracy, completeness, and readability of EEG-related responses from three LLM-based chatbots and to assess inter-rater agreement. Methods: One hundred questions, covering 10 EEG categories, were entered into ChatGPT, Copilot, and Gemini. Six raters from the clinical neurophysiology field (two physicians, two teachers, and two technicians) evaluated the responses. Accuracy was rated on a 6-point scale, completeness on a 3-point scale, and readability was assessed using the Automated Readability Index (ARI). We used a repeated-measures ANOVA for group differences in accuracy and readability, the intraclass correlation coefficient (ICC) for inter-rater reliability, and a two-way ANOVA, with chatbot and raters as factors, for completeness. Results: Total accuracy was significantly higher for ChatGPT (mean ± SD 4.54 ± 0.05) than for Copilot (mean ± SD 4.11 ± 0.08) and Gemini (mean ± SD 4.16 ± 0.13) (p < .001). ChatGPT's lowest performance was in normal variants and patterns of uncertain significance (mean ± SD 3.10 ± 0.14), whereas Copilot and Gemini performed lowest in ictal EEG patterns (mean ± SD 2.93 ± 0.11 and 3.37 ± 0.24, respectively). Although inter-rater agreement for accuracy was excellent among physicians (ICC = .969) and teachers (ICC = .926), it was poor for technicians in several EEG categories. ChatGPT achieved significantly higher completeness scores than Copilot (p < .001) and Gemini (p = .01). ChatGPT's text (ARI mean ± SD 17.41 ± 2.38) was less readable than that of Copilot (ARI mean ± SD 11.14 ± 2.60) (p < .001) and Gemini (ARI mean ± SD 14.16 ± 3.33). Significance: The chatbots achieved relatively high accuracy, but not without flaws, emphasizing that the information they provide requires verification. ChatGPT outperformed the other chatbots in accuracy and completeness, though at the expense of readability. The lower inter-rater agreement among technicians may reflect a gap in standardized training or practical experience, potentially affecting the consistency of EEG-related content assessment.
KW - artificial intelligence
KW - ChatGPT
KW - Copilot
KW - electroencephalography
KW - Gemini
KW - large language model
UR - http://www.scopus.com/inward/record.url?scp=105024907491&partnerID=8YFLogxK
U2 - 10.1002/epd2.70156
DO - 10.1002/epd2.70156
M3 - Journal article
C2 - 41399926
AN - SCOPUS:105024907491
SN - 1294-9361
JO - Epileptic Disorders
JF - Epileptic Disorders
ER -