TY - JOUR
T1 - Quality of Information Provided by Artificial Intelligence Chatbots Surrounding the Management of Vestibular Schwannomas
T2 - A Comparative Analysis Between ChatGPT-4 and Claude 2
AU - Borsetto, Daniele
AU - Sia, Egidio
AU - Axon, Patrick
AU - Donnelly, Neil
AU - Tysome, James R
AU - Anschuetz, Lukas
AU - Bernardeschi, Daniele
AU - Capriotti, Vincenzo
AU - Caye-Thomasen, Per
AU - West, Niels Cramer
AU - Erbele, Isaac D
AU - Franchella, Sebastiano
AU - Gatto, Annalisa
AU - Hess-Erga, Jeanette
AU - Kunst, Henricus P M
AU - Marinelli, John P
AU - Mannion, Richard
AU - Panizza, Benedict
AU - Trabalzini, Franco
AU - Obholzer, Rupert
AU - Vaira, Luigi Angelo
AU - Polesel, Jerry
AU - Giudici, Fabiola
AU - Carlson, Matthew L
AU - Tirelli, Giancarlo
AU - Boscolo-Rizzo, Paolo
N1 - Copyright © 2025, Otology & Neurotology, Inc.
PY - 2025/4/1
Y1 - 2025/4/1
N2 - OBJECTIVE: To examine the quality of information provided by the artificial intelligence platforms ChatGPT-4 and Claude 2 surrounding the management of vestibular schwannomas. STUDY DESIGN: Cross-sectional. SETTING: Skull base surgeons from different centers and countries were involved. INTERVENTION: Thirty-six questions regarding vestibular schwannoma management were tested. Artificial intelligence responses were subsequently evaluated by 19 lateral skull base surgeons using the Quality Assessment of Medical Artificial Intelligence (QAMAI) questionnaire, assessing "Accuracy," "Clarity," "Relevance," "Completeness," "Sources," and "Usefulness." MAIN OUTCOME MEASURE: The scores of the answers from both chatbots were collected and analyzed using the Student t test. Analysis of responses grouped by stakeholders was performed with the McNemar test. The Stuart-Maxwell test was used to compare reading levels between the chatbots. The intraclass correlation coefficient was calculated. RESULTS: ChatGPT-4 demonstrated significantly higher quality than Claude 2 in 14 of 36 (38.9%) questions, whereas higher-quality scores for Claude 2 were observed in only 2 (5.6%) answers. The chatbots exhibited variation across the dimensions of "Accuracy," "Clarity," "Completeness," "Relevance," and "Usefulness," with ChatGPT-4 demonstrating statistically significant superior performance. However, no statistically significant difference was found in the assessment of "Sources." Additionally, ChatGPT-4 provided information at a significantly lower reading grade level. CONCLUSIONS: The artificial intelligence platforms failed to consistently provide accurate information surrounding the management of vestibular schwannoma, although ChatGPT-4 achieved significantly higher scores on most analyzed parameters. These findings demonstrate the potential for significant misinformation for patients seeking information through these platforms.
AB - OBJECTIVE: To examine the quality of information provided by the artificial intelligence platforms ChatGPT-4 and Claude 2 surrounding the management of vestibular schwannomas. STUDY DESIGN: Cross-sectional. SETTING: Skull base surgeons from different centers and countries were involved. INTERVENTION: Thirty-six questions regarding vestibular schwannoma management were tested. Artificial intelligence responses were subsequently evaluated by 19 lateral skull base surgeons using the Quality Assessment of Medical Artificial Intelligence (QAMAI) questionnaire, assessing "Accuracy," "Clarity," "Relevance," "Completeness," "Sources," and "Usefulness." MAIN OUTCOME MEASURE: The scores of the answers from both chatbots were collected and analyzed using the Student t test. Analysis of responses grouped by stakeholders was performed with the McNemar test. The Stuart-Maxwell test was used to compare reading levels between the chatbots. The intraclass correlation coefficient was calculated. RESULTS: ChatGPT-4 demonstrated significantly higher quality than Claude 2 in 14 of 36 (38.9%) questions, whereas higher-quality scores for Claude 2 were observed in only 2 (5.6%) answers. The chatbots exhibited variation across the dimensions of "Accuracy," "Clarity," "Completeness," "Relevance," and "Usefulness," with ChatGPT-4 demonstrating statistically significant superior performance. However, no statistically significant difference was found in the assessment of "Sources." Additionally, ChatGPT-4 provided information at a significantly lower reading grade level. CONCLUSIONS: The artificial intelligence platforms failed to consistently provide accurate information surrounding the management of vestibular schwannoma, although ChatGPT-4 achieved significantly higher scores on most analyzed parameters. These findings demonstrate the potential for significant misinformation for patients seeking information through these platforms.
KW - Artificial Intelligence
KW - Cross-Sectional Studies
KW - Humans
KW - Neuroma, Acoustic
KW - Surveys and Questionnaires
KW - AI
KW - Vestibular schwannomas
KW - Chatbots
KW - Claude
KW - GPT
KW - QAMAI
KW - Acoustic neuroma
KW - ChatGPT
KW - VS
UR - http://www.scopus.com/inward/record.url?scp=85218781094&partnerID=8YFLogxK
U2 - 10.1097/MAO.0000000000004410
DO - 10.1097/MAO.0000000000004410
M3 - Journal article
C2 - 39965220
SN - 1531-7129
VL - 46
SP - 432
EP - 436
JO - Otology & neurotology : official publication of the American Otological Society, American Neurotology Society [and] European Academy of Otology and Neurotology
JF - Otology & neurotology : official publication of the American Otological Society, American Neurotology Society [and] European Academy of Otology and Neurotology
IS - 4
ER -