TY - JOUR
T1 - Interobserver Agreement and Performance of Concurrent AI Assistance for Radiographic Evaluation of Knee Osteoarthritis
AU - Brejnebøl, Mathias W
AU - Lenskjold, Anders
AU - Ziegeler, Katharina
AU - Ruitenbeek, Huib
AU - Müller, Felix C
AU - Nybing, Janus U
AU - Visser, Jacob J
AU - Schiphouwer, Loes M
AU - Jasper, Jorrit
AU - Bashian, Behschad
AU - Cao, Haoyin
AU - Muellner, Maximilian
AU - Dahlmann, Sebastian A
AU - Radev, Dimitar I
AU - Ganestam, Ann
AU - Nielsen, Camilla T
AU - Stroemmen, Carsten U
AU - Oei, Edwin H G
AU - Hermann, Kay-Geert A
AU - Boesen, Mikael
PY - 2024/7
Y1 - 2024/7
N2 - Background Due to conflicting findings in the literature, there are concerns about a lack of objectivity in grading knee osteoarthritis (KOA) on radiographs. Purpose To examine how artificial intelligence (AI) assistance affects the performance and interobserver agreement of radiologists and orthopedists of various experience levels when evaluating KOA on radiographs according to the established Kellgren-Lawrence (KL) grading system. Materials and Methods In this retrospective observer performance study, consecutive standing knee radiographs from patients with suspected KOA were collected from three participating European centers between April 2019 and May 2022. Each center recruited four readers across radiology and orthopedic surgery at in-training and board-certified experience levels. KL grading (KL-0 = no KOA, KL-4 = severe KOA) on the frontal view was assessed by readers with and without assistance from a commercial AI tool. The majority vote of three musculoskeletal radiology consultants established the reference standard. The ordinal receiver operating characteristic method was used to estimate grading performance. Light kappa was used to estimate interrater agreement, and bootstrapped t statistics were used to compare groups. Results Seventy-five studies were included from each center, totaling 225 studies (mean patient age, 55 years ± 15 [SD]; 113 female patients). The KL grades were KL-0, 24.0% (n = 54); KL-1, 28.0% (n = 63); KL-2, 21.8% (n = 49); KL-3, 18.7% (n = 42); and KL-4, 7.6% (n = 17). Eleven readers completed their readings. Three of the six junior readers showed higher KL grading performance with versus without AI assistance (area under the receiver operating characteristic curve, 0.81 ± 0.017 [SEM] vs 0.88 ± 0.011 [P < .001]; 0.76 ± 0.018 vs 0.86 ± 0.013 [P < .001]; and 0.89 ± 0.011 vs 0.91 ± 0.009 [P = .008]). Interobserver agreement for KL grading among all readers was higher with versus without AI assistance (κ = 0.77 ± 0.018 [SEM] vs 0.85 ± 0.013; P < .001). Board-certified radiologists achieved almost perfect agreement for KL grading when assisted by AI (κ = 0.90 ± 0.01), which was higher than that achieved by the reference readers independently (κ = 0.84 ± 0.017; P = .01). Conclusion AI assistance increased junior readers' radiographic KOA grading performance and increased interobserver agreement for osteoarthritis grading across all readers and experience levels. Published under a CC BY 4.0 license. Supplemental material is available for this article.
AB - Background Due to conflicting findings in the literature, there are concerns about a lack of objectivity in grading knee osteoarthritis (KOA) on radiographs. Purpose To examine how artificial intelligence (AI) assistance affects the performance and interobserver agreement of radiologists and orthopedists of various experience levels when evaluating KOA on radiographs according to the established Kellgren-Lawrence (KL) grading system. Materials and Methods In this retrospective observer performance study, consecutive standing knee radiographs from patients with suspected KOA were collected from three participating European centers between April 2019 and May 2022. Each center recruited four readers across radiology and orthopedic surgery at in-training and board-certified experience levels. KL grading (KL-0 = no KOA, KL-4 = severe KOA) on the frontal view was assessed by readers with and without assistance from a commercial AI tool. The majority vote of three musculoskeletal radiology consultants established the reference standard. The ordinal receiver operating characteristic method was used to estimate grading performance. Light kappa was used to estimate interrater agreement, and bootstrapped t statistics were used to compare groups. Results Seventy-five studies were included from each center, totaling 225 studies (mean patient age, 55 years ± 15 [SD]; 113 female patients). The KL grades were KL-0, 24.0% (n = 54); KL-1, 28.0% (n = 63); KL-2, 21.8% (n = 49); KL-3, 18.7% (n = 42); and KL-4, 7.6% (n = 17). Eleven readers completed their readings. Three of the six junior readers showed higher KL grading performance with versus without AI assistance (area under the receiver operating characteristic curve, 0.81 ± 0.017 [SEM] vs 0.88 ± 0.011 [P < .001]; 0.76 ± 0.018 vs 0.86 ± 0.013 [P < .001]; and 0.89 ± 0.011 vs 0.91 ± 0.009 [P = .008]). Interobserver agreement for KL grading among all readers was higher with versus without AI assistance (κ = 0.77 ± 0.018 [SEM] vs 0.85 ± 0.013; P < .001). Board-certified radiologists achieved almost perfect agreement for KL grading when assisted by AI (κ = 0.90 ± 0.01), which was higher than that achieved by the reference readers independently (κ = 0.84 ± 0.017; P = .01). Conclusion AI assistance increased junior readers' radiographic KOA grading performance and increased interobserver agreement for osteoarthritis grading across all readers and experience levels. Published under a CC BY 4.0 license. Supplemental material is available for this article.
KW - Humans
KW - Female
KW - Observer Variation
KW - Male
KW - Osteoarthritis, Knee/diagnostic imaging
KW - Middle Aged
KW - Retrospective Studies
KW - Artificial Intelligence
KW - Radiography/methods
KW - Aged
UR - http://www.scopus.com/inward/record.url?scp=85198298796&partnerID=8YFLogxK
U2 - 10.1148/radiol.233341
DO - 10.1148/radiol.233341
M3 - Journal article
C2 - 38980184
SN - 0033-8419
VL - 312
SP - e233341
JO - Radiology
JF - Radiology
IS - 1
M1 - e233341
ER -