Abstract
Humans have been shown to exhibit biases when reading medical images, raising the question of whether their disease gradings are uniform. Artificial intelligence (AI) tools trained on human-labeled data may inherit this human non-uniformity. In this study, we used a radiographic knee osteoarthritis external validation dataset of 50 patients and a six-year retrospective consecutive clinical cohort of 8,273 patients. An FDA-approved and CE-marked AI tool was tested for potential non-uniformity in Kellgren-Lawrence grades between the right and left sides of the images: we flipped the images horizontally so that a left knee looked like a right knee and vice versa. On human review, the AI tool showed non-uniformity, with 20-22% disagreement between original and flipped images on the external validation dataset and 13.6% on the cohort. However, we found no evidence of a significant difference in accuracy compared to senior radiologists on the external validation dataset, nor of age or sex bias on the cohort. AI non-uniformity can inflate the evaluated performance against humans, but image areas with inferior performance should be investigated.
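The flip-and-compare check described above can be illustrated with a minimal sketch. This is not the authors' code: `grade_knee()` is a hypothetical wrapper around the (unnamed here) AI tool, and the images are assumed to be single-knee radiographs readable by Pillow.

```python
# Minimal sketch of the left/right uniformity check, under the assumptions above.
from pathlib import Path

from PIL import Image


def grade_knee(image: Image.Image) -> int:
    """Hypothetical stand-in for the FDA-approved/CE-marked AI tool.

    Should return the Kellgren-Lawrence grade (0-4) for one knee image.
    """
    raise NotImplementedError("replace with a call to the actual AI tool")


def disagreement_rate(image_paths: list[Path]) -> float:
    """Fraction of images whose KL grade changes after a horizontal flip."""
    disagreements = 0
    for path in image_paths:
        original = Image.open(path)
        # Flip horizontally so a left knee looks like a right knee and vice versa.
        flipped = original.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
        if grade_knee(original) != grade_knee(flipped):
            disagreements += 1
    return disagreements / len(image_paths)
```

A disagreement rate above zero on otherwise identical (mirrored) anatomy would indicate the kind of left/right non-uniformity the study reports (20-22% on the validation dataset, 13.6% on the cohort).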
| Original language | English |
|---|---|
| Article number | 26782 |
| Journal | Scientific Reports |
| Volume | 14 |
| Issue number | 1 |
| Pages (from-to) | 26782 |
| ISSN | 2045-2322 |
| DOI | |
| Status | Published - 5 Nov 2024 |