Daneshjou 2022 — Disparities in dermatology AI performance on a diverse clinical image set (DDI) [BALANCING]
Citation
Daneshjou R, Vodrahalli K, Novoa RA, Jenkins M, Liang W, Rotemberg V, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022 Aug 12;8(32):eabq6147. DOI: 10.1126/sciadv.abq6147. PMID 35960806.
Study design and population
Retrospective external-validation and bias-audit study using the Diverse Dermatology Images (DDI) dataset: 656 pathologically confirmed clinical images spanning Fitzpatrick skin types (FST) I–VI, curated at a single institution. Three previously published dermatology AI models were evaluated: ModelDerm, DeepDerm, and a model trained on HAM10000.
Reported metrics
- DDI overall AUC: ModelDerm 0.65 (95 % CI 0.61–0.70); DeepDerm 0.56 (0.51–0.61); HAM10000 0.67 (0.62–0.71). This is a 27–36 % drop relative to the performance originally reported for these models.
- FST I–II vs. V–VI stratification: HAM10000 AUC 0.72 (0.63–0.79) vs. 0.57 (0.48–0.67)
- Balanced-accuracy gap: ModelDerm 0.67 (FST I–II) → 0.51 (FST V–VI)
- Fine-tuning on DDI images partially closed the FST I–II vs. V–VI performance gap
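The stratified figures above follow a pattern that PMCF monitoring can reuse: compute AUC per FST stratum and attach a confidence interval. The block below is a minimal sketch of that pattern, not the authors' analysis code. It assumes a percentile bootstrap for the interval (the interval method used in the paper is not stated in this record), and the function names and toy records are hypothetical and are not DDI data.

```python
import random
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for AUC (an assumed method, see lead-in)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contained a single class; AUC undefined, draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def stratified_auc(records, n_boot=1000, seed=0):
    """records: iterable of (stratum, label, score). Returns {stratum: (auc, lower, upper)}."""
    by_stratum = defaultdict(lambda: ([], []))
    for stratum, label, score in records:
        by_stratum[stratum][0].append(label)
        by_stratum[stratum][1].append(score)
    return {
        stratum: (auc(ys, ss), *bootstrap_ci(ys, ss, n_boot=n_boot, seed=seed))
        for stratum, (ys, ss) in by_stratum.items()
    }


# Hypothetical toy records (stratum, malignant label, model score); NOT DDI data.
toy_records = [
    ("I-II", 1, 0.9), ("I-II", 0, 0.2), ("I-II", 1, 0.7), ("I-II", 0, 0.4),
    ("V-VI", 1, 0.6), ("V-VI", 0, 0.7), ("V-VI", 1, 0.8), ("V-VI", 0, 0.3),
]
result = stratified_auc(toy_records, n_boot=200)
```

With strata as small as in this toy example the intervals are very wide, which is the same concern recorded for the per-FST stratum sizes in DDI.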
Surrogate-to-outcome linkage
Quantifies skin-tone spectrum bias: the surrogate-to-outcome chain breaks for under-represented populations if diagnostic accuracy is assumed to be uniform across phototypes. This is the MANDATORY balancing reference. It demonstrates that current diagnostic-accuracy performance claims are overstated for darker phototypes (measured here for FST V–VI), and that this must be addressed through FST-stratified PMCF performance monitoring and more diverse training data.
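The size of the subgroup gap can be read directly from the reported point estimates. The following sketch does that arithmetic using only the values listed under Reported metrics; the dictionary layout and function names are illustrative assumptions, and no acceptance threshold is implied.

```python
# Values copied from the "Reported metrics" list: (point estimate, CI lower, CI upper).
# ModelDerm balanced accuracy is listed without a CI in this record, hence None.
reported = {
    "HAM10000 AUC": {"I-II": (0.72, 0.63, 0.79), "V-VI": (0.57, 0.48, 0.67)},
    "ModelDerm balanced accuracy": {"I-II": (0.67, None, None), "V-VI": (0.51, None, None)},
}


def intervals_overlap(a, b):
    """True if two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def subgroup_gap(light, dark):
    """Absolute and relative drop from the FST I-II estimate to the FST V-VI estimate."""
    absolute = light[0] - dark[0]
    summary = {
        "absolute": round(absolute, 3),
        "relative": round(absolute / light[0], 3),
    }
    if None not in light[1:] and None not in dark[1:]:
        summary["ci_overlap"] = intervals_overlap(light[1:], dark[1:])
    return summary


gaps = {name: subgroup_gap(v["I-II"], v["V-VI"]) for name, v in reported.items()}
for name, summary in gaps.items():
    print(name, summary)
```

For HAM10000 the absolute AUC gap is 0.15 and the two reported 95 % CIs overlap (0.63–0.79 vs. 0.48–0.67), consistent with the small per-stratum sample sizes noted under Limitations.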
CRIT1–7 appraisal
| Criterion | Score | Justification |
|---|---|---|
| CRIT1 Relevance | 3 | Direct — quantifies the generalisability limit of the primary surrogate. |
| CRIT2 Methodology | 3 | Purpose-built diverse dataset; multi-model comparison; FST-stratified analysis; fine-tuning experiments. |
| CRIT3 Reporting | 3 | AUC and balanced-accuracy with 95 % CIs, stratified by phototype. |
| CRIT4 Applicability | 3 | Directly addresses the intended-population equity requirement under MDR Annex I §17.2. |
| CRIT5 Evidence weight | 1 | Retrospective external-validation / bias-audit study. |
| CRIT6 Risk of bias | 2 | Moderate dataset size (656 images); single-institution curation; Fitzpatrick scale imperfect proxy for melanin; limited long-tail disease coverage. |
| CRIT7 Contribution | 3 | MANDATORY balancing reference — quantifies the critical failure mode of the surrogate and anchors the PMCF subgroup-monitoring commitment. |
Aggregate: very strong (as a balancing reference).
Limitations and notes
The Fitzpatrick scale is known to be a coarse proxy for melanin content; the dataset was curated at a single institution; per-FST stratum sizes are small; and not all of the legacy models were designed for non-dermoscopic clinical images.
Strength as anchor
Mandatory inclusion. Demonstrates balanced citation practice (per BSI Erin's attention to selective citation) and directly motivates the PMCF stratified-performance-monitoring commitment in R-TF-007-002. Supported by Han 2018 (cross-ethnicity) and informed by Dick 2019 (independent-test-set gap).