Daneshjou 2022 — Disparities in dermatology AI performance on a diverse clinical image set (DDI) [BALANCING]

Citation

Daneshjou R, Vodrahalli K, Novoa RA, Jenkins M, Liang W, Rotemberg V, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022 Aug 12;8(32):eabq6147. DOI: 10.1126/sciadv.abq6147. PMID 35960806.

Study design and population

External-validation and bias-audit study using the Diverse Dermatology Images (DDI) dataset: 656 pathologically confirmed clinical images spanning Fitzpatrick skin types (FST) I–VI. Three state-of-the-art dermatology AI models were evaluated: ModelDerm, DeepDerm, and a model trained on HAM10000.

Reported metrics

DDI overall AUC: ModelDerm 0.65 (95 % CI 0.61–0.70); DeepDerm 0.56 (0.51–0.61); HAM10000 0.67 (0.62–0.71) — 27–36 % drop vs. original benchmark reports
FST I–II vs. V–VI stratification: HAM10000 AUC 0.72 (0.63–0.79) vs. 0.57 (0.48–0.67)
Balanced-accuracy gap: ModelDerm 0.67 (FST I–II) → 0.51 (FST V–VI)
Fine-tuning on DDI partially closed the gap
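
The subgroup gaps implied by the point estimates above can be made explicit with a short calculation. This is a minimal sketch, not part of the study: the helper name subgroup_gap and the dictionary layout are illustrative assumptions, and only the four point estimates are taken from the figures listed here.

```python
from typing import Dict, Tuple

# Point estimates copied from the reported metrics above.
reported: Dict[str, Dict[str, float]] = {
    "HAM10000 AUC": {"FST I-II": 0.72, "FST V-VI": 0.57},
    "ModelDerm balanced accuracy": {"FST I-II": 0.67, "FST V-VI": 0.51},
}


def subgroup_gap(lighter: float, darker: float) -> Tuple[float, float]:
    """Hypothetical helper: (absolute gap, relative drop) between two strata."""
    absolute = round(lighter - darker, 2)
    relative = round((lighter - darker) / lighter, 3)
    return absolute, relative


gaps = {
    name: subgroup_gap(values["FST I-II"], values["FST V-VI"])
    for name, values in reported.items()
}

for name, (absolute, relative) in gaps.items():
    print(f"{name}: absolute gap {absolute:.2f}, relative drop {relative:.1%}")
```

These are gaps between point estimates only. The reported HAM10000 intervals (0.63–0.79 and 0.48–0.67) overlap, so the calculation shows the size of the observed difference rather than its statistical certainty.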

Surrogate-to-outcome linkage

Quantifies skin-tone spectrum bias: the surrogate-to-outcome chain breaks for under-represented populations if diagnostic accuracy is assumed to be uniform across phototypes. This is the MANDATORY balancing reference: it demonstrates that current diagnostic-accuracy performance is overstated for FST IV–VI, and that this gap must be addressed through stratified PMCF performance monitoring and diverse training data.
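
The stratified performance monitoring called for here could take roughly the following form. This is a hedged sketch under stated assumptions: the record layout (FST group, ground-truth label, model score), the function names, the toy data, and the 0.10 alert threshold are all hypothetical and are not drawn from the study or from any PMCF plan.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as P(positive score > negative score), ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(records):
    """records: iterable of (fst_group, label, score); returns AUC per group."""
    labels_by_group = defaultdict(list)
    scores_by_group = defaultdict(list)
    for group, label, score in records:
        labels_by_group[group].append(label)
        scores_by_group[group].append(score)
    return {g: auc(labels_by_group[g], scores_by_group[g]) for g in labels_by_group}


def gap_exceeds(aucs, reference, monitored, max_gap):
    """True when the monitored stratum trails the reference by more than max_gap."""
    return (aucs[reference] - aucs[monitored]) > max_gap


# Toy data, invented purely to exercise the functions.
toy_records = [
    ("I-II", 1, 0.9), ("I-II", 1, 0.8), ("I-II", 0, 0.3), ("I-II", 0, 0.2),
    ("V-VI", 1, 0.6), ("V-VI", 1, 0.4), ("V-VI", 0, 0.5), ("V-VI", 0, 0.3),
]

aucs = stratified_auc(toy_records)
alert = gap_exceeds(aucs, reference="I-II", monitored="V-VI", max_gap=0.10)
print(aucs, alert)
```

A real monitoring pipeline would also need confidence intervals per stratum, since the small per-FST sample sizes noted in this appraisal make point estimates unstable.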

CRIT1–7 appraisal

Criterion	Score	Justification
CRIT1 Relevance	3	Direct — quantifies the generalisability limit of the primary surrogate.
CRIT2 Methodology	3	Purpose-built diverse dataset; multi-model comparison; FST-stratified analysis; fine-tuning experiments.
CRIT3 Reporting	3	AUC and balanced-accuracy with 95 % CIs, stratified by phototype.
CRIT4 Applicability	3	Directly addresses the intended-population equity requirement under MDR Annex I §17.2.
CRIT5 Evidence weight	1	Retrospective external-validation / bias-audit study.
CRIT6 Risk of bias	2	Moderate dataset size (656 images); single-institution curation; Fitzpatrick scale imperfect proxy for melanin; limited long-tail disease coverage.
CRIT7 Contribution	3	MANDATORY balancing reference — quantifies the critical failure mode of the surrogate and anchors the PMCF subgroup-monitoring commitment.

Aggregate: very strong (as a balancing reference).
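
For traceability, the criterion scores in the table above can be totalled as follows. This is an illustrative sketch: it assumes each criterion is scored on a 0–3 scale (the table shows no score above 3, but the scale is not stated here), and the rule that maps a total to the "very strong" label is not given in this record, so none is encoded.

```python
# Scores copied from the CRIT1-7 appraisal table.
crit_scores = {
    "CRIT1 Relevance": 3,
    "CRIT2 Methodology": 3,
    "CRIT3 Reporting": 3,
    "CRIT4 Applicability": 3,
    "CRIT5 Evidence weight": 1,
    "CRIT6 Risk of bias": 2,
    "CRIT7 Contribution": 3,
}

ASSUMED_MAX_PER_CRITERION = 3  # assumption: not stated in this record

total = sum(crit_scores.values())
maximum = ASSUMED_MAX_PER_CRITERION * len(crit_scores)
print(f"Total score: {total} of an assumed maximum of {maximum}")
```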

Limitations and notes

Fitzpatrick scale known to be a coarse melanin proxy; single-institution curation; small per-FST stratum sizes; legacy models not all designed for non-dermoscopic clinical images.

Strength as anchor

Mandatory inclusion. Demonstrates balanced citation practice (per BSI Erin's attention to selective citation) and directly motivates the PMCF stratified-performance-monitoring commitment in R-TF-007-002. Supported by Han 2018 (cross-ethnicity) and informed by Dick 2019 (independent-test-set gap).
