Esteva 2017 — Dermatologist-level classification of skin cancer with deep neural networks
Citation
Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017 Feb 2;542(7639):115–118. DOI: 10.1038/nature21056. PMID: 28117445.
Study design and population
Retrospective diagnostic-accuracy validation of an Inception-v3 CNN trained on 129,450 clinical images spanning 2,032 disease classes, with a head-to-head reader study against 21 board-certified dermatologists on two binary tasks: keratinocyte carcinoma vs. benign seborrheic keratosis, and malignant melanoma vs. benign nevi, the latter evaluated on both clinical photography and dermoscopy (three test sets in total). Single-institution (Stanford) development; biopsy-proven test sets.
Reported metrics
- Keratinocyte carcinoma vs. benign seborrheic keratosis: AUC 0.96
- Melanoma vs. benign nevi (clinical photography): AUC 0.94
- Melanoma vs. benign nevi (dermoscopy): AUC 0.91
- CNN sensitivity/specificity operating points match or exceed the mean dermatologist operating point across all three tasks
- 95 % CIs not reported in the primary paper
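The metrics above can be made concrete with a minimal sketch (synthetic scores, not data from the paper): AUC computed as the rank statistic over malignant/benign pairs, and a sensitivity/specificity operating point at an arbitrary probability threshold. All scores, labels, and the 0.5 threshold below are illustrative assumptions.

```python
# Illustrative sketch only: synthetic classifier outputs, not the paper's data.

def roc_auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a randomly
    chosen malignant case scores above a randomly chosen benign case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def operating_point(scores, labels, threshold):
    """Sensitivity and specificity at a given probability threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    sens = tp / sum(labels)
    spec = tn / (len(labels) - sum(labels))
    return sens, spec

# Hypothetical P(malignant) scores with biopsy-confirmed labels (1 = malignant).
scores = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

print(roc_auc(scores, labels))              # 0.9375
print(operating_point(scores, labels, 0.5)) # (0.75, 0.75)
```

Sweeping the threshold traces the ROC curve; the paper's reader study plots each dermatologist's single sensitivity/specificity point against the CNN's full curve.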
Surrogate-to-outcome linkage
Establishes diagnostic accuracy (AUC, sensitivity, and specificity against histopathology) as the accepted output-level surrogate for a dermatology image classifier. The paper positions classification accuracy as a direct proxy for the biopsy/referral decision, which sits on the causal pathway to earlier-stage detection and, in melanoma, to melanoma-specific survival via the AJCC stage-survival gradient.
CRIT1–7 appraisal
| Criterion | Score | Justification |
|---|---|---|
| CRIT1 Relevance | 3 | Direct match — image-based dermatology AI on malignancy classification; surrogate domain is diagnostic accuracy (7GH). |
| CRIT2 Methodology | 2 | Large training corpus; well-defined test sets; head-to-head against 21 dermatologists with histopathology reference. Not a prospective clinical deployment. |
| CRIT3 Reporting | 2 | AUCs and reader operating points reported; no 95 % CIs; methods reproducible. |
| CRIT4 Applicability | 2 | Consistent with intended use (clinician-supervised decision support). Limited Fitzpatrick IV–VI representation. |
| CRIT5 Evidence weight | 1 | Retrospective reader study on curated test sets. |
| CRIT6 Risk of bias | 2 | Spectrum bias (biopsy-preselected lesions); single institution; curated image quality may not generalise. |
| CRIT7 Contribution | 3 | Foundational reference establishing dermatologist-level AI classification; universally cited as the anchor for the diagnostic-accuracy surrogate. |
Aggregate: strong (landmark inclusion).
Limitations and notes
Spectrum bias (biopsy-preselected lesions); curated image quality; Fitzpatrick I–III skin types dominant; single institution. Pair with balancing references (Daneshjou 2022; Han 2018) to delineate generalisability limits.
Strength as anchor
Strong for accepted-surrogate claim (diagnostic accuracy as the canonical MDSW-output endpoint in dermatology AI). Not load-bearing for the quantitative surrogate-to-outcome magnitude (the AJCC staging literature carries that anchor).