Esteva 2017 — Dermatologist-level classification of skin cancer with deep neural networks
Citation
Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017 Feb 2;542(7639):115–118. DOI: 10.1038/nature21056. PMID: 28117445.
Study design and population
Retrospective diagnostic-accuracy validation of an Inception-v3 CNN trained on 129,450 clinical images spanning 2,032 disease classes, with a head-to-head reader study against 21 board-certified dermatologists on two binary tasks: keratinocyte carcinoma vs. benign seborrheic keratosis, and malignant melanoma vs. benign nevi, the latter evaluated on both clinical photography and dermoscopy (three test sets in total). Single-institution (Stanford) development; biopsy-proven test sets.
Reported metrics
- Keratinocyte carcinoma vs. benign seborrheic keratosis: AUC 0.96
- Melanoma vs. benign nevi (clinical photography): AUC 0.94
- Melanoma vs. benign nevi (dermoscopy): AUC 0.91
- CNN sensitivity/specificity operating points match or exceed the mean dermatologist operating point across all three tasks
- 95 % CIs not reported in the primary paper
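The metrics above can be made concrete with a minimal sketch (synthetic scores, not data from the paper): AUC computed as the rank statistic over malignant/benign pairs, and a sensitivity/specificity operating point at an arbitrary probability threshold. All scores, labels, and the 0.5 threshold below are illustrative assumptions.

```python
# Illustrative sketch only: synthetic classifier outputs, not the paper's data.

def roc_auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a randomly
    chosen malignant case scores above a randomly chosen benign case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def operating_point(scores, labels, threshold):
    """Sensitivity and specificity at a given probability threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    sens = tp / sum(labels)
    spec = tn / (len(labels) - sum(labels))
    return sens, spec

# Hypothetical P(malignant) scores with biopsy-confirmed labels (1 = malignant).
scores = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

print(roc_auc(scores, labels))              # 0.9375
print(operating_point(scores, labels, 0.5)) # (0.75, 0.75)
```

Sweeping the threshold traces the ROC curve; the paper's reader study plots each dermatologist's single sensitivity/specificity point against the CNN's full curve.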
Surrogate-to-outcome linkage
Establishes diagnostic accuracy (AUC, sensitivity, and specificity against histopathology) as the accepted output-level surrogate for a dermatology image classifier. The paper positions classification accuracy as a direct proxy for the biopsy/referral decision, which sits on the causal pathway to earlier-stage detection and, in melanoma, to melanoma-specific survival via the AJCC stage-survival gradient.
CRIT1–7 appraisal
| Criterion | Score | Justification |
|---|---|---|
| CRIT1 Relevance | 3 | Direct match — image-based dermatology AI on malignancy classification; surrogate domain is diagnostic accuracy (7GH). |
| CRIT2 Methodology | 2 | Large training corpus; well-defined test sets; head-to-head against 21 dermatologists with histopathology reference. Not a prospective clinical deployment. |
| CRIT3 Reporting | 2 | AUCs and reader operating points reported; no 95 % CIs; methods reproducible. |
| CRIT4 Applicability | 2 | Consistent with intended use (clinician-supervised decision support). Limited Fitzpatrick IV–VI representation. |
| CRIT5 Evidence weight | 1 | Retrospective reader study on curated test sets. |
| CRIT6 Risk of bias | 2 | Spectrum bias (biopsy-preselected lesions); single institution; curated image quality may not generalise. |
| CRIT7 Contribution | 3 | Foundational reference establishing dermatologist-level AI classification; universally cited as the anchor for the diagnostic-accuracy surrogate. |
Aggregate: strong (landmark inclusion).
Limitations and notes
Spectrum bias (biopsy-preselected lesions); curated image quality; Fitzpatrick I–III skin types dominant; single institution. Pair with balancing references (Daneshjou 2022; Han 2018) to delineate generalisability limits.
Strength as anchor
Strong for accepted-surrogate claim (diagnostic accuracy as the canonical MDSW-output endpoint in dermatology AI). Not load-bearing for the quantitative surrogate-to-outcome magnitude (the AJCC staging literature carries that anchor).