Tschandl 2020 — Human–computer collaboration for skin cancer recognition
Citation
Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, et al. Human–computer collaboration for skin cancer recognition. Nat Med. 2020 Aug;26(8):1229–1234. DOI: 10.1038/s41591-020-0942-0. PMID 32572267.
Study design and population
Pre-registered, international, web-based reader study of three AI decision-support formats. 302 physicians (169 board-certified dermatologists, 77 residents, 38 GPs, 18 other) classified dermoscopic images drawn from a 1,511-image, HAM10000-derived test set spanning 7 diagnostic categories, with and without AI support.
Reported metrics
- Unaided multiclass accuracy 63.6 % (95 % CI 62.6–64.5)
- AI-assisted (multiclass probability) accuracy 77.0 % (95 % CI 76.2–77.9)
- Absolute uplift +13.3 pp (p < 0.001); largest gain among the least-experienced clinicians (arithmetic sketched after this list)
- Identified a safety hazard: faulty AI output can mislead even expert clinicians
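
How the reported CIs and the uplift relate can be checked with standard binomial arithmetic. The sketch below is illustrative only: the Wilson interval and the rating counts are assumptions of this note, not the paper's published analysis; a naive difference of the rounded point estimates gives 13.4 pp, whereas the paper itself reports +13.3 pp.

```python
# Minimal sketch of the arithmetic behind the figures above. The rating
# counts are hypothetical placeholders chosen to land near the reported
# point estimates; they are NOT the study's raw data.
from math import sqrt


def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% interval for a binomial proportion (accuracy)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return centre - half, centre + half


# Hypothetical counts (not from the paper) approximating the reported accuracies.
unaided_correct, unaided_total = 6_360, 10_000    # ~63.6 % unaided
assisted_correct, assisted_total = 7_700, 10_000  # ~77.0 % AI-assisted

p_unaided = unaided_correct / unaided_total
p_assisted = assisted_correct / assisted_total
lo_u, hi_u = wilson_ci(unaided_correct, unaided_total)
lo_a, hi_a = wilson_ci(assisted_correct, assisted_total)

print(f"unaided  accuracy {p_unaided:.1%}  (95% CI {lo_u:.1%}-{hi_u:.1%})")
print(f"assisted accuracy {p_assisted:.1%}  (95% CI {lo_a:.1%}-{hi_a:.1%})")
# Naive difference of rounded point estimates is 13.4 pp; the paper's own
# analysis of the underlying ratings reports +13.3 pp.
print(f"absolute uplift {(p_assisted - p_unaided) * 100:.1f} pp")
```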
Surrogate-to-outcome linkage
Provides direct evidence that AI decision support improves clinician classification accuracy, which is the exact operational claim of a Class IIb dermatology CDS. The magnitude of the uplift (~+13 pp, largest for non-specialists) maps onto the device's intended benefit of reducing diagnostic error at the primary-care / teledermatology triage step, on the causal path to earlier appropriate treatment.
CRIT1–7 appraisal
| Criterion | Score | Justification |
|---|---|---|
| CRIT1 Relevance | 3 | Direct match — clinician + AI decision-support workflow on dermoscopic classification. |
| CRIT2 Methodology | 3 | Large, international, pre-registered design with multiple AI-support formats; 302 physicians; reference standard histopathology/consensus. |
| CRIT3 Reporting | 3 | Accuracy with 95 % CIs reported; safety-hazard identification documented. |
| CRIT4 Applicability | 3 | Workflow analogous to CDS use; tested across dermatologist / resident / GP tiers. |
| CRIT5 Evidence weight | 2 | Large prospective reader study (not RCT, not meta-analysis). |
| CRIT6 Risk of bias | 2 | Simulation, not deployment; HAM10000 phototype skew; documented automation-bias risk with faulty AI output. |
| CRIT7 Contribution | 3 | Core anchor for the directional claim — AI support translates to classification improvement, with effect size quantified. |
Aggregate: very strong.
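
For traceability, the aggregate verdict can also be expressed programmatically. The snippet below is a hypothetical rubric sketch: the 0–3 scale, the simple summation and the banding thresholds are assumptions of this note; the source states only the per-criterion scores above and the verdict "very strong".

```python
# Illustrative roll-up of the CRIT1-7 scores from the table above.
# Scale (0-3), summation and thresholds are assumptions, not the source's rubric.
CRIT_SCORES = {
    "CRIT1 Relevance": 3,
    "CRIT2 Methodology": 3,
    "CRIT3 Reporting": 3,
    "CRIT4 Applicability": 3,
    "CRIT5 Evidence weight": 2,
    "CRIT6 Risk of bias": 2,
    "CRIT7 Contribution": 3,
}

total = sum(CRIT_SCORES.values())   # 19
maximum = 3 * len(CRIT_SCORES)      # 21

# Hypothetical banding; thresholds are not defined in the source appraisal.
band = "very strong" if total >= 18 else "strong" if total >= 14 else "moderate"
print(f"CRIT total {total}/{maximum} -> {band}")
```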
Limitations and notes
Simulated reader workflow rather than real-world deployment; underlying phototype imbalance in HAM10000; the safety hazard of faulty AI output is explicitly characterised (a feature of the evidence rather than a flaw, and it belongs in the risk-management narrative).
Strength as anchor
Very strong for the directional and operational-mechanism claim. Effect sizes are reported with 95 % CIs, which allows the uplift to be cited quantitatively rather than only directionally. The documented finding that faulty AI output can mislead even expert clinicians also supports the integrator-responsibility language in the CER.