Tschandl 2020 — Human–computer collaboration for skin cancer recognition
Citation
Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, et al. Human–computer collaboration for skin cancer recognition. Nat Med. 2020 Aug;26(8):1229–1234. DOI: 10.1038/s41591-020-0942-0. PMID 32572267.
Study design and population
Pre-registered, international, web-based reader study of three AI decision-support formats. 302 physicians (169 board-certified dermatologists, 77 residents, 38 GPs, 18 other) classified dermoscopic images drawn from a 1,511-image, HAM10000-derived test set spanning 7 diagnostic categories, with and without AI support.
Reported metrics
- Unaided multiclass accuracy 63.6 % (95 % CI 62.6–64.5)
- AI-assisted (multiclass probability) accuracy 77.0 % (95 % CI 76.2–77.9)
- Absolute uplift +13.3 pp (p < 0.001); largest gain among the least-experienced clinicians (arithmetic sketched after this list)
- Identified a safety hazard: faulty AI output can mislead even expert clinicians
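
How the reported CIs and the uplift relate can be checked with standard binomial arithmetic. The sketch below is illustrative only: the Wilson interval and the rating counts are assumptions of this note, not the paper's published analysis; a naive difference of the rounded point estimates gives 13.4 pp, whereas the paper itself reports +13.3 pp.

```python
# Minimal sketch of the arithmetic behind the figures above. The rating
# counts are hypothetical placeholders chosen to land near the reported
# point estimates; they are NOT the study's raw data.
from math import sqrt


def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% interval for a binomial proportion (accuracy)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return centre - half, centre + half


# Hypothetical counts (not from the paper) approximating the reported accuracies.
unaided_correct, unaided_total = 6_360, 10_000    # ~63.6 % unaided
assisted_correct, assisted_total = 7_700, 10_000  # ~77.0 % AI-assisted

p_unaided = unaided_correct / unaided_total
p_assisted = assisted_correct / assisted_total
lo_u, hi_u = wilson_ci(unaided_correct, unaided_total)
lo_a, hi_a = wilson_ci(assisted_correct, assisted_total)

print(f"unaided  accuracy {p_unaided:.1%}  (95% CI {lo_u:.1%}-{hi_u:.1%})")
print(f"assisted accuracy {p_assisted:.1%}  (95% CI {lo_a:.1%}-{hi_a:.1%})")
# Naive difference of rounded point estimates is 13.4 pp; the paper's own
# analysis of the underlying ratings reports +13.3 pp.
print(f"absolute uplift {(p_assisted - p_unaided) * 100:.1f} pp")
```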
Surrogate-to-outcome linkage
Provides direct evidence that AI decision support improves clinician classification accuracy, which is the exact operational claim of a Class IIb dermatology CDS. The magnitude of the uplift (~+13 pp, largest for non-specialists) maps onto the device's intended benefit of reducing diagnostic error at the primary-care / teledermatology triage step, on the causal path to earlier appropriate treatment.
CRIT1–7 appraisal
| Criterion | Score | Justification |
|---|---|---|
| CRIT1 Relevance | 3 | Direct match — clinician + AI decision-support workflow on dermoscopic classification. |
| CRIT2 Methodology | 3 | Large, international, pre-registered design with multiple AI-support formats; 302 physicians; reference standard histopathology/consensus. |
| CRIT3 Reporting | 3 | Accuracy with 95 % CIs reported; safety-hazard identification documented. |
| CRIT4 Applicability | 3 | Workflow analogous to CDS use; tested across dermatologist / resident / GP tiers. |
| CRIT5 Evidence weight | 2 | Large prospective reader study (not RCT, not meta-analysis). |
| CRIT6 Risk of bias | 2 | Simulation, not deployment; HAM10000 phototype skew; documented automation-bias risk with faulty AI output. |
| CRIT7 Contribution | 3 | Core anchor for the directional claim — AI support translates to classification improvement, with effect size quantified. |
Aggregate: very strong.
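
For traceability, the aggregate verdict can also be expressed programmatically. The snippet below is a hypothetical rubric sketch: the 0–3 scale, the simple summation and the banding thresholds are assumptions of this note; the source states only the per-criterion scores above and the verdict "very strong".

```python
# Illustrative roll-up of the CRIT1-7 scores from the table above.
# Scale (0-3), summation and thresholds are assumptions, not the source's rubric.
CRIT_SCORES = {
    "CRIT1 Relevance": 3,
    "CRIT2 Methodology": 3,
    "CRIT3 Reporting": 3,
    "CRIT4 Applicability": 3,
    "CRIT5 Evidence weight": 2,
    "CRIT6 Risk of bias": 2,
    "CRIT7 Contribution": 3,
}

total = sum(CRIT_SCORES.values())   # 19
maximum = 3 * len(CRIT_SCORES)      # 21

# Hypothetical banding; thresholds are not defined in the source appraisal.
band = "very strong" if total >= 18 else "strong" if total >= 14 else "moderate"
print(f"CRIT total {total}/{maximum} -> {band}")
```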
Limitations and notes
Simulated reader workflow rather than real-world deployment; underlying phototype imbalance in HAM10000; the safety hazard of faulty AI output is explicitly characterised (a feature of the evidence rather than a flaw, and it belongs in the risk-management narrative).
Strength as anchor
Very strong for the directional and operational-mechanism claim. Effect sizes are reported with 95 % CIs, which allows the uplift to be cited quantitatively rather than only directionally. The documented finding that faulty AI output can mislead even expert clinicians also supports the integrator-responsibility language in the CER.