Surrogate-Endpoint Validity in Dermatology AI — Structured Literature Review

Purpose: Pillar 1 Valid Clinical Association (VCA) evidence under MDCG 2020-1 §4.4 for a Class IIb AI-based dermatology clinical decision support device. Anchors the three surrogate-endpoint families on which the three declared clinical benefits rest. Supplementary appendix to R-TF-015-011 State of the Art; source document for the expanded Pillar 1 narrative in R-TF-015-003 Clinical Evaluation Report; referenced from the R-TF-015-001 Clinical Evaluation Plan evidence-hierarchy table.

1. Purpose and scope​

1.1 Why this review exists​

The evidence base for the device demonstrates clinical benefit indirectly, via performance-based endpoints (diagnostic accuracy, sensitivity / specificity, AUC) and workflow-related endpoints (referral appropriateness, waiting-time reduction, severity-score-driven treatment decisions), rather than directly via patient-outcome endpoints (e.g., melanoma-specific survival, time-to-treatment, DLQI). This indirect demonstration is consistent with the intended use of the device as a clinical decision support system — the device does not independently trigger clinical actions.

For a Class IIb device, indirect demonstration of clinical benefit is subject to elevated scrutiny. Current external methodological expert advice (received 2026-04-17) identifies two named mitigations at resubmission; a structured literature review establishing that the selected surrogate endpoints are accepted proxies for patient-relevant outcomes in dermatology is one of the two. This document delivers that review. The complementary mitigation — a formalised PMCF strategy — is delivered via R-TF-007-002 PMCF Plan.

1.2 Scope​

The review covers three surrogate-endpoint families, each mapped 1:1 to a declared clinical benefit:

| Surrogate family | Declared clinical benefit | Canonical performance metrics |
| --- | --- | --- |
| Diagnostic accuracy (including malignancy detection / triage) | 7GH | Top-1 / Top-5 accuracy · sensitivity · specificity · AUC · PPV / NPV · concordance with reference standard |
| Severity scoring (objective, reproducible, longitudinal) | 5RB | ICC · correlation with reference scale (PASI, EASI, SCORAD, SALT, IGA, GAGS) · kappa · inter-observer variability reduction |
| Referral optimisation / care-pathway (including remote care) | 3KX | Referral appropriateness rate · waiting-time reduction · remote-assessment adequacy · proportion manageable remotely |

1.3 What the review is and is not​

This review is a structured literature review anchored to peer-reviewed primary studies, systematic reviews, meta-analyses, regulator-accepted guidelines and international consensus statements. It is not a full PRISMA systematic review with a flow diagram; this narrower scope follows the recommendation of the external methodological review received 2026-04-17.

The review is the formal, literature-anchored Pillar 1 VCA evidence. It does not claim direct patient-outcome improvement from our own studies; the claim is: "the surrogate endpoints we measure are accepted proxies for patient-relevant outcomes in the peer-reviewed dermatology literature." Our own clinical performance evidence (Pillar 2 analytic validity and Pillar 3 clinical validity) sits in separate sections.


2. Regulatory framework​

2.1 MDCG 2020-1 Pillar 1 — Valid Clinical Association​

MDCG 2020-1 §4.4 defines three pillars of evidence for Medical Device Software (MDSW):

  • Pillar 1 — Valid Clinical Association (VCA): the extent to which the MDSW output is clinically associated with the targeted physiological state or clinical condition, grounded either in a scientific framework or in a sufficient level of evidence in the broad medical community.
  • Pillar 2 — Technical Performance: the ability of the MDSW to accurately, reliably and precisely generate the intended output from the input data.
  • Pillar 3 — Clinical Performance: the ability of the MDSW to yield clinically relevant output consistent with the intended purpose in the target population.

This review is the Pillar 1 evidence layer, supplying the literature anchor demonstrating that the surrogate-endpoint families selected for the device are accepted proxies for patient-relevant outcomes in peer-reviewed dermatology.

2.2 MDCG 2020-6 — sufficient clinical evidence​

MDCG 2020-6 requires a structured justification that clinical evidence is sufficient for the benefit–risk determination under MDR Annex XIV Part A. The three-pillar evidence model is the mechanism for that justification; Pillar 1 literature evidence is the base of the pyramid for Class IIb software-as-a-medical-device.

2.3 MDR Annex I §6.1(a) and (b)​

MDR Annex I §6.1(a) requires that devices achieve the performances intended by the manufacturer; §6.1(b) requires design and manufacture suited to the intended purpose, taking into account generally acknowledged state of the art. This review demonstrates that the surrogate-endpoint families are generally acknowledged state of the art for dermatology clinical decision support — satisfying the "generally acknowledged" component of §6.1(b).

2.4 MEDDEV 2.7/1 Rev 4 literature-review methodology​

The CRIT1–7 appraisal framework applied to each included reference (appraisal-log.md) is consistent with MEDDEV 2.7/1 Rev 4 Section 9.3.1 (appraisal of data) and is the same framework used in the R-TF-015-001 Clinical Evaluation Plan literature-review methodology.


3. Methodology​

3.1 Search strategy​

Searches were conducted across PubMed, Scopus, Google Scholar and Cochrane, complemented by regulator-published guideline documents (EMA, FDA), international consensus statements (HOME, NAAF, HiSCR), and cited-reference snowballing from landmark papers. Three external deep-research tools (Perplexity, Gemini Deep Research, Claude) were used in parallel to triangulate candidate citations; all included references were individually verified against PubMed records and publisher landing pages for DOI, author list, volume / pages and primary effect sizes (see research-results/ for raw tool outputs).

Searches were scoped to human peer-reviewed literature on dermatology diagnostic AI, severity-score instruments (PASI, EASI, SCORAD, SALT, IGA, GAGS, HiSCR), teledermatology, AI-assisted triage, melanoma and non-melanoma skin-cancer stage-outcome epidemiology, and regulator-accepted clinical-endpoint guidance. No date restriction was imposed; preference was given to publications from 2010 onward where modern AI methodology is applicable. Foundational instrument-definition papers (Fredriksson & Pettersson 1978; European Task Force on AD 1993; Olsen 2004; Doshi 1997) were retained regardless of age.

3.2 Inclusion and exclusion​

Included:

  • Peer-reviewed primary studies reporting diagnostic-accuracy, severity-score, or care-pathway metrics against a reference standard
  • Systematic reviews and meta-analyses aggregating the above
  • Regulator-published guidelines defining accepted clinical endpoints
  • International consensus statements (HOME, NAAF) defining severity-score instruments and thresholds
  • Landmark outcome studies anchoring stage-at-detection to survival (AJCC evidence base; surgical-delay cohort studies)

Excluded:

  • Non-peer-reviewed sources (blogs, vendor white papers, unreviewed preprints)
  • Case reports and small case series where larger-cohort evidence exists
  • Computer-science-only studies without clinical dermatology reference standard

3.3 Appraisal — CRIT1–7​

Each included reference is appraised under the seven CRIT criteria used in the R-TF-015-001 CEP literature-review methodology:

  • CRIT1 Relevance to device / indication / surrogate domain
  • CRIT2 Quality of study methodology (design, sample size, controls)
  • CRIT3 Quality of reporting (endpoint definitions, statistical analysis, 95 % CIs)
  • CRIT4 Applicability to intended population (image-based dermatology, clinician-supervised use)
  • CRIT5 Evidence weight (1 = retrospective / validation; 2 = RCT / prospective cohort / consensus; 3 = meta-analysis / systematic review / regulatory guideline)
  • CRIT6 Risk of bias
  • CRIT7 Contribution to specific surrogate-validity claim

Each reference is scored 1–3 per CRIT with justification in references/<domain>/<first-author-year-keyword>.md; the rolling table is appraisal-log.md.
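The per-reference scoring and rolling-table workflow above can be sketched as a small aggregation routine. This is a minimal illustration only; the record fields and sort order are assumptions, not the actual tooling behind appraisal-log.md:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Appraisal:
    """One appraised reference. Field names are illustrative,
    not the actual schema behind appraisal-log.md."""
    reference: str         # e.g. "esteva-2017-nature" (hypothetical slug)
    domain: str            # "7GH", "5RB" or "3KX"
    scores: Dict[str, int] # CRIT1..CRIT7 -> 1..3

    def validate(self) -> None:
        crits = {f"CRIT{i}" for i in range(1, 8)}
        assert set(self.scores) == crits, "all seven CRITs must be scored"
        assert all(s in (1, 2, 3) for s in self.scores.values()), "scores are 1-3"

    def total(self) -> int:
        self.validate()
        return sum(self.scores.values())

def rolling_table(appraisals: List[Appraisal]) -> List[Tuple[str, str, int, int]]:
    """Rolling-table rows: reference, domain, CRIT5 evidence weight, total,
    sorted by descending total score."""
    return [(a.reference, a.domain, a.scores["CRIT5"], a.total())
            for a in sorted(appraisals, key=lambda a: -a.total())]

# Invented example: a reference scoring 3 on every CRIT
demo = Appraisal("esteva-2017-nature", "7GH", {f"CRIT{i}": 3 for i in range(1, 8)})
print(demo.total())  # 21: maximum possible appraisal score
```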

3.4 What each domain must establish​

For each of the three surrogate-endpoint families, the review establishes three claims:

  1. Accepted-surrogate claim — the surrogate is an accepted clinical endpoint in peer-reviewed dermatology literature, with published validation.
  2. Directional claim — improvements in the surrogate are associated with improvements in patient-relevant outcomes.
  3. Quantitative claim — the magnitude of the surrogate-to-outcome association has been estimated in at least one peer-reviewed source.

3.5 Evidence-base size​

| Surrogate domain | Minimum peer-reviewed references | Target | Achieved |
| --- | --- | --- | --- |
| Diagnostic accuracy (7GH) | ≥ 8 | 10–12 | 13 |
| Severity scoring (5RB) | ≥ 6 | 8–10 | 10 |
| Referral optimisation / pathway (3KX) | ≥ 6 | 8–10 | 9 |
| Total | ≥ 20 | 26–32 | 32 |

Mandatory balancing references (phototype bias, consumer-AI heterogeneity) are intentionally included so that the review is balanced rather than selectively cited.


4. Surrogate domain 1 — Diagnostic accuracy (benefit 7GH)​

4.1 Accepted-surrogate claim​

Diagnostic accuracy — expressed as sensitivity, specificity, ROC-AUC, top-k concordance with histopathology and κ agreement with face-to-face dermatology — is the canonical primary endpoint across every landmark dermatology-AI pivotal study (Esteva 2017 Nature; Haenssle 2018 Ann Oncol; Haenssle 2020 Ann Oncol — CE-marked CNN; Tschandl 2020 Nat Med; Liu 2020 Nat Med) and is the analytic unit in the highest-tier aggregate evidence (Dick 2019 JAMA Dermatol meta-analysis of 70 pooled studies; Salinas 2024 NPJ Digit Med meta-analysis of 19 studies). For the remote-assessment analogue, the same endpoint class anchors Finnane 2017 (JAMA Dermatol systematic review) and the Chuchu 2018 Cochrane review. These metrics are codified in STARD 2015, STARD-AI, TRIPOD+AI (Collins 2024 BMJ), CONSORT-AI / SPIRIT-AI reporting standards and in the EMA 2024 reflection paper on AI across the medicinal-product lifecycle. A Class IIb AI CDS device is benchmarked on exactly this endpoint class.
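The canonical metrics named above all derive from the same 2 × 2 confusion matrix of index-test calls against the reference standard. A minimal sketch of those definitions (the counts are invented for illustration, not device data):

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic-accuracy metrics from a 2x2 confusion matrix
    (index test vs. reference standard, e.g. histopathology)."""
    return {
        "sensitivity": tp / (tp + fn),          # true-positive rate
        "specificity": tn / (tn + fp),          # true-negative rate
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Invented example: 100 malignant and 900 benign lesions
m = diagnostic_metrics(tp=90, fp=180, fn=10, tn=720)
# sensitivity 0.90, specificity 0.80 — illustrative only
```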

4.2 Directional claim​

The causal chain "better diagnostic accuracy → earlier stage at detection → earlier definitive excision → reduced morbidity and mortality" rests on the melanoma stage-at-detection → survival gradient — the single most robust surrogate-to-outcome linkage in oncodermatology. Multiple AI systems achieve sensitivities at or above dermatologists on melanoma recognition (Esteva 2017; Haenssle 2018 / 2020; Tschandl 2020), and AI-assisted diagnostic workflows reduce unnecessary benign excisions prospectively in real-world clinical practice (Winkler 2023 JAMA Dermatol — 19.2 % reduction in benign-nevus excisions with CE-marked CNN support, with zero missed melanomas in the supported arm). The key caveat for the CER is that no AI-dermatology RCT with mortality or stage-shift as a primary endpoint exists as of 2026-04; inference rests on reader-study equivalence plus the independently established AJCC stage-survival gradient. Generalisability is further constrained by skin-tone and phototype bias (Daneshjou 2022 Sci Adv; Han 2018 J Invest Dermatol), which can break the chain in under-represented subgroups — a residual risk declared and addressed in the PMCF plan.

4.3 Quantitative claim​

  • AI vs. dermatologists — landmark reader studies. Esteva 2017 AUC 0.91–0.96 across three binary tasks; Haenssle 2018 CNN AUC 0.86 vs. dermatologist mean 0.79 (p < 0.01); Haenssle 2020 CE-marked CNN sensitivity 95.0 % (95 % CI 83.5–98.6), specificity 76.7 % (95 % CI 64.6–85.6), AUC 0.918 (95 % CI 0.866–0.970).
  • Pooled meta-analytic performance. Dick 2019 CAD melanoma pooled sensitivity 0.74 (95 % CI 0.66–0.80), specificity 0.84 (95 % CI 0.79–0.88); on independent test sets sensitivity drops to 0.51 (95 % CI 0.34–0.69), quantifying the spectrum-bias effect. Salinas 2024 pooled AI sensitivity 87.0 % (95 % CI 81.7–90.9), specificity 77.1 % (95 % CI 69.8–83.0); AI markedly superior to non-specialists (sens 92.5 % vs. 64.6 %).
  • Human + AI collaboration. Tschandl 2020 multiclass accuracy 63.6 % (95 % CI 62.6–64.5) → 77.0 % (95 % CI 76.2–77.9) with AI support; largest gain +13 pp in least-experienced clinicians.
  • Stage-at-detection → survival (outcome anchor). Gershenwald 2017 AJCC 8th edition 5-year melanoma-specific survival by substage: IA 99 %, IIC 82 %, IIIC 69 %, IIID 32 %. Time-to-surgery → OS (Conic 2018 NCDB): adjusted mortality hazard increases 5 % at 30–59 days, 41 % at > 119 days vs. ≤ 30 days.
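The 95 % CIs quoted throughout this section are binomial proportion intervals on sensitivity or specificity. A Wilson score interval is one common construction; the sketch below uses invented counts (the cited studies report their own interval methods):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score 95% CI for a binomial proportion, e.g. sensitivity
    = melanomas correctly flagged / melanomas in the test set."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Invented example: 57 of 60 melanomas detected (point estimate 95.0 %)
lo, hi = wilson_ci(57, 60)  # ≈ (0.863, 0.983)
```

The asymmetry of the interval around the 0.95 point estimate mirrors the asymmetric CIs reported in the reader studies above.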

4.4 Magnitude of clinical importance in the device's context​

The device is a clinician-supervised decision support tool; the operative sub-claim is that AI-assisted diagnostic accuracy in primary-care and teledermatology settings raises non-specialist performance to dermatologist-level and reduces both false negatives and false positives. Salinas 2024 and Liu 2020 demonstrate the non-specialist-uplift effect. The magnitude of the downstream benefit — shifted stage distribution and reduced time-to-treatment — depends on stage-at-detection → survival and time-to-surgery → OS, both quantified above at effect sizes regulator-familiar from standard oncology evidence.


5. Surrogate domain 2 — Severity scoring (benefit 5RB)​

5.1 Accepted-surrogate claim​

PASI, EASI, SCORAD, IGA / vIGA-AD, SALT and GAGS are established regulatory endpoints across the European and US drug-approval dossiers for chronic inflammatory skin disease.

  • Psoriasis. PASI (Fredriksson & Pettersson 1978) is the primary efficacy variable in the EMA guideline on psoriasis clinical investigation (CHMP/EWP/2454/02, adopted 2004), with PASI-75 / PASI-90 / PASI-100 as response thresholds and IGA 0/1 as mandatory co-primary.
  • Atopic dermatitis. EASI is the single core clinical-signs instrument in the international HOME core outcome set (Schmitt 2014 HOME IV), paired with POEM for symptoms and DLQI for HRQoL. Dupilumab SOLO 1 / SOLO 2 (Simpson 2016 NEJM) used IGA 0/1 and EASI-75 as co-primary / key secondary endpoints.
  • Alopecia areata. SALT (Olsen 2004; NAAF guidelines) is the primary endpoint in baricitinib BRAVE-AA1 / BRAVE-AA2 (King 2022 NEJM), ritlecitinib ALLEGRO and deuruxolitinib pivotal trials — enabling the first systemic-therapy approvals.
  • Acne. GAGS (Doshi 1997) and IGA are used alongside lesion counts per FDA acne guidance.

These instruments are the instantiation of the severity-score surrogate in regulatory practice.

5.2 Directional claim​

Improvements in severity scores track improvements in patient-relevant outcomes with a clear dose-response between depth of response and magnitude of HRQoL gain. In adalimumab trials (Revicki 2008), PASI-90–100 groups achieved > 10-point DLQI reductions vs. significantly smaller reductions in lower-responder groups (p < 0.001). In atopic dermatitis, EASI-75 responders in dupilumab SOLO 1 / 2 (Simpson 2016) showed parallel, clinically meaningful reductions in POEM, peak-pruritus NRS and DLQI. In alopecia areata, SALT ≤ 20 responders in BRAVE-AA (King 2022) had substantially higher rates of ClinRO eyebrow / eyelash response and patient-reported QoL improvement. The European treat-to-target consensus (Mrowietz 2011) formalises the operational rule: ΔPASI ≥ 75 % → continue; ΔPASI 50–< 75 % with DLQI ≤ 5 → continue, otherwise modify — explicitly coupling severity-score change with HRQoL threshold.

5.3 Quantitative claim​

  • Aggregate PASI ↔ DLQI. Mattei 2014 systematic review of 13 biologic RCTs: r² = 0.80 between PASI % improvement and DLQI change at trial-arm level (individual-patient correlation is moderate, Spearman ρ ≈ 0.40–0.57 in cohort data; the CER states this explicitly).
  • Regulatory-endpoint magnitudes (pivotal trials). Simpson 2016 dupilumab SOLO 1 week-16 EASI-75 51 % vs. 14.7 % placebo; mean EASI % reduction 72 % vs. 38 %. King 2022 baricitinib BRAVE-AA1 week-36 SALT ≤ 20: 38.8 % (4 mg) vs. 6.2 % placebo; difference 32.6 pp (95 % CI 25.6–39.5).
  • Manual severity-score reliability ceiling. Fink 2018 image-based PASI inter-rater ICC 0.895, mean absolute difference 3.3 points — enough to flip PASI-75 / PASI-90 classification at borderline. Gourraud 2012 simulation showed inter-rater variability crosses therapeutic-decision thresholds.
  • AI-PASI analytic validity. Schaap 2022 CNN-vs-real-life-physician trunk ICCs 0.580–0.793, with CNN outperforming physicians on area scoring (0.793 vs. 0.694). Huang 2023 AI PASI MAE 2.05 points; AI outperformed 43-dermatologist mean by 33.2 % on PASI estimation.
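The κ and ICC figures above quantify rater agreement. For a categorical call such as responder / non-responder classification, Cohen's κ corrects raw agreement for chance; a minimal sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters
    over the same cases (e.g. PASI-75 responder yes/no calls)."""
    assert len(rater_a) == len(rater_b), "raters must score the same cases"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Expected chance agreement from each rater's marginal label frequencies
    expected = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / n**2
    return (observed - expected) / (1 - expected)

# Invented example: two raters classify 10 patients as responder (R) / non-responder (N)
a = ["R", "R", "N", "R", "N", "N", "R", "R", "N", "R"]
b = ["R", "N", "N", "R", "N", "R", "R", "R", "N", "R"]
k = cohens_kappa(a, b)  # ≈ 0.58: 80 % raw agreement, corrected for chance
```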

5.4 Magnitude of clinical importance in the device's context​

The device-generated severity score is the same measurement class as the regulator-accepted trial endpoint. The downstream treatment-decision chain is codified by Mrowietz 2011; the HRQoL benefit of reaching PASI-90 vs. PASI-75 is quantified by Mattei 2014 and Revicki 2008. Improving measurement reliability (by eliminating the ~3-point PASI inter-rater variance documented in Fink 2018) directly raises the fidelity of treatment-escalation decisions, on the causal path to improved disease control and HRQoL.


6. Surrogate domain 3 — Referral optimisation / care-pathway (benefit 3KX)​

6.1 Accepted-surrogate claim​

Referral appropriateness, waiting-time reduction, remote-assessment adequacy and proportion manageable remotely are widely accepted surrogate endpoints in peer-reviewed dermatology, in HTA evaluations (NICE, NIHR, CADTH, MSAC) and in national telemedicine roadmaps (NHS England Teledermatology Roadmap 2019). Systematic reviews in high-impact dermatology journals adopt these metrics when hard endpoints (melanoma-specific mortality) are unattainable within trial horizons (Finnane 2017 JAMA Dermatol; Chuchu 2018 Cochrane). Snoswell 2016 JAMA Dermatol synthesises 14 economic evaluations using cost-per-avoided-visit and cost-per-QALY as the health-economic anchor.

6.2 Directional claim​

Improvements in these metrics translate into faster specialist access for genuine cases, improved geographic and equitable access, reduced system bottlenecks and non-inferior clinical outcomes at lower cost. Eminović 2009 cluster RCT — referral reduction 39.0 % vs. 18.3 % (difference 20.7 pp; 95 % CI 8.5–32.9) — is the highest-evidence anchor for referral appropriateness in an EU primary-care setting. Whited 2013 RCT demonstrates 9-month clinical-course equivalence between SAF teledermatology and conventional consultation for skin-cancer triage; Armstrong 2018 JAMA Netw Open pragmatic equivalency RCT demonstrates preserved disease-control outcomes (PASI / BSA) under online care for chronic inflammatory disease. Finnane 2017 and Bourkas 2023 / Chuchu 2018 document non-inferior diagnostic concordance of teledermatology with in-person assessment, the prerequisite for the surrogate to hold.

The key caveat for the CER is that no RCT directly demonstrates melanoma-specific mortality improvement from AI-triage dermatology; the inference runs via Conic 2018 (time-to-surgery → OS) and Gershenwald 2017 (stage → MSS). AI-triage-specific evidence (Jain 2021, Liu 2020) extends the teledermatology evidence base to AI-assisted non-specialist decision-making in a directly analogous workflow.

6.3 Quantitative claim​

  • Referral reduction. Eminović 2009 absolute difference 20.7 pp (95 % CI 8.5–32.9). Giavina-Bianchi 2020 (n = 30,976; 55,624 lesions): 53 % of cases managed in primary care; 78 % reduction in mean waiting time (6.7 months → 1.5 months).
  • Waiting-time (EU setting). Moreno-Ramirez 2007 (Seville, 2,009 teleconsultations): filtering 51.20 % (95 % CI 49.00–53.40); waiting interval 12.31 days (teledermatology) vs. 88.62 days (letter referral) — ~76-day reduction, ~7× faster (p < 0.001).
  • Clinical-outcome equivalence. Armstrong 2018 PASI between-group difference −0.27 (95 % CI −0.85 to 0.31); BSA −0.05 % (95 % CI −1.58 to 1.48) — within ±3 equivalence bound. Whited 2013 9-month clinical-course outcomes: no significant difference (p = 0.88).
  • AI-decision-support uplift for non-specialists. Jain 2021 PCP agreement with dermatologist reference 48 % → 58 % with AI support (OR 2.0; 95 % CI 1.7–2.4); NP 46 % → 58 % (OR 2.2; 95 % CI 1.9–2.6).
  • Health economics. Snoswell 2016 systematic review — store-and-forward teledermatology cost-effective or cost-saving in the majority of 14 economic evaluations, scaling with patient–dermatologist distance.
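As a sanity check, the Giavina-Bianchi 2020 waiting-time figure follows directly from the reported means:

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in a mean waiting time (or any mean interval)."""
    return 1 - after / before

# Giavina-Bianchi 2020: mean waiting time 6.7 -> 1.5 months
r = relative_reduction(6.7, 1.5)  # ≈ 0.78, matching the reported 78 % reduction
```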

6.4 Magnitude of clinical importance in the device's context​

The device operates as a clinician-supervised CDS in teledermatology and primary-care contexts. The quantitative anchors above — 20.7 pp referral reduction, 78 % waiting-time reduction, +10–12 pp non-specialist diagnostic uplift with AI support, clinical-outcome equivalence to face-to-face care — match the device's intended operational benefit. The patient-outcome linkage runs via Conic 2018 (time-to-surgery → OS), which converts the waiting-time surrogate into a mortality-hazard gradient.


7. Cross-domain synthesis — the causal pathway​

The three surrogate families articulate a single, regulator-facing causal chain from device output to patient benefit.

Diagnostic accuracy (Domain 1) is the proximal surrogate. Systematic-review and meta-analytic evidence (Dick 2019; Salinas 2024; Freeman 2020 BMJ) plus landmark reader studies (Esteva 2017; Haenssle 2018 / 2020; Tschandl 2020; Liu 2020) establish that AI-assisted classification matches or exceeds clinicians on reference-standard tasks, with the largest uplift for non-specialists (Tschandl 2020; Liu 2020; Jain 2021 — extending into Domain 3). This accuracy surrogate is anchored to a patient-relevant outcome via two independent, high-weight linkages: the AJCC staging evidence base (Gershenwald 2017), quantifying the stage-to-MSS gradient in melanoma; and the surgical-timing literature (Conic 2018), quantifying the hazard of time-to-surgery delay.

Severity scoring (Domain 2) provides the parallel causal chain for chronic inflammatory disease. EU and US regulators have adopted PASI, IGA, EASI, SCORAD, SALT and HiSCR as primary efficacy endpoints (EMA 2004 CHMP/EWP/2454/02; Simpson 2016 NEJM; King 2022 NEJM), with the HOME core outcome set (Schmitt 2014) and the European treat-to-target consensus (Mrowietz 2011) explicitly coupling severity-score change to PRO and DLQI improvement. Manual inter-rater variability (Fink 2018; Gourraud 2012) creates the clinical headroom for automated AI scoring (Schaap 2022; Huang 2023) to improve the fidelity of the treatment-decision surrogate, and thereby feed into the durable disease-control and HRQoL outcomes that PASI / EASI / SCORAD improvement predicts (Mattei 2014 r² = 0.80 at trial-arm level).

Referral optimisation (Domain 3) provides the system-level instantiation. Teledermatology RCTs and large cohorts (Eminović 2009; Whited 2013; Armstrong 2018; Moreno-Ramirez 2007; Giavina-Bianchi 2020) demonstrate equivalent clinical outcomes, reduced waiting times and preserved referral accuracy at lower cost. AI-triage-specific studies (Jain 2021) extend the same evidence to AI-assisted non-specialist workflows, and cost-analyses (Snoswell 2016) close the health-economic loop. The patient-outcome linkage runs via Conic 2018 for skin-cancer pathways.

Together, the three domains support the Pillar 1 chain for a Class IIb AI-dermatology CDS:

AI-derived diagnostic accuracy and severity scores feed correct, faster, more equitable referral and treatment decisions, which — through regulator-accepted and epidemiologically anchored surrogate-to-outcome links — produce earlier-stage detection (reduced melanoma morbidity and mortality), better-controlled chronic disease (improved PROs and HRQoL), and improved equitable access (reduced waiting times, non-inferior outcomes at lower cost).

This closes MDCG 2020-1 Pillar 1 for each of the three declared clinical benefits.


8. Limitations and residual uncertainty​

Four gaps materially constrain the strength of the surrogate-to-outcome inference; each is declared here and addressed in the PMCF plan (R-TF-007-002) or in the limitations section of the CER (R-TF-015-003).

  1. No AI-dermatology RCT with hard-outcome endpoints. No peer-reviewed study demonstrates that deployment of an AI dermatology CDS reduces melanoma-specific mortality or drives stage-shift as a primary endpoint, as of the search date (April 2026). The Pillar 1 chain therefore relies on bridging evidence — accuracy equivalence (Domain 1) plus the AJCC staging gradient and surgical-timing hazard (outcome anchors) — rather than a direct mortality endpoint. This is the canonical surrogate-endpoint argument permitted under MDCG 2020-1 for Class IIb software but must be declared as indirect.

  2. Phototype-stratified generalisability is under-evidenced. Daneshjou 2022 quantifies 27–36 % AUC drops on FST V–VI relative to benchmark datasets; prospective NHS evaluations (Marsden 2023 DERM-003, with 2.2 % FST IV–VI) are underpowered in dark skin — meaning the surrogate-to-outcome chain is least well-validated in exactly the populations most at risk of diagnostic delay. This is the single most important declared PMCF commitment: stratified performance monitoring across Fitzpatrick groups.

  3. Long-term automated-severity-scoring outcome data are thin. Analytic-validity evidence (Schaap 2022; Huang 2023) is strong; however, no prospective study has linked automated PASI / EASI / SCORAD deployment to durable real-world DLQI / POEM improvement or reduced treatment-escalation events. The CER treats this as a declared residual risk addressed by post-market outcome capture.

  4. AI-triage evidence lags teledermatology evidence. The dominant Finnane, Snoswell, Moreno-Ramirez, Whited and Armstrong evidence base tests human-teledermatologist workflows, not autonomous or semi-autonomous AI triage. AI-triage-specific prospective evidence (Jain 2021; Liu 2020) is limited to diagnostic-uplift studies rather than pragmatic effectiveness trials with hard endpoints. PMCF will include comparative-effectiveness evaluation against the teledermatology standard of care where feasible.

These declared gaps do not invalidate the claimed Valid Clinical Association — they define the evidentiary boundaries of the surrogate-to-outcome inference and the parameters of the PMCF plan.


9. Conclusion​

Each of the three surrogate-endpoint families on which the declared clinical benefits rest is an accepted proxy for patient-relevant outcomes in the peer-reviewed dermatology literature:

  • Diagnostic accuracy (7GH) is the canonical primary endpoint for dermatology AI across landmark reader studies and two systematic-review-grade meta-analyses, anchored to melanoma-specific survival via the AJCC stage-at-detection evidence base and to overall survival via surgical-timing cohort evidence.
  • Severity scoring (5RB) is the regulator-accepted primary endpoint for psoriasis (EMA 2004, PASI / IGA), atopic dermatitis (HOME 2014, EASI / IGA; Simpson 2016 dupilumab pivotal trials), alopecia areata (Olsen 2004, SALT; King 2022 baricitinib pivotal trials) and acne (FDA acne guidance, IGA / GAGS), with a quantified PASI ↔ DLQI trial-arm linkage of r² = 0.80.
  • Referral optimisation / care-pathway (3KX) is a widely accepted endpoint class in HTA evaluations and systematic reviews, with cluster-RCT-level evidence (Eminović 2009) for referral reduction, equivalency-RCT evidence (Armstrong 2018) for chronic-disease outcome preservation, and large real-world evidence (Giavina-Bianchi 2020; Moreno-Ramirez 2007) for waiting-time reduction.

Residual uncertainty is declared and addressed through the PMCF plan and the CER limitations section. The three surrogate-endpoint families together constitute the Pillar 1 Valid Clinical Association evidence for the device under MDCG 2020-1 §4.4.


10. Reference list​

Domain 1 — Diagnostic accuracy​

  1. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–118. doi:10.1038/nature21056
  2. Haenssle HA, Fink C, Schneiderbauer R, et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann Oncol. 2018;29(8):1836–1842. doi:10.1093/annonc/mdy166
  3. Haenssle HA, Fink C, Toberer F, et al. Man against machine reloaded: performance of a market-approved convolutional neural network in classifying a broad spectrum of skin lesions. Ann Oncol. 2020;31(1):137–143. doi:10.1016/j.annonc.2019.10.013
  4. Tschandl P, Rinner C, Apalla Z, et al. Human–computer collaboration for skin cancer recognition. Nat Med. 2020;26(8):1229–1234. doi:10.1038/s41591-020-0942-0
  5. Liu Y, Jain A, Eng C, et al. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26(6):900–908. doi:10.1038/s41591-020-0842-3
  6. Dick V, Sinz C, Mittlböck M, Kittler H, Tschandl P. Accuracy of computer-aided diagnosis of melanoma: a meta-analysis. JAMA Dermatol. 2019;155(11):1291–1299. doi:10.1001/jamadermatol.2019.1375
  7. Salinas MP, Sepúlveda J, Hidalgo L, et al. A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. NPJ Digit Med. 2024;7(1):125. doi:10.1038/s41746-024-01103-x
  8. Winkler JK, Blum A, Kommoss K, et al. Assessment of diagnostic performance of dermatologists cooperating with a convolutional neural network in a prospective clinical study: human with machine. JAMA Dermatol. 2023;159(6):621–627. doi:10.1001/jamadermatol.2023.0905
  9. Gershenwald JE, Scolyer RA, Hess KR, et al. Melanoma staging: evidence-based changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA Cancer J Clin. 2017;67(6):472–492. doi:10.3322/caac.21409
  10. Conic RZ, Cabrera CI, Khorana AA, Gastman BR. Determination of the impact of melanoma surgical timing on survival using the National Cancer Database. J Am Acad Dermatol. 2018;78(1):40–46.e7. doi:10.1016/j.jaad.2017.08.039
  11. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022;8(32):eabq6147. doi:10.1126/sciadv.abq6147 (balancing)
  12. Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J Invest Dermatol. 2018;138(7):1529–1538. doi:10.1016/j.jid.2018.01.028 (balancing)
  13. Freeman K, Dinnes J, Chuchu N, et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. BMJ. 2020;368:m127. doi:10.1136/bmj.m127 (balancing)

Domain 2 — Severity scoring

  1. European Medicines Agency, CHMP. Guideline on clinical investigation of medicinal products indicated for the treatment of psoriasis. CHMP/EWP/2454/02 corr., adopted 2004.
  2. Schmitt J, Spuls PI, Thomas KS, et al.; HOME Initiative. The Harmonising Outcome Measures for Eczema (HOME) statement to assess clinical signs of atopic eczema in trials. J Allergy Clin Immunol. 2014;134(4):800–807. doi:10.1016/j.jaci.2014.07.043
  3. Simpson EL, Bieber T, Guttman-Yassky E, et al.; SOLO 1 and SOLO 2 Investigators. Two phase 3 trials of dupilumab versus placebo in atopic dermatitis. N Engl J Med. 2016;375(24):2335–2348. doi:10.1056/NEJMoa1610020
  4. King B, Ohyama M, Kwon O, et al. Two phase 3 trials of baricitinib for alopecia areata. N Engl J Med. 2022;386(18):1687–1699. doi:10.1056/NEJMoa2110343
  5. Olsen EA, Hordinsky MK, Price VH, et al. Alopecia areata investigational assessment guidelines — Part II. J Am Acad Dermatol. 2004;51(3):440–447. doi:10.1016/j.jaad.2003.09.032
  6. Mattei PL, Corey KC, Kimball AB. Psoriasis Area Severity Index (PASI) and the Dermatology Life Quality Index (DLQI): the correlation between disease severity and psychological burden in patients treated with biological therapies. J Eur Acad Dermatol Venereol. 2014;28(3):333–337. doi:10.1111/jdv.12106
  7. Mrowietz U, Kragballe K, Reich K, et al. Definition of treatment goals for moderate to severe psoriasis: a European consensus. Arch Dermatol Res. 2011;303(1):1–10. doi:10.1007/s00403-010-1080-1
  8. Fink C, Alt C, Uhlmann L, Klose C, Enk A, Haenssle HA. Intra- and interobserver variability of image-based PASI assessments in 120 patients suffering from plaque-type psoriasis. J Eur Acad Dermatol Venereol. 2018;32(8):1314–1319. doi:10.1111/jdv.14960
  9. Schaap MJ, Cardozo NJ, Patel A, de Jong EMGJ, van Ginneken B, Seyger MMB. Image-based automated Psoriasis Area Severity Index scoring by Convolutional Neural Networks. J Eur Acad Dermatol Venereol. 2022;36(1):68–75. doi:10.1111/jdv.17711
  10. Huang Y, Wei Q, Li Y, et al. Artificial Intelligence–Based Psoriasis Severity Assessment: Real-World Study With PASI as a Benchmark. JMIR Dermatol. 2023;6:e44932. doi:10.2196/44932

Domain 3 — Referral optimisation / care-pathway

  1. Eminović N, de Keizer NF, Wyatt JC, et al. Teledermatologic consultation and reduction in referrals to dermatologists: a cluster randomized controlled trial. Arch Dermatol. 2009;145(5):558–564. doi:10.1001/archdermatol.2009.44
  2. Whited JD, Warshaw EM, Kapur K, et al. Clinical course outcomes for store and forward teledermatology versus conventional consultation: a randomized trial. J Telemed Telecare. 2013;19(4):197–204. doi:10.1177/1357633X13487116
  3. Armstrong AW, Chambers CJ, Maverakis E, et al. Effectiveness of online vs in-person care for adults with psoriasis: a randomized clinical trial. JAMA Netw Open. 2018;1(6):e183062. doi:10.1001/jamanetworkopen.2018.3062
  4. Finnane A, Dallest K, Janda M, Soyer HP. Teledermatology for the diagnosis and management of skin cancer: a systematic review. JAMA Dermatol. 2017;153(3):319–327. doi:10.1001/jamadermatol.2016.4361
  5. Chuchu N, Dinnes J, Takwoingi Y, et al. Teledermatology for diagnosing skin cancer in adults. Cochrane Database Syst Rev. 2018;12(12):CD013193. doi:10.1002/14651858.CD013193
  6. Giavina-Bianchi M, Santos AP, Cordioli E. Teledermatology reduces dermatology referrals and improves access to specialists. eClinicalMedicine. 2020;29–30:100641. doi:10.1016/j.eclinm.2020.100641
  7. Moreno-Ramirez D, Ferrándiz L, Nieto-García A, et al. Store-and-forward teledermatology in skin cancer triage: experience and evaluation of 2,009 teleconsultations. Arch Dermatol. 2007;143(4):479–484. doi:10.1001/archderm.143.4.479
  8. Snoswell C, Finnane A, Janda M, Soyer HP, Whitty JA. Cost-effectiveness of store-and-forward teledermatology: a systematic review. JAMA Dermatol. 2016;152(6):702–708. doi:10.1001/jamadermatol.2016.0525
  9. Jain A, Way D, Gupta V, et al. Development and assessment of an artificial intelligence–based tool for skin condition diagnosis by primary care physicians and nurse practitioners in teledermatology practices. JAMA Netw Open. 2021;4(4):e217249. doi:10.1001/jamanetworkopen.2021.7249

Per-reference CRIT1–7 appraisal is maintained in `references/<domain>/<first-author-year-keyword>.md`. The rolling appraisal table is maintained in `appraisal-log.md`, and the propagation of this review into audit-visible documents is mapped in `integration-map.md`.
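The file-layout convention above can be made checkable. The sketch below is a minimal, hypothetical helper (not part of the QMS tooling) that derives the expected appraisal-file path for a reference and reports which appraisal files are missing; the domain folder names and slug format shown are assumptions about how `<domain>` and `<first-author-year-keyword>` are spelled out.

```python
from pathlib import Path

def appraisal_path(domain: str, first_author: str, year: int, keyword: str) -> Path:
    """Expected appraisal-file path under the assumed
    references/<domain>/<first-author-year-keyword>.md convention."""
    slug = f"{first_author.lower()}-{year}-{keyword.lower()}"
    return Path("references") / domain / f"{slug}.md"

def missing_appraisals(entries, root: Path = Path(".")) -> list[Path]:
    """Return expected appraisal paths that do not yet exist under `root`.

    `entries` is an iterable of (domain, first_author, year, keyword) tuples.
    """
    missing = []
    for domain, author, year, keyword in entries:
        p = appraisal_path(domain, author, year, keyword)
        if not (root / p).exists():
            missing.append(p)
    return missing
```

A check over Domain 1 might then pass tuples such as `("domain-1-diagnostic-accuracy", "Esteva", 2017, "skin-cancer")` and fail the build if any appraisal file is absent; the folder name `domain-1-diagnostic-accuracy` is illustrative only.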

All the information contained in this QMS is confidential. The recipient agrees not to transmit or reproduce the information, whether directly or through third parties, by any means, without the prior written permission of Legit.Health (AI Labs Group S.L.).