Gap Analysis — Answers and Remediation Log
Living document created 2026-04-10. Tracks the answer, key findings, and completion status for every task defined in research.mdx. Update this document as each task is executed. The information recorded here is the source of truth used to make the final edits to the CER, SotA, and CEP. Not included in the BSI response.
Task tracker
| ID | Action | Type | Priority | Status |
|---|---|---|---|---|
| T1 | Fix melanoma criterion inconsistency in CER (line 818 vs. derivation table) | CER edit | P1 | ⬜ Ready to edit |
| T2 | Formally declare Fitzpatrick V–VI as acceptable gap per §6.5(e) in CER | CER edit | P1 | ⬜ Ready to edit |
| T3 | Strengthen alopecia dermatologist sub-criteria justification in CER | CER edit | P1 | ⬜ Ready to edit |
| T4 | Literature search A1: BCC/cSCC AI in non-specialist settings | Search | P2 | ✅ Done |
| T5 | Literature search A2: IHS4 AI independent validation | Search | P2 | ✅ Done |
| T6 | Literature search A3: Teledermatology utility scale benchmarks | Search | P2 | ✅ Done |
| T7 | Re-read existing SotA high-weight articles for underused data | Literature review | P2 | ✅ Done |
| T8 | Literature search B1: Fitzpatrick V–VI AI dermatology | Search | P3 | ✅ Done |
| T9 | Literature search B2: Pediatric AI dermatology | Search | P3 | ✅ Done |
| T10 | Literature search B3: Severity Pillar 3 real-world clinical studies | Search | P3 | ✅ Done |
| T11 | Literature search C1: Autoimmune skin disease AI detection | Search | P4 | ✅ Done |
| T12 | Literature search C2: UAS inter-rater benchmarks | Search | P4 | ✅ Done |
T1: Fix melanoma criterion inconsistency
Status: ⬜ Ready to edit
Resolution: See research.mdx § T1 for the full data clarification and three-step resolution path.
Edit instructions — CER line 818
Find the row that currently states "Met: AUC ≥ 0.80 for melanoma detection achieved" (or equivalent phrasing referencing the 0.80 study-internal threshold).
Replace with prose that:
- States the device-level acceptance criterion for melanoma detection is AUC ≥ 0.85, as specified in the derivation table (line 2008).
- Identifies MC_EVCDAO_2019 as the only melanoma-specific clinical study in the evidence base; its achieved global melanoma AUC is 0.85 (95% CI 0.7629–0.9222), which constitutes the device AUC for this indication.
- Notes that the MC_EVCDAO_2019 study-internal pass/fail threshold was ≥ 0.80 (a study design criterion); the device-level criterion in the derivation table is ≥ 0.85 (the study's achieved result used as the device benchmark).
- States the SotA benchmark for melanoma detection (per published literature) is AUC ≥ 0.81; the device AUC of 0.85 exceeds this.
- Cross-references the aggregate malignancy criterion: AUC ≥ 0.90 under 7GH sub-criterion (c), met at 91.99% pooled across all malignancy studies. Clarifies that the IDEI study AUC of 0.97 is one contributor to this aggregate, not the global melanoma figure.
Verify after edit: No contradiction between line 818 and the derivation table at line 2008; the two AUC figures (0.85 melanoma-specific; 91.99% aggregate malignancy) are clearly distinguished and each references its own criterion.
Answer
Record the exact lines changed and the final wording once the edit is executed.
T2: Formally declare Fitzpatrick V–VI as acceptable gap
Status: ⬜ Ready to edit
Resolution: Hybrid approach (Option A + B simultaneously). T8 is complete; evidence is mixed. See research.mdx § T2 for the confirmed final approach and answers.mdx § T8 for the full evidence summary.
Decision summary
Option A evidence (cite external studies showing adequate V–VI performance):
- Walker 2025: AUC 0.856 (Fitzpatrick IV–VI) vs 0.858 (I–III), p = NS — no statistically significant difference in skin cancer detection
- Dulmage 2021: accuracy 68% (IV–VI) vs 70% (I–III), p = 0.79 NS — no significant difference for wide-range skin disease diagnosis
- Tepedino 2024: device specificity 69.1% in Fitzpatrick IV–VI vs 53.2% in I–III — device achieves higher specificity in darker phototypes for NMSC
Option B evidence (§6.5(e) acceptable gap declaration):
- Liu 2023 systematic review: field-wide insufficient evidence for Fitzpatrick V–VI — the SotA itself lacks adequate representation
- Tjiu 2025 meta-analysis: AUROC 0.82 (IV–VI) vs 0.89 (I–III) — persistent 7-point gap across published AI dermatology literature; the device's limitation mirrors the field
- ASCORAD_2022: device internally tested on 112 Fitzpatrick IV–VI images
- ViT architecture: assesses relative lesion intensity, not absolute pixel values — architecturally less susceptible to phototype variation than pixel-classification approaches
- PMCF monitoring commitment already planned for phototype performance stratification
Edit instructions — CER §6.5(e) section (around line 1951)
1. Add a Fitzpatrick V–VI §6.5(e) acceptable gap declaration, structured identically to the existing autoimmune and genodermatoses gap declarations:
   - Cite the Spain deployment context (low V–VI prevalence → inherent under-recruitment)
   - Cite ASCORAD_2022 internal testing (112 images, Fitzpatrick IV–VI)
   - Cite the ViT relative-intensity architecture
   - Cite Liu 2023 and Tjiu 2025 to show the gap is field-wide (not device-specific)
   - Confirm the PMCF phototype monitoring commitment
2. In the same §6.5(e) section, add a positive evidence paragraph (Option A) citing Walker 2025, Dulmage 2021, and Tepedino 2024 as external studies demonstrating that well-designed AI tools can achieve comparable or better performance in Fitzpatrick IV–VI.
3. In the "Need for more clinical evidence" / PMCF section, reference the phototype monitoring commitment and the field-wide gap (Liu 2023, Tjiu 2025) as the rationale.
Answer
Record the specific CER lines changed and the final prose once the edit is executed.
T3: Strengthen alopecia dermatologist sub-criteria justification
Status: ⬜ Ready to edit
Resolution: See research.mdx § T3 for the three-point justification strategy and data sources.
Context
CER lines 1833–1834 show two sub-criteria marked ❌ for the dermatologist cohort subset only:
| Sub-criterion | Threshold | Result |
|---|---|---|
| Correlation [Dermatologists] | ≥ 0.5 | 0.47 |
| Kappa [Dermatologists] | ≥ 0.6 | 0.3297 |
The all-HCP pooled primary endpoint is met: correlation 0.77 (≥ 0.5) and Kappa 0.74 (≥ 0.6).
Edit instructions — CER lines 1833–1834
Add or strengthen the explanatory note at or immediately after these lines with the following three-point argument:
1. Primary endpoint clarification: The pre-specified primary endpoint was the all-HCP pooled analysis. Both pooled criteria are met (correlation 0.77 ≥ 0.5; Kappa 0.74 ≥ 0.6). The per-HCP-tier sub-analysis (dermatologist vs. GP vs. nurse) was exploratory, not powered as a primary outcome, and was not pre-specified as a pass/fail criterion.
2. Range restriction artefact: The IDEI_2023 dermatologist subset assessed a private clinic population enriched for moderate-to-severe alopecia (severity distribution skew documented in the IDEI_2023 CIR). This restricted the range of severity scores within that stratum. Range restriction is a well-documented methodological artefact that deflates correlation and agreement coefficients even when the underlying scale is valid — consistent with Cohen 1960 and Landis & Koch 1977. The low Kappa in the dermatologist subset reflects this distributional constraint, not a failure of the severity measurement scale.
3. Interpretation: The ❌ for these two sub-group metrics does not constitute a primary endpoint failure. It is a statistical consequence of restricted variance in a single stratum of an exploratory sub-analysis, and should be interpreted in that context.
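The range-restriction argument above can be illustrated with synthetic data (all numbers below are simulated, not IDEI_2023 data): restricting a noisy severity scale to a narrow stratum deflates the correlation coefficient even though the underlying relationship is unchanged.

```python
import random

# Illustration of the range-restriction artefact: the same noisy
# linear rater relationship yields a lower Pearson r when scores are
# restricted to a narrow severity band (synthetic data only).
random.seed(0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

truth = [random.uniform(0, 10) for _ in range(2000)]      # full severity range
rated = [t + random.gauss(0, 1.5) for t in truth]          # noisy second rater
full_r = pearson(truth, rated)

# Keep only the severe stratum (scores 6-10), mimicking an enriched clinic:
pairs = [(t, r) for t, r in zip(truth, rated) if 6 <= t <= 10]
restricted_r = pearson(*zip(*pairs))

print(full_r > restricted_r)   # True: restriction deflates the coefficient
```

The inflation-free full-range coefficient and the deflated stratum coefficient come from identical raters, which is the point being made about the dermatologist subset.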
Answer
Record the exact lines changed and the final wording once the edit is executed.
T4: Literature search A1 — BCC/cSCC AI in non-specialist settings
Status: ✅ Done — full-text CRIT1-7 scoring complete
Search executed: 2026-04-10. PubMed, 15 results. Filters: Free full text, full text, English, Humans. 4 papers passed initial eligibility screening and proceeded to full-text CRIT1-7 scoring. All 4 score ≥ 4 and are included.
Eligibility screening (15 results)
| # | Reference | Setting | BCC result | cSCC result | Eligible? | Notes |
|---|---|---|---|---|---|---|
| 1 | Jones et al. 2022 — Lancet Digit Health | Community + primary care (systematic review, 272 studies) | Mean accuracy 87.6% (range 70.0–99.7%) | Mean accuracy 85.3% (range 71.0–97.8%) | ✅ Yes | |
| 2 | Chuchu et al. 2018 — Cochrane | Community (smartphone apps) | Not reported | Not reported | ❌ No | Melanoma-only outcome |
| 3 | Barata et al. 2023 — Nat Med | Specialist (dermatologist decision support) | Sensitivity 87.1% (RL model) | Not reported | ❌ No | Specialist setting; moved to T7 |
| 4 | Ferrante di Ruffano et al. 2018 — Cochrane | CAD systems, specialist + primary care | Insufficient data for summary | Insufficient data | ⚠️ Gap context only | Cochrane concludes BCC/cSCC data too limited; useful as SotA gap evidence |
| 5 | Wang et al. 2020 — Chin Med J | Specialist (tertiary hospital, dermoscopy) | CNN sensitivity 0.800, specificity 1.000 | Not reported | ❌ No | Specialist setting |
| 6 | Ilhan et al. 2020 — J Dent Res | — | — | — | ❌ No | Oral cancer — wrong anatomical site |
| 7 | Climstein et al. 2024 — PeerJ | General practice | — | — | ❌ No | Patient self-identification; no AI accuracy metrics |
| 8 | Jiang et al. 2020 — Br J Dermatol | Pathology lab (histopathology slides) | AUC 0.95–0.987 | Not reported | ❌ No | Histopathology reading tool, not clinical detection |
| 9 | Jaklitsch et al. 2023 — J Prim Care Community Health | Primary care (57 PCPs) | 9 BCC cases; device 100% sensitivity | 9 SCC cases; device 88.9% sensitivity | ✅ Yes | |
| 10 | Dascalu et al. 2022 — J Cancer Res Clin Oncol | Specialist clinic; smartphone arm as telemedicine proxy | AUC 0.821 for NMSC (smartphone) | AUC 0.821 for NMSC (smartphone) | ⚠️ Borderline | Specialist-prevalence population; kept as gap context |
| 11 | Kut et al. 2023 — JCO Clin Cancer Inform | — | — | — | ❌ No | Head and neck lymphopenia — unrelated |
| 12 | El Mertahi et al. 2025 — PLoS One | No clinical setting; public dataset | — | — | ❌ No | Algorithm development only |
| 13 | Ferris et al. 2025 — J Prim Care Community Health | Primary care (108 PCPs; FDA Pivotal) | BCC 40% of malignant cases | SCC 36% of malignant cases | ✅ Yes | |
| 14 | Tariq et al. 2025 — SLAS Technol | No clinical setting; public datasets | — | — | ❌ No | Algorithm development only |
| 15 | Walton et al. 2026 — Health Technol Assess | Primary care referral pathway (HTA meta-analysis) | BCC in cost model; ~1% missed malignancies mostly BCC | SCC: similar accuracy to overall | ✅ Yes |
CRIT1-7 scoring — included papers
Scoring key: CRIT1–3 score relevance (0–2 each, max 6); CRIT4–7 score quality (0–1 each, max 4); total max 10. Include if ≥ 4.
| Criterion | Jones 2022 | Jaklitsch 2023 | Ferris 2025 | Walton 2026 |
|---|---|---|---|---|
| CRIT1 (study focus — similar device or clinical practice benchmark) | 2 | 2 | 2 | 2 |
| CRIT2 (clinical setting — primary care / dermatology, device supporting HCPs in skin assessment) | 2 | 2 | 2 | 2 |
| CRIT3 (population — target population representativeness) | 1 | 1 | 1 | 1 |
| Relevance subtotal | 5/6 | 5/6 | 5/6 | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 | 1 | 1 | 1 |
| CRIT5 (outcome measurement — quantitative accuracy or safety data) | 1 | 1 | 1 | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 0 | 1 | 1 | 1 |
| CRIT7 (statistical analysis — comparisons, p-values, CIs) | 1 | 1 | 1 | 1 |
| Quality subtotal | 3/4 | 4/4 | 4/4 | 4/4 |
| Total weight | 8/10 | 9/10 | 9/10 | 9/10 |
| Include? | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) |
CRIT3 note for all four papers: All score 1 (not 2) because all studies describe enriched or referred sub-populations — not a true unselected primary care population. Jones 2022 includes only 2 of 272 studies from non-referred populations; Jaklitsch 2023 and Ferris 2025 use a 50% malignant prevalence (vs. ~2–14% in true primary care); Walton 2026 covers patients already referred on an urgent cancer pathway. This limits generalisability to unselected primary care but does not disqualify inclusion; it must be noted in the CER.
CRIT6 note for Jones 2022: Scores 0 because the systematic review explicitly states "We did not identify any health economic, patient, or clinician acceptability data for any of the included studies." The paper reports diagnostic accuracy benchmarks only.
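The CRIT1-7 subtotals are simple arithmetic but easy to mis-transcribe into the CER; a minimal sketch of the weighting rule (the `crit_weight` helper is hypothetical, not part of any existing tooling):

```python
# Sketch of the CRIT1-7 weighting rule used in the scoring tables.
# CRIT1-3 are relevance criteria scored 0-2 (max 6); CRIT4-7 are
# quality criteria scored 0-1 (max 4); a paper is included if the
# total weight is >= 4.

def crit_weight(c1, c2, c3, c4, c5, c6, c7):
    assert all(0 <= c <= 2 for c in (c1, c2, c3)), "relevance criteria are scored 0-2"
    assert all(c in (0, 1) for c in (c4, c5, c6, c7)), "quality criteria are scored 0-1"
    relevance = c1 + c2 + c3        # max 6
    quality = c4 + c5 + c6 + c7     # max 4
    total = relevance + quality     # max 10
    return relevance, quality, total, total >= 4

# Jones 2022 as scored above: CRIT1-3 = 2, 2, 1; CRIT4-7 = 1, 1, 0, 1
print(crit_weight(2, 2, 1, 1, 1, 0, 1))   # (5, 3, 8, True)
```

The same rule reproduces the 9/10 totals for Jaklitsch 2023, Ferris 2025, and Walton 2026 (2, 2, 1 relevance; 1, 1, 1, 1 quality).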
Key data extracted
Jones et al. 2022 — Lancet Digit Health (8/10)
Study design: Systematic review (272 studies); MEDLINE, Embase, Scopus, Web of Science (2000–Aug 2021); PRISMA/PROSPERO registered (CRD42020176674); QUADAS-2 appraisal.
Key finding: Only 2 of 272 studies used data from non-referred/low-prevalence populations. The accuracy figures below reflect predominantly specialist/high-prevalence settings and must be treated as the upper bound of the SotA benchmark.
BCC diagnostic accuracy (29 studies, 2012–2020):
- Mean sensitivity: 0.837 (95% CI 0.792–0.883)
- Mean specificity: 0.887 (95% CI 0.783–0.990)
- Mean AUC: 0.923 (95% CI 0.879–0.967); range 0.76–0.99
- Mean accuracy: 87.6% (95% CI 80.7–94.6%); range 70.0–99.7%
SCC diagnostic accuracy (10 studies, 2015–2020):
- Mean sensitivity: 0.603 (95% CI 0.396–0.810) — notably lower than BCC
- Mean specificity: 0.933 (95% CI 0.865–1.000)
- Mean AUC: 0.875 (95% CI 0.777–0.973); range 0.730–0.958
- Mean accuracy: 85.3% (95% CI 77.3–93.3%); range 71.0–97.8%
Reference standard: Not individually specified (systematic review of all study types); histological confirmation required for primary research inclusion.
Limitations: Predominantly specialist/curated datasets; few primary care-validated studies; high heterogeneity across included studies; no cost or acceptability data identified.
Jaklitsch et al. 2023 — J Prim Care Community Health (9/10)
Study design: Prospective clinical reader study; 57 board-certified PCPs; 50 clinical lesion cases (25 malignant, 25 benign); within-subject before-after design (without then with device output); US primary care.
Lesion composition: BCC n=9 (18%), SCC n=9 (18%), melanoma n=4 (8%), severely atypical nevi n=3 (6%); benign n=25 (seborrheic keratosis n=10, etc.). 76% biopsied and histologically confirmed; 24% unbiopsied benign diagnosed by dermatologists.
Device: DermaSensor (ESS + CNN); FDA-cleared 2024 for non-dermatology physicians. Algorithm trained on >20,000 spectral recordings from >4,500 lesions.
Key results — PCPs without vs. with device:
- Diagnostic sensitivity: 67% (95% CI 62–72%) → 88% (84–92%), p < 0.0001
- Diagnostic specificity: 53% (49–57%) → 40% (37–44%), p = 0.052 (NS)
- Management sensitivity: 81% (77–85%) → 94% (91–96%), p = 0.0009
- AUC: 0.619 → 0.683, p < 0.001
- Device standalone: sensitivity 96%, specificity 36%
- BCC-specific device sensitivity: 100% (9/9)
- SCC-specific device sensitivity: 88.9% (8/9)
Reference standard: Histopathology for 76% of lesions; dermatologist diagnosis for unbiopsied benign.
Limitations: 50:50 malignant:benign ratio (not representative of primary care ~2–5% prevalence); PCPs self-selected for interest in skin cancer; reader study design (no in-vivo tactile evaluation); no per-lesion-type breakdown of PCP sensitivity (only device standalone).
Ferris et al. 2025 — J Prim Care Community Health (9/10) — DERM-SUCCESS FDA Pivotal Study
Study design: Multi-reader multi-case (MRMC) clinical utility study; 108 board-certified PCPs (52 internal medicine, 56 family medicine); 100 skin lesion cases (50 malignant, 50 benign); FDA pivotal study; IRB-approved; US primary care.
Lesion composition (malignant): BCC n=10 (40%), SCC n=9 (36%), melanoma n=4 (16%), severely dysplastic nevi n=2 (8%). All lesions biopsied and confirmed by 2–5 dermatopathologists. Enrolled from 22 primary care sites in the DERM-SUCCESS clinical study (1579 lesions, 1005 patients; malignant prevalence 14.2%).
Key results — PCPs without vs. with device:
- Diagnostic sensitivity: 71.1% (95% CI 63.4–78.8%) → 81.7% (72.4–90.9%), p = 0.0085
- Diagnostic specificity: 60.9% (52.5–69.3%) → 54.7% (42.3–67.1%), p = 0.19 (NS)
- Management (referral) sensitivity: 82.0% (76.4–87.6%) → 91.4% (85.7–97.1%), p = 0.0027
- Management specificity: 44.2% (36.0–52.4%) → 32.4% (20.7–44.1%), p = 0.026
- AUC: 0.708 → 0.762 (overall); 0.567 → 0.682 (low-confidence decisions)
- Device standalone (clinical study): sensitivity 95.5%, specificity not stated directly
- Net impact: 2.9× ratio of increased detection (382 correctly changed vs. 130 negatively changed)
- False negative rate: 18.0% → 8.6% (halved)
Reference standard: Histopathology (2–5 dermatopathologists per lesion).
Limitations: 50% malignant prevalence (clinical study was 14.2%; general primary care is <5%); patients 100% White, Fitzpatrick II–III (no darker-phototype data); limited to lesions previously biopsied in a clinical study (reader study design).
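The Ferris 2025 net-impact bullets can be sanity-checked with one line of arithmetic each (counts taken verbatim from the bullets above; this is a transcription check, not the study's own analysis):

```python
# Transcription check on the Ferris 2025 net-impact figures.
improved, worsened = 382, 130            # decisions correctly vs. negatively changed
print(round(improved / worsened, 1))      # 2.9 (matches the reported 2.9x ratio)

fn_before, fn_after = 18.0, 8.6           # false negative rate, %
print(round(fn_before / fn_after, 2))     # 2.09 (i.e. roughly halved)
```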
Walton et al. 2026 — Health Technol Assess (9/10) — DERM HTA (NICE)
Study design: Rapid systematic review with meta-analysis; PROSPERO registered (CRD42023475705); PRISMA/PRISMA-DTA reporting; QUADAS-2 and QUADAS-C quality assessment; commissioned by NICE (NIHR award NIHR136014); Centre for Reviews and Dissemination, University of York.
Technology assessed: DERM (Skin Analytics) — deep ensemble for recognition of malignancy, used post-primary-care referral on the urgent suspected skin cancer pathway (teledermatology context). 4 prospective UK studies included in meta-analysis.
Key results — DERM diagnostic accuracy (meta-analysis):
- Any malignant lesion: sensitivity 96.1% (95% CI 95.4–96.8%), specificity 65.4% (95% CI 64.7–66.1%)
- Melanoma/SCC-specific: "similar" to overall malignancy detection (stated but not broken out numerically in the public version)
- Benign lesion detection: sensitivity 71.5% (95% CI 70.7–72.3%), specificity 86.2% (95% CI 85.4–87.0%)
- Clinical impact: autonomous use of DERM would discharge ~50% of patients; ~1% discharged with malignant lesions (mostly BCCs)
Reference standard: Histological confirmation preferred; non-malignancy confirmed by specialist dermatologist for unbiopsied benign lesions.
Limitations: Rapid review (some relevant material may have been missed); all 4 DERM studies excluded substantial proportions of participants → potential bias; evidence applies to UK NHS post-referral pathway, not direct primary care use; BCC-specific sensitivity not reported separately; Moleanalyzer Pro evidence limited to melanoma only.
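The ~50% discharge and ~1% missed-malignancy figures follow from the meta-analysis sensitivity/specificity only once a pathway malignancy prevalence is assumed; the 25% used below is an illustrative assumption, not a figure taken from the HTA report:

```python
# Back-of-envelope check of the DERM autonomous-use figures, assuming
# an ILLUSTRATIVE malignancy prevalence on the urgent referral pathway
# (prev = 0.25 is an assumption, not a number from the HTA report).
sens, spec, prev = 0.961, 0.654, 0.25

discharged = (1 - prev) * spec + prev * (1 - sens)   # fraction of patients sent home
missed = prev * (1 - sens)                            # malignant and discharged

print(round(discharged, 2))   # 0.5  (about half of patients discharged)
print(round(missed, 3))       # 0.01 (about 1% of patients discharged with malignancy)
```

Under this assumed prevalence the arithmetic reproduces both headline claims, which supports the internal consistency of the quoted meta-analysis figures.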
Answer — use in CER and SotA
All four papers score ≥ 4 on CRIT1-7 and are included.
SotA addition: Add all four papers to the SotA NMSC malignancy section as SotA benchmarks for AI performance in non-specialist and referral-pathway settings. Assign weights per the scoring table above (8/10 and 9/10).
CER acceptance criteria derivation: The four papers together establish the following SotA benchmarks for BCC/cSCC AI detection in non-specialist or primary-care-relevant settings:
- BCC: Mean sensitivity 83.7%, mean AUC 92.3% (Jones 2022, specialist-enriched); DERM meta-analysis sensitivity 96.1% for any malignancy including BCC (Walton 2026, post-referral pathway). In a primary care reader study, AI-aided PCPs achieved 81.7–88% sensitivity for mixed skin cancer including BCC (Ferris 2025, Jaklitsch 2023). Device standalone BCC sensitivity: 100% (Jaklitsch 2023, 9 cases).
- SCC: Mean sensitivity 60.3%, mean AUC 87.5% (Jones 2022; lower sensitivity reflects fewer SCC training data historically). Device standalone SCC sensitivity: 88.9% (Jaklitsch 2023, 9 cases). DERM: melanoma/SCC-specific accuracy "similar" to 96.1% overall (Walton 2026).
- Non-specialist gap: Jones 2022 explicitly confirms only 2 of 272 studies used non-referred/low-prevalence populations — the absence of primary-care-validated BCC/SCC benchmarks is itself SotA evidence. DERM (Walton 2026) represents the most validated post-referral non-specialist benchmark.
NMSC_2025 contextualisation: The four papers contextualise NMSC_2025's specialist-setting result (80% malignancy prevalence, H&N surgery clinic) by showing that in primary care settings the AI-aided sensitivity for skin cancer (including BCC/SCC) ranges from 81.7% to 88%, and dedicated AI tools achieve 95–96% sensitivity. NMSC_2025's specialist result is consistent with and supported by this SotA body of evidence.
Important caveat for CER: All four papers describe enriched or post-referral populations (not unselected primary care). BCC/cSCC benchmarks must be presented as "performance in settings with elevated malignancy prevalence" rather than as general-practice figures.
T5: Literature search A2 — IHS4 AI independent validation
Status: ✅ Done — two searches performed; 1 paper included
Purpose: Corroborate the barely-met ICC criterion (0.727 vs. ≥ 0.70) with external independent evidence.
Searches executed: 2026-04-10.
Search 1 (narrow, with AI keywords): PubMed. String: ("hidradenitis suppurativa" OR "acne inversa") AND ("IHS4" OR "International Hidradenitis Suppurativa Severity Score" OR "severity score") AND ("artificial intelligence" OR "deep learning" OR "machine learning" OR "automatic" OR "automated" OR "computer vision"). Period: 2022–2025. Filters: Free full text, full text, English, Humans. 1 result: Wiala et al. 2024 — screened and included.
Search 2 (broader, without AI keywords): PubMed. String: ("hidradenitis suppurativa" OR "acne inversa") AND ("IHS4" OR "International Hidradenitis Suppurativa Severity Score" OR "severity score"). Period: 2020–2026. Filters: Free full text, full text, English, Humans. 54 results: Only paper #13 concerned automated AI IHS4 scoring — Hernández Montilla et al. 2023 ("Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4)", Skin Res Technol, PMID 37357665), which is the device's own primary clinical study. No new qualifying papers identified.
Eligibility screening
| # | Search | Reference | AI/automated IHS4? | ICC or equivalent? | Eligible? | Notes |
|---|---|---|---|---|---|---|
| 1 | Narrow | Wiala et al. 2024 — J Eur Acad Dermatol Venereol | ✅ Yes | AUC 0.84–0.89; NRMSE 0.262 (no ICC directly) | ✅ Yes | Only independent external AI/automated IHS4 paper identified |
| 13 | Broader | Hernández Montilla et al. 2023 (AIHS4) — Skin Res Technol | ✅ Yes | ICC 0.727 (2 patients) | ❌ Device's own study | Already incorporated as AIHS4_2023; not independent evidence |
| 2–12, 14–54 | Broader | Clinical trials, treatment guidelines, real-world studies using IHS4 as outcome | ❌ No | Not applicable | ❌ No | IHS4 used as clinical outcome; not AI/automated scoring validation |
CRIT1-7 scoring — Wiala et al. 2024
Scoring key: CRIT1–3 score relevance (0–2 each, max 6); CRIT4–7 score quality (0–1 each, max 4); total max 10. Include if ≥ 4.
| Criterion | Wiala et al. 2024 |
|---|---|
| CRIT1 (study focus — similar device or clinical practice benchmark) | 2 |
| CRIT2 (clinical setting — dermatology, device supporting HCPs in skin assessment) | 2 |
| CRIT3 (population — target population representativeness) | 1 |
| Relevance subtotal | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 |
| CRIT5 (outcome measurement — quantitative accuracy data) | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 1 |
| CRIT7 (statistical analysis — comparisons, p-values) | 1 |
| Quality subtotal | 4/4 |
| Total weight | 9/10 |
| Include? | ✅ Yes (≥ 4) |
CRIT3 note: Scores 1 because the study used referral-only patients at a specialized outpatient clinic (HS Clinic Landstrasse, Vienna). Population is enriched toward moderate-to-severe disease (12% Hurley I, 48% Hurley II, 40% Hurley III) and predominantly Fitzpatrick I–II (77%), with only 1% Fitzpatrick V. The paper explicitly acknowledges "referral-only patient selection in this specialized outpatient service" as a limitation.
ICC note: The paper does NOT directly report an ICC for automated vs. expert IHS4 agreement. Performance is reported as AUC (0.84–0.89 for 4-class classification), binary classification AUC (0.85), and disease dynamics NRMSE (0.262). However, the paper's Discussion explicitly cites published human expert IHS4 inter-rater reliability: "An observational study found coefficients of 0.68–0.78 (inter-rater) and 0.70–0.78 (intra-rater) for abscess and fistula counts. Another study demonstrated only fair inter-rater reliability for the IHS4 when assessed by even experienced HS experts." This benchmark range (ICC 0.68–0.78 for expert-vs-expert) contextualises the device's AIHS4_2023 ICC of 0.727.
Key data extracted — Wiala et al. 2024
Full citation: A. Wiala, R. Ranjan, H. Schnidar, K. Rappersberger, C. Posch. "Automated classification of hidradenitis suppurativa disease severity by convolutional neural network analyses using calibrated clinical images." J Eur Acad Dermatol Venereol. 2024;38:576–582. DOI: 10.1111/jdv.19639.
Study design: Prospective single-centre proof-of-concept study; ethics committee approved (EK18-100-0618); HS outpatient clinic (Clinic Landstrasse, Vienna); recruited May 2017 – January 2020.
Population: 149 patients (55% male, 45% female); mean age 65.9 ± 12.6 years; Hurley I 12%, Hurley II 48%, Hurley III 40%; Fitzpatrick I–II 77%, III 17%, IV 5%, V 1%.
Images: 777 calibrated clinical images acquired with commercial smartphones using Scarletred® Vision V3.4 (CE class 1 medical device software); standardized skin patch for colour calibration; images assigned IHS4 scores by one expert dermatologist. 276 images excluded (tattoos, non-HS conditions, postoperative wounds).
Model: CNN based on CIELAB colour-space (L*, +a*, +b*) and standardized erythema value (SEV*). Data augmentation to class-balanced synthetic dataset of 7,675 images. Train/validation/test split 80%/15%/20%. UNET algorithm for lesion segmentation.
Key results:
- Binary classification (clear/mild vs. moderate/severe): overall test accuracy 78%; AUC 0.85
- 4-class IHS4 classification (0/mild/moderate/severe): overall accuracy 72%; AUC by class: clear 0.89, mild 0.84, moderate 0.85, severe 0.88
- Disease dynamics (mixed-input CNN, 5 patients with follow-up): NRMSE 0.262 (NRMSE < 1 indicates good model performance)
- Lesion segmentation (UNET): pixel accuracy 88.1%, test loss 0.42
- Kruskal–Wallis: SEV*_mean and +a*_mean most discriminative (p < 0.001)
Expert IHS4 inter-rater benchmark cited in paper: ICC 0.68–0.78 (inter-rater) and 0.70–0.78 (intra-rater) for expert HS clinicians (referenced as Thorlacius et al., cited as ref 16/17 in the paper). "Another study demonstrated only fair inter-rater reliability for the IHS4 when assessed by even experienced HS experts."
Limitations identified by authors: Difficulties with tattooed and hairy skin; limited applicability for Fitzpatrick V–VI (model based on measuring shades of red); dataset imbalanced toward moderate/severe disease (referral-only setting); single body area assessed per image; disease dynamics assessed in only 5 patients.
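The NRMSE figure quoted above (0.262) is only interpretable given a normalisation convention, which this summary does not state; the sketch below uses range normalisation as one common choice (an assumption, not necessarily Wiala et al.'s definition):

```python
import math

# RMSE normalised by the range of the observed values: one common
# NRMSE convention (the normaliser Wiala et al. actually used is not
# stated in this summary). NRMSE < 1 means the typical prediction
# error is smaller than the spread of the reference scores.
def nrmse(predicted, observed):
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))
    return rmse / (max(observed) - min(observed))

# Toy severity trajectories (illustrative numbers only):
print(round(nrmse([4, 9, 14], [5, 10, 16]), 3))   # 0.129
```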
Answer — use in CER and SotA
Key conclusion: Wiala et al. 2024 is the ONLY external independent peer-reviewed paper validating AI/automated HS severity classification using IHS4 as reference. It does not directly report ICC but establishes SotA context.
Indirect ICC contextualisation: The paper cites human expert IHS4 inter-rater ICC of 0.68–0.78. The device's AIHS4_2023 achieved ICC 0.727 — which sits within this human expert inter-rater range. This supports the argument that the device's barely-met ICC criterion (0.727 vs. ≥ 0.70) represents performance consistent with expert human rater agreement, not an outlier finding.
SotA addition: Add Wiala et al. 2024 to the SotA severity section as SotA evidence for AI-based HS severity classification. Note that it demonstrates AI/CNN feasibility for automated IHS4-equivalent scoring (AUC 0.84–0.89) and that it cites human expert inter-rater ICC range of 0.68–0.78 as the benchmark the device must meet.
CER IHS4 justification: Cite Wiala 2024 in the IHS4 ICC acceptance criterion section to:
- Confirm the SotA for AI-based HS scoring (AUC 0.84–0.89 for automated IHS4 classification)
- Provide the human expert inter-rater ICC range (0.68–0.78) as contextual benchmark, supporting that the device's ICC 0.727 is within the published human expert performance band
- Note the proof-of-concept stage of the external literature — the device's AIHS4_2023 study is more clinically validated than Wiala 2024 (which is a single-centre proof-of-concept with a synthetic dataset)
T6: Literature search A3 — Teledermatology utility scale benchmarks
Status: ✅ Done — 12 results screened; 3 papers included as contextual SotA evidence
Purpose: Anchor the COVIDX_EVCDAO_2022 acceptance criterion (Clinical Utility Score ≥ 8) with published literature establishing this threshold.
Search executed: 2026-04-10. PubMed, 12 results. Filters: Free full text, full text, English, Humans.
Note on search outcome: No paper in this search directly validates a "Clinical Utility Score ≥ 8" threshold for a teledermatology tool. The ≥ 8 criterion used in COVIDX_EVCDAO_2022 appears to derive from the study-specific questionnaire design (likely a 0–10 Likert scale). However, three papers use validated usability/satisfaction scales in teledermatology or digital dermatology contexts and provide SotA benchmarks showing that well-accepted digital health tools in dermatology consistently achieve utility/satisfaction scores equivalent to ≥ 7–9 on a 10-point scale. These benchmarks contextualise the ≥ 8 criterion as appropriate.
Eligibility screening (12 results)
| # | Reference | Scale used | Score | Threshold exists? | Eligible? | Notes |
|---|---|---|---|---|---|---|
| 1 | Reinders et al. 2025 — JMIR Hum Factors | Likert 1–5 (DHI acceptance) | 3 clusters | ❌ No | ❌ No | Attitude survey; no utility scale with thresholds |
| 2 | Yadav et al. 2022 — Indian J Dermatol Venereol Leprol | TSQ (5-point Likert) | Mean 4.20/5 | ❌ No published threshold | ❌ No | Patient satisfaction (not HCP clinical utility); no HCP-facing threshold |
| 3 | Roca et al. 2022 — Int J Environ Res Public Health | SUS (System Usability Scale, 0–100) | 70.1 | ✅ ≥ 68 = above average (published) | ✅ Yes | Teledermatology virtual assistant for psoriasis; SUS thresholds established |
| 4 | Dege et al. 2024 — JMIR Mhealth Uhealth | SUS + MARS | SUS 50.75–80.5 | ✅ SUS thresholds |