Gap Analysis — Answers and Remediation Log

Internal working document

Living document created 2026-04-10. Tracks the answer, key findings, and completion status for every task defined in research.mdx. Update this document as each task is executed. The information recorded here is the source of truth used to make the final edits to the CER, SotA, and CEP. Not included in the BSI response.

Task tracker​

| ID | Action | Type | Priority | Status |
| --- | --- | --- | --- | --- |
| T1 | Fix melanoma criterion inconsistency in CER (line 818 vs. derivation table) | CER edit | P1 | ⬜ Ready to edit |
| T2 | Formally declare Fitzpatrick V–VI as acceptable gap per §6.5(e) in CER | CER edit | P1 | ⬜ Ready to edit |
| T3 | Strengthen alopecia dermatologist sub-criteria justification in CER | CER edit | P1 | ⬜ Ready to edit |
| T4 | Literature search A1: BCC/cSCC AI in non-specialist settings | Search | P2 | ✅ Done |
| T5 | Literature search A2: IHS4 AI independent validation | Search | P2 | ✅ Done |
| T6 | Literature search A3: Teledermatology utility scale benchmarks | Search | P2 | ✅ Done |
| T7 | Re-read existing SotA high-weight articles for underused data | Literature review | P2 | ✅ Done |
| T8 | Literature search B1: Fitzpatrick V–VI AI dermatology | Search | P3 | ✅ Done |
| T9 | Literature search B2: Pediatric AI dermatology | Search | P3 | ✅ Done |
| T10 | Literature search B3: Severity Pillar 3 real-world clinical studies | Search | P3 | ✅ Done |
| T11 | Literature search C1: Autoimmune skin disease AI detection | Search | P4 | ✅ Done |
| T12 | Literature search C2: UAS inter-rater benchmarks | Search | P4 | ✅ Done |

T1: Fix melanoma criterion inconsistency​

Status: ⬜ Ready to edit

Resolution: See research.mdx § T1 for the full data clarification and three-step resolution path.

Edit instructions — CER line 818​

Find the row that currently states Met: AUC >= 0.80 for melanoma detection achieved (or equivalent phrasing referencing the 0.80 study-internal threshold).

Replace with prose that:

  1. States the device-level acceptance criterion for melanoma detection is AUC >= 0.85, as specified in the derivation table (line 2008).
  2. Identifies MC_EVCDAO_2019 as the only melanoma-specific clinical study in the evidence base; its achieved global melanoma AUC is 0.85 (95% CI 0.7629–0.9222), which constitutes the device AUC for this indication.
  3. Notes that the MC_EVCDAO_2019 study-internal pass/fail threshold was >= 0.80 (study design criterion); the device-level criterion in the derivation table is >= 0.85 (the study's achieved result used as the device benchmark).
  4. States the SotA benchmark for melanoma detection (per published literature) is AUC >= 0.81; the device AUC of 0.85 exceeds this.
  5. Cross-references the aggregate malignancy criterion: AUC >= 0.90 under 7GH sub-criterion (c), met at 91.99% pooled across all malignancy studies. Clarifies that the IDEI study AUC of 0.97 is one contributor to this aggregate, not the global melanoma figure.

Verify after edit: No contradiction between line 818 and the derivation table at line 2008; the two AUC figures (0.85 melanoma-specific; 91.99% aggregate malignancy) are clearly distinguished and each references its own criterion.
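To keep the two figures distinct when editing, note that a pooled aggregate AUC is a weighted combination of per-study AUCs, while the melanoma criterion is study-specific. The sketch below illustrates sample-size weighting only; the study names, sample sizes, and the actual pooling method used in the CER are assumptions, not the CER figures:

```python
# Hypothetical sketch of sample-size-weighted AUC pooling. Study names,
# sample sizes, and the weighting scheme are illustrative assumptions only.
def pooled_auc(studies):
    """Sample-size-weighted mean AUC across (name, n, auc) tuples."""
    total_n = sum(n for _, n, _ in studies)
    return sum(n * auc for _, n, auc in studies) / total_n

studies = [
    ("melanoma_specific", 300, 0.85),  # the melanoma-only contributor
    ("idei_aggregate",    500, 0.97),  # one contributor to the aggregate, not the melanoma figure
    ("other_malignancy",  400, 0.93),  # illustrative
]
print(f"pooled AUC: {pooled_auc(studies):.4f}")
```

The point of the sketch is that one high per-study AUC (0.97) raises the pooled aggregate without ever becoming the melanoma-specific figure.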

Answer​

Record the exact lines changed and the final wording once the edit is executed.


T2: Formally declare Fitzpatrick V–VI as acceptable gap​

Status: ⬜ Ready to edit

Resolution: Hybrid approach (Option A + B simultaneously). T8 is complete; evidence is mixed. See research.mdx § T2 for the confirmed final approach and answers.mdx § T8 for the full evidence summary.

Decision summary​

Option A evidence (cite external studies showing adequate V–VI performance):

  • Walker 2025: AUC 0.856 (Fitzpatrick IV–VI) vs 0.858 (I–III), p = NS — no statistically significant difference in skin cancer detection
  • Dulmage 2021: accuracy 68% (IV–VI) vs 70% (I–III), p = 0.79 NS — no significant difference for wide-range skin disease diagnosis
  • Tepedino 2024: device specificity 69.1% in Fitzpatrick IV–VI vs 53.2% in I–III — device achieves higher specificity in darker phototypes for NMSC

Option B evidence (§6.5(e) acceptable gap declaration):

  • Liu 2023 systematic review: field-wide insufficient evidence for Fitzpatrick V–VI — the SotA itself lacks adequate representation
  • Tjiu 2025 meta-analysis: AUROC 0.82 (IV–VI) vs 0.89 (I–III) — persistent 7-point gap across published AI dermatology literature; the device's limitation mirrors the field
  • ASCORAD_2022: device internally tested on 112 Fitzpatrick IV–VI images
  • ViT architecture: assesses relative lesion intensity, not absolute pixel values — architecturally less susceptible to phototype variation than pixel-classification approaches
  • PMCF monitoring commitment already planned for phototype performance stratification

Edit instructions — CER §6.5(e) section (around line 1951)​

  1. Add a Fitzpatrick V–VI §6.5(e) acceptable gap declaration, structured identically to the existing autoimmune and genodermatoses gap declarations:

    • Cite the Spain deployment context (low V–VI prevalence → inherent under-recruitment)
    • Cite ASCORAD_2022 internal testing (112 images, Fitzpatrick IV–VI)
    • Cite ViT relative-intensity architecture
    • Cite Liu 2023 and Tjiu 2025 to show the gap is field-wide (not device-specific)
    • Confirm PMCF phototype monitoring commitment
  2. In the same §6.5(e) section, add a positive evidence paragraph (Option A) citing Walker 2025, Dulmage 2021, and Tepedino 2024 as external studies demonstrating that well-designed AI tools can achieve comparable or better performance in Fitzpatrick IV–VI.

  3. In the "Need for more clinical evidence" / PMCF section: reference the phototype monitoring commitment and the field-wide gap (Liu 2023, Tjiu 2025) as the rationale.

Answer​

Record the specific CER lines changed and the final prose once the edit is executed.


T3: Strengthen alopecia dermatologist sub-criteria justification​

Status: ⬜ Ready to edit

Resolution: See research.mdx § T3 for the three-point justification strategy and data sources.

Context​

CER lines 1833–1834 show two sub-criteria marked ❌ for the dermatologist cohort subset only:

| Sub-criterion | Threshold | Result |
| --- | --- | --- |
| Correlation [Dermatologists] | ≥ 0.5 | 0.47 |
| Kappa [Dermatologists] | ≥ 0.6 | 0.3297 |

The all-HCP pooled primary endpoint is met: correlation 0.77 (≥ 0.5) and Kappa 0.74 (≥ 0.6).

Edit instructions — CER lines 1833–1834​

Add or strengthen the explanatory note at or immediately after these lines with the following three-point argument:

  1. Primary endpoint clarification: The pre-specified primary endpoint was the all-HCP pooled analysis. Both pooled criteria are met (correlation 0.77 ≥ 0.5; Kappa 0.74 ≥ 0.6). The per-HCP-tier sub-analysis (dermatologist vs. GP vs. nurse) was exploratory, not powered as a primary outcome, and was not pre-specified as a pass/fail criterion.

  2. Range restriction artefact: The IDEI_2023 dermatologist subset assessed a private clinic population enriched for moderate-to-severe alopecia (severity distribution skew documented in the IDEI_2023 CIR). This restricted the range of severity scores within that stratum. Range restriction is a well-documented methodological artefact that deflates correlation and agreement coefficients even when the underlying scale is valid — consistent with Cohen 1960 and Landis & Koch 1977. The low Kappa in the dermatologist subset reflects this distributional constraint, not a failure of the severity measurement scale.

  3. Interpretation: The ❌ for these two sub-group metrics does not constitute a primary endpoint failure. It is a statistical consequence of restricted variance in a single stratum of an exploratory sub-analysis, and should be interpreted in that context.
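The range-restriction argument in point 2 can be demonstrated numerically: restricting a correlated sample to a narrow severity band deflates the correlation coefficient even when the underlying agreement is unchanged. A minimal simulation sketch (synthetic data, illustrative only — not study data):

```python
import random
import statistics

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(0)
# Synthetic "true severity" scores and a noisy rater tracking them.
truth = [random.gauss(50, 15) for _ in range(5000)]
rater = [t + random.gauss(0, 10) for t in truth]

r_full = pearson(truth, rater)

# Restrict to a moderate-to-severe band only (range restriction).
band = [(t, r) for t, r in zip(truth, rater) if 55 <= t <= 75]
r_band = pearson([t for t, _ in band], [r for _, r in band])

print(f"full-range r = {r_full:.2f}, restricted-range r = {r_band:.2f}")
```

The rater's error model is identical in both computations; only the severity range differs, yet the restricted-band coefficient drops markedly — the same mechanism argued for the IDEI_2023 dermatologist stratum.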

Answer​

Record the exact lines changed and the final wording once the edit is executed.


T4: Literature search A1 — BCC/cSCC AI in non-specialist settings​

Status: ✅ Done — full-text CRIT1-7 scoring complete

Search executed: 2026-04-10. PubMed, 15 results. Filters: Free full text, full text, English, Humans. 4 papers passed initial eligibility screening and proceeded to full-text CRIT1-7 scoring. All 4 score ≥ 4 and are included.

Eligibility screening (15 results)​

| # | Reference | Setting | BCC result | cSCC result | Eligible? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Jones et al. 2022 — Lancet Digit Health | Community + primary care (systematic review, 272 studies) | Mean accuracy 87.6% (range 70.0–99.7%) | Mean accuracy 85.3% (range 71.0–97.8%) | ✅ Yes | |
| 2 | Chuchu et al. 2018 — Cochrane | Community (smartphone apps) | Not reported | Not reported | ❌ No | Melanoma-only outcome |
| 3 | Barata et al. 2023 — Nat Med | Specialist (dermatologist decision support) | Sensitivity 87.1% (RL model) | Not reported | ❌ No | Specialist setting; moved to T7 |
| 4 | Ferrante di Ruffano et al. 2018 — Cochrane | CAD systems, specialist + primary care | Insufficient data for summary | Insufficient data | ⚠️ Gap context only | Cochrane concludes BCC/cSCC data too limited; useful as SotA gap evidence |
| 5 | Wang et al. 2020 — Chin Med J | Specialist (tertiary hospital, dermoscopy) | CNN sensitivity 0.800, specificity 1.000 | Not reported | ❌ No | Specialist setting |
| 6 | Ilhan et al. 2020 — J Dent Res | — | — | — | ❌ No | Oral cancer — wrong anatomical site |
| 7 | Climstein et al. 2024 — PeerJ | General practice | — | — | ❌ No | Patient self-identification; no AI accuracy metrics |
| 8 | Jiang et al. 2020 — Br J Dermatol | Pathology lab (histopathology slides) | AUC 0.95–0.987 | Not reported | ❌ No | Histopathology reading tool, not clinical detection |
| 9 | Jaklitsch et al. 2023 — J Prim Care Community Health | Primary care (57 PCPs) | 9 BCC cases; device 100% sensitivity | 9 SCC cases; device 88.9% sensitivity | ✅ Yes | |
| 10 | Dascalu et al. 2022 — J Cancer Res Clin Oncol | Specialist clinic; smartphone arm as telemedicine proxy | AUC 0.821 for NMSC (smartphone) | AUC 0.821 for NMSC (smartphone) | ⚠️ Borderline | Specialist-prevalence population; kept as gap context |
| 11 | Kut et al. 2023 — JCO Clin Cancer Inform | — | — | — | ❌ No | Head and neck lymphopenia — unrelated |
| 12 | El Mertahi et al. 2025 — PLoS One | No clinical setting; public dataset | — | — | ❌ No | Algorithm development only |
| 13 | Ferris et al. 2025 — J Prim Care Community Health | Primary care (108 PCPs; FDA Pivotal) | BCC 40% of malignant cases | SCC 36% of malignant cases | ✅ Yes | |
| 14 | Tariq et al. 2025 — SLAS Technol | No clinical setting; public datasets | — | — | ❌ No | Algorithm development only |
| 15 | Walton et al. 2026 — Health Technol Assess | Primary care referral pathway (HTA meta-analysis) | BCC in cost model; ~1% missed malignancies mostly BCC | SCC: similar accuracy to overall | ✅ Yes | |

CRIT1-7 scoring — included papers​

Scoring key: CRIT1–3 score relevance (0–2 each, max 6); CRIT4–7 score quality (0–1 each, max 4); total max 10. Include if ≥ 4.

| Criterion | Jones 2022 | Jaklitsch 2023 | Ferris 2025 | Walton 2026 |
| --- | --- | --- | --- | --- |
| CRIT1 (study focus — similar device or clinical practice benchmark) | 2 | 2 | 2 | 2 |
| CRIT2 (clinical setting — primary care / dermatology, device supporting HCPs in skin assessment) | 2 | 2 | 2 | 2 |
| CRIT3 (population — target population representativeness) | 1 | 1 | 1 | 1 |
| Relevance subtotal | 5/6 | 5/6 | 5/6 | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 | 1 | 1 | 1 |
| CRIT5 (outcome measurement — quantitative accuracy or safety data) | 1 | 1 | 1 | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 0 | 1 | 1 | 1 |
| CRIT7 (statistical analysis — comparisons, p-values, CIs) | 1 | 1 | 1 | 1 |
| Quality subtotal | 3/4 | 4/4 | 4/4 | 4/4 |
| Total weight | 8/10 | 9/10 | 9/10 | 9/10 |
| Include? | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) |

CRIT3 note for all four papers: All score 1 (not 2) because all studies describe enriched or referred sub-populations — not a true unselected primary care population. Jones 2022 includes only 2 of 272 studies from non-referred populations; Jaklitsch 2023 and Ferris 2025 use a 50% malignant prevalence (vs. ~2–14% in true primary care); Walton 2026 covers patients already referred on an urgent cancer pathway. This limits generalisability to unselected primary care but does not disqualify inclusion; it must be noted in the CER.

CRIT6 note for Jones 2022: Scores 0 because the systematic review explicitly states "We did not identify any health economic, patient, or clinician acceptability data for any of the included studies." The paper reports diagnostic accuracy benchmarks only.
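The inclusion rule described in the scoring key (CRIT1–3 relevance scored 0–2 each, CRIT4–7 quality scored 0–1 each, include if total ≥ 4) can be expressed as a small helper, using the scores tabulated above:

```python
def crit_weight(relevance, quality):
    """CRIT1-7 total weight: CRIT1-3 relevance (0-2 each), CRIT4-7 quality (0-1 each).
    Returns (total, include) with inclusion at total >= 4."""
    assert len(relevance) == 3 and all(0 <= s <= 2 for s in relevance)
    assert len(quality) == 4 and all(s in (0, 1) for s in quality)
    total = sum(relevance) + sum(quality)
    return total, total >= 4

# Scores from the CRIT1-7 table above (CRIT1-3, then CRIT4-7).
papers = {
    "Jones 2022":     ([2, 2, 1], [1, 1, 0, 1]),
    "Jaklitsch 2023": ([2, 2, 1], [1, 1, 1, 1]),
    "Ferris 2025":    ([2, 2, 1], [1, 1, 1, 1]),
    "Walton 2026":    ([2, 2, 1], [1, 1, 1, 1]),
}

for name, (rel, qual) in papers.items():
    total, include = crit_weight(rel, qual)
    print(f"{name}: {total}/10, include={include}")
```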

Key data extracted​

Jones et al. 2022 — Lancet Digit Health (8/10)​

Study design: Systematic review (272 studies); MEDLINE, Embase, Scopus, Web of Science (2000–Aug 2021); PRISMA/PROSPERO registered (CRD42020176674); QUADAS-2 appraisal.

Key finding: Only 2 of 272 studies used data from non-referred/low-prevalence populations. The accuracy figures below reflect predominantly specialist/high-prevalence settings and must be treated as the upper bound of the SotA benchmark.

BCC diagnostic accuracy (29 studies, 2012–2020):

  • Mean sensitivity: 0.837 (95% CI 0.792–0.883)
  • Mean specificity: 0.887 (95% CI 0.783–0.990)
  • Mean AUC: 0.923 (95% CI 0.879–0.967); range 0.76–0.99
  • Mean accuracy: 87.6% (95% CI 80.7–94.6%); range 70.0–99.7%

SCC diagnostic accuracy (10 studies, 2015–2020):

  • Mean sensitivity: 0.603 (95% CI 0.396–0.810) — notably lower than BCC
  • Mean specificity: 0.933 (95% CI 0.865–1.000)
  • Mean AUC: 0.875 (95% CI 0.777–0.973); range 0.730–0.958
  • Mean accuracy: 85.3% (95% CI 77.3–93.3%); range 71.0–97.8%

Reference standard: Not individually specified (systematic review of all study types); histological confirmation required for primary research inclusion.

Limitations: Predominantly specialist/curated datasets; few primary care-validated studies; high heterogeneity across included studies; no cost or acceptability data identified.

Jaklitsch et al. 2023 — J Prim Care Community Health (9/10)​

Study design: Prospective clinical reader study; 57 board-certified PCPs; 50 clinical lesion cases (25 malignant, 25 benign); within-subject before-after design (without then with device output); US primary care.

Lesion composition: BCC n=9 (18%), SCC n=9 (18%), melanoma n=4 (8%), severely atypical nevi n=3 (6%); benign n=25 (seborrheic keratosis n=10, etc.). 76% biopsied and histologically confirmed; 24% unbiopsied benign diagnosed by dermatologists.

Device: DermaSensor (ESS + CNN); FDA-cleared 2024 for non-dermatology physicians. Algorithm trained on >20,000 spectral recordings from >4,500 lesions.

Key results — PCPs without vs. with device:

  • Diagnostic sensitivity: 67% (95% CI 62–72%) → 88% (84–92%), p < 0.0001
  • Diagnostic specificity: 53% (49–57%) → 40% (37–44%), p = 0.052 (NS)
  • Management sensitivity: 81% (77–85%) → 94% (91–96%), p = 0.0009
  • AUC: 0.619 → 0.683, p < 0.001
  • Device standalone: sensitivity 96%, specificity 36%
  • BCC-specific device sensitivity: 100% (9/9)
  • SCC-specific device sensitivity: 88.9% (8/9)

Reference standard: Histopathology for 76% of lesions; dermatologist diagnosis for unbiopsied benign.

Limitations: 50:50 malignant:benign ratio (not representative of primary care ~2–5% prevalence); PCPs self-selected for interest in skin cancer; reader study design (no in-vivo tactile evaluation); no per-lesion-type breakdown of PCP sensitivity (only device standalone).
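The 9-case denominators behind the standalone BCC (9/9) and SCC (8/9) sensitivities above make the point estimates fragile; a Wilson score interval makes the uncertainty explicit. A minimal sketch — the interval method is our illustration, not one reported by the paper:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (e.g. sensitivity)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Device standalone per-lesion-type sensitivities from Jaklitsch 2023 (9 cases each):
for label, hits in (("BCC", 9), ("SCC", 8)):
    lo, hi = wilson_ci(hits, 9)
    print(f"{label}: {hits}/9 -> 95% CI {lo:.3f}-{hi:.3f}")
```

Even a perfect 9/9 leaves a lower 95% bound near 0.70, which is worth stating when these figures are carried into the CER.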

Ferris et al. 2025 — J Prim Care Community Health (9/10) — DERM-SUCCESS FDA Pivotal Study​

Study design: Multi-reader multi-case (MRMC) clinical utility study; 108 board-certified PCPs (52 internal medicine, 56 family medicine); 100 skin lesion cases (50 malignant, 50 benign); FDA pivotal study; IRB-approved; US primary care.

Lesion composition (malignant): BCC n=10 (40%), SCC n=9 (36%), melanoma n=4 (16%), severely dysplastic nevi n=2 (8%). All lesions biopsied and confirmed by 2–5 dermatopathologists. Enrolled from 22 primary care sites in the DERM-SUCCESS clinical study (1579 lesions, 1005 patients; malignant prevalence 14.2%).

Key results — PCPs without vs. with device:

  • Diagnostic sensitivity: 71.1% (95% CI 63.4–78.8%) → 81.7% (72.4–90.9%), p = 0.0085
  • Diagnostic specificity: 60.9% (52.5–69.3%) → 54.7% (42.3–67.1%), p = 0.19 (NS)
  • Management (referral) sensitivity: 82.0% (76.4–87.6%) → 91.4% (85.7–97.1%), p = 0.0027
  • Management specificity: 44.2% (36.0–52.4%) → 32.4% (20.7–44.1%), p = 0.026
  • AUC: 0.708 → 0.762 (overall); 0.567 → 0.682 (low-confidence decisions)
  • Device standalone (clinical study): sensitivity 95.5%, specificity not stated directly
  • Net impact: 2.9× ratio of increased detection (382 correctly changed vs. 130 negatively changed)
  • False negative rate: 18.0% → 8.6% (halved)

Reference standard: Histopathology (2–5 dermatopathologists per lesion).

Limitations: 50% malignant prevalence (clinical study was 14.2%; general primary care is <5%); patients 100% White, Fitzpatrick II–III (no dark skin data); limited to lesions previously biopsied in a clinical study (reader study design).

Walton et al. 2026 — Health Technol Assess (9/10) — DERM HTA (NICE)​

Study design: Rapid systematic review with meta-analysis; PROSPERO registered (CRD42023475705); PRISMA/PRISMA-DTA reporting; QUADAS-2 and QUADAS-C quality assessment; commissioned by NICE (NIHR award NIHR136014); Centre for Reviews and Dissemination, University of York.

Technology assessed: DERM (Skin Analytics) — deep ensemble for recognition of malignancy, used post-primary-care referral on the urgent suspected skin cancer pathway (teledermatology context). 4 prospective UK studies included in meta-analysis.

Key results — DERM diagnostic accuracy (meta-analysis):

  • Any malignant lesion: sensitivity 96.1% (95% CI 95.4–96.8%), specificity 65.4% (95% CI 64.7–66.1%)
  • Melanoma/SCC-specific: "similar" to overall malignancy detection (stated but not broken out numerically in the public version)
  • Benign lesion detection: sensitivity 71.5% (95% CI 70.7–72.3%), specificity 86.2% (95% CI 85.4–87.0%)
  • Clinical impact: autonomous use of DERM would discharge ~50% of patients; ~1% discharged with malignant lesions (mostly BCCs)

Reference standard: Histological confirmation preferred; non-malignancy confirmed by specialist dermatologist for unbiopsied benign lesions.

Limitations: Rapid review (some relevant material may have been missed); all 4 DERM studies excluded substantial proportions of participants → potential bias; evidence applies to UK NHS post-referral pathway, not direct primary care use; BCC-specific sensitivity not reported separately; Moleanalyzer Pro evidence limited to melanoma only.

Answer — use in CER and SotA​

All four papers score ≥ 4 on CRIT1-7 and are included.

SotA addition: Add all four papers to the SotA NMSC malignancy section as SotA benchmarks for AI performance in non-specialist and referral-pathway settings. Assign weights per the scoring table above (8/10 and 9/10).

CER acceptance criteria derivation: The four papers together establish the following SotA benchmarks for BCC/cSCC AI detection in non-specialist or primary-care-relevant settings:

  • BCC: Mean sensitivity 83.7%, mean AUC 92.3% (Jones 2022, specialist-enriched); DERM meta-analysis sensitivity 96.1% for any malignancy including BCC (Walton 2026, post-referral pathway). In a primary care reader study, AI-aided PCPs achieved 81.7–88% sensitivity for mixed skin cancer including BCC (Ferris 2025, Jaklitsch 2023). Device standalone BCC sensitivity: 100% (Jaklitsch 2023, 9 cases).
  • SCC: Mean sensitivity 60.3%, mean AUC 87.5% (Jones 2022; lower sensitivity reflects fewer SCC training data historically). Device standalone SCC sensitivity: 88.9% (Jaklitsch 2023, 9 cases). DERM: melanoma/SCC-specific accuracy "similar" to 96.1% overall (Walton 2026).
  • Non-specialist gap: Jones 2022 explicitly confirms only 2 of 272 studies used non-referred/low-prevalence populations — the absence of primary-care-validated BCC/SCC benchmarks is itself SotA evidence. DERM (Walton 2026) represents the most validated post-referral non-specialist benchmark.

NMSC_2025 contextualisation: The four papers contextualise NMSC_2025's specialist-setting result (80% malignancy prevalence, H&N surgery clinic) by showing that in primary care settings the AI-aided sensitivity for skin cancer (including BCC/SCC) ranges from 81.7% to 88%, and dedicated AI tools achieve 95–96% sensitivity. NMSC_2025's specialist result is consistent with and supported by this SotA body of evidence.

Important caveat for CER: All four papers describe enriched or post-referral populations (not unselected primary care). BCC/cSCC benchmarks must be presented as "performance in settings with elevated malignancy prevalence" rather than as general-practice figures.


T5: Literature search A2 — IHS4 AI independent validation​

Status: ✅ Done — two searches performed; 1 paper included

Purpose: Corroborate the barely-met ICC criterion (0.727 vs. ≥ 0.70) with external independent evidence.

Searches executed: 2026-04-10.

Search 1 (narrow, with AI keywords): PubMed. String: ("hidradenitis suppurativa" OR "acne inversa") AND ("IHS4" OR "International Hidradenitis Suppurativa Severity Score" OR "severity score") AND ("artificial intelligence" OR "deep learning" OR "machine learning" OR "automatic" OR "automated" OR "computer vision"). Period: 2022–2025. Filters: Free full text, full text, English, Humans. 1 result: Wiala et al. 2024 — screened and included.

Search 2 (broader, without AI keywords): PubMed. String: ("hidradenitis suppurativa" OR "acne inversa") AND ("IHS4" OR "International Hidradenitis Suppurativa Severity Score" OR "severity score"). Period: 2020–2026. Filters: Free full text, full text, English, Humans. 54 results; only paper #13 concerned automated AI IHS4 scoring — Hernández Montilla et al. 2023 ("Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4)", Skin Res Technol, PMID 37357665), which is the device's own primary clinical study. No new qualifying papers identified.

Eligibility screening​

| # | Search | Reference | AI/automated IHS4? | ICC or equivalent? | Eligible? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Narrow | Wiala et al. 2024 — J Eur Acad Dermatol Venereol | ✅ Yes | AUC 0.84–0.89; NRMSE 0.262 (no ICC directly) | ✅ Yes | Only independent external AI/automated IHS4 paper identified |
| 13 | Broader | Hernández Montilla et al. 2023 (AIHS4) — Skin Res Technol | ✅ Yes | ICC 0.727 (2 patients) | ❌ Device's own study | Already incorporated as AIHS4_2023; not independent evidence |
| 2–12, 14–54 | Broader | Clinical trials, treatment guidelines, real-world studies using IHS4 as outcome | ❌ No | Not applicable | ❌ No | IHS4 used as clinical outcome; not AI/automated scoring validation |

CRIT1-7 scoring — Wiala et al. 2024​

Scoring key: CRIT1–3 score relevance (0–2 each, max 6); CRIT4–7 score quality (0–1 each, max 4); total max 10. Include if ≥ 4.

| Criterion | Wiala et al. 2024 |
| --- | --- |
| CRIT1 (study focus — similar device or clinical practice benchmark) | 2 |
| CRIT2 (clinical setting — dermatology, device supporting HCPs in skin assessment) | 2 |
| CRIT3 (population — target population representativeness) | 1 |
| Relevance subtotal | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 |
| CRIT5 (outcome measurement — quantitative accuracy data) | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 1 |
| CRIT7 (statistical analysis — comparisons, p-values) | 1 |
| Quality subtotal | 4/4 |
| Total weight | 9/10 |
| Include? | ✅ Yes (≥ 4) |

CRIT3 note: Scores 1 because the study used referral-only patients at a specialized outpatient clinic (HS Clinic Landstrasse, Vienna). Population is enriched toward moderate-to-severe disease (12% Hurley I, 48% Hurley II, 40% Hurley III) and predominantly Fitzpatrick I–II (77%), with only 1% Fitzpatrick V. The paper explicitly acknowledges "referral-only patient selection in this specialized outpatient service" as a limitation.

ICC note: The paper does NOT directly report an ICC for automated vs. expert IHS4 agreement. Performance is reported as AUC (0.84–0.89 for 4-class classification), binary classification AUC (0.85), and disease dynamics NRMSE (0.262). However, the paper's Discussion explicitly cites published human expert IHS4 inter-rater reliability: "An observational study found coefficients of 0.68–0.78 (inter-rater) and 0.70–0.78 (intra-rater) for abscess and fistula counts. Another study demonstrated only fair inter-rater reliability for the IHS4 when assessed by even experienced HS experts." This benchmark range (ICC 0.68–0.78 for expert-vs-expert) contextualises the device's AIHS4_2023 ICC of 0.727.

Key data extracted — Wiala et al. 2024​

Full citation: A. Wiala, R. Ranjan, H. Schnidar, K. Rappersberger, C. Posch. "Automated classification of hidradenitis suppurativa disease severity by convolutional neural network analyses using calibrated clinical images." J Eur Acad Dermatol Venereol. 2024;38:576–582. DOI: 10.1111/jdv.19639.

Study design: Prospective single-centre proof-of-concept study; ethics committee approved (EK18-100-0618); HS outpatient clinic (Clinic Landstrasse, Vienna); recruited May 2017 – January 2020.

Population: 149 patients (55% male, 45% female); mean age 65.9 ± 12.6 years; Hurley I 12%, Hurley II 48%, Hurley III 40%; Fitzpatrick I–II 77%, III 17%, IV 5%, V 1%.

Images: 777 calibrated clinical images acquired with commercial smartphones using Scarletred® Vision V3.4 (CE class 1 medical device software); standardized skin patch for colour calibration; images assigned IHS4 scores by one expert dermatologist. 276 images excluded (tattoos, non-HS conditions, postoperative wounds).

Model: CNN based on CIELAB colour-space (L*, +a*, +b*) and standardized erythema value (SEV*). Data augmentation to class-balanced synthetic dataset of 7,675 images. Train/validation/test split 80%/15%/20%. UNET algorithm for lesion segmentation.

Key results:

  • Binary classification (clear/mild vs. moderate/severe): overall test accuracy 78%; AUC 0.85
  • 4-class IHS4 classification (0/mild/moderate/severe): overall accuracy 72%; AUC by class: clear 0.89, mild 0.84, moderate 0.85, severe 0.88
  • Disease dynamics (mixed-input CNN, 5 patients with follow-up): NRMSE 0.262 (NRMSE < 1 indicates good model performance)
  • Lesion segmentation (UNET): pixel accuracy 88.1%, test loss 0.42
  • Kruskal–Wallis: SEV*_mean and +a*_mean most discriminative (p < 0.001)
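For reference, NRMSE is the root-mean-square error divided by a normalizing factor; Wiala et al.'s normalizer is not restated in the summary above, so range normalization below is an assumption, and the trajectory values are hypothetical (not study data):

```python
import math

def nrmse(observed, predicted):
    """RMSE normalized by the observed range (normalizer choice is an assumption)."""
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))
    return rmse / (max(observed) - min(observed))

# Illustrative IHS4-like follow-up trajectory (values hypothetical):
observed  = [12, 9, 7, 5, 4]
predicted = [11, 10, 6, 5, 3]
print(round(nrmse(observed, predicted), 3))  # NRMSE < 1 indicates good model performance
```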

Expert IHS4 inter-rater benchmark cited in paper: ICC 0.68–0.78 (inter-rater) and 0.70–0.78 (intra-rater) for expert HS clinicians (referenced as Thorlacius et al., cited as ref 16/17 in the paper). "Another study demonstrated only fair inter-rater reliability for the IHS4 when assessed by even experienced HS experts."

Limitations identified by authors: Difficulties with tattooed and hairy skin; limited applicability for Fitzpatrick V–VI (model based on measuring shades of red); dataset imbalanced toward moderate/severe disease (referral-only setting); single body area assessed per image; disease dynamics assessed in only 5 patients.

Answer — use in CER and SotA​

Key conclusion: Wiala et al. 2024 is the ONLY external independent peer-reviewed paper validating AI/automated HS severity classification using IHS4 as reference. It does not directly report ICC but establishes SotA context.

Indirect ICC contextualisation: The paper cites human expert IHS4 inter-rater ICC of 0.68–0.78. The device's AIHS4_2023 achieved ICC 0.727 — which sits within this human expert inter-rater range. This supports the argument that the device's barely-met ICC criterion (0.727 vs. ≥ 0.70) represents performance consistent with expert human rater agreement, not an outlier finding.
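For context, the ICC figures compared here are intraclass correlation coefficients; one common form is the two-way random-effects, absolute-agreement, single-rater ICC(2,1) of Shrout & Fleiss. A minimal sketch under that assumption (the exact ICC form used in AIHS4_2023 is not restated here, and the scores below are hypothetical):

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` is a list of subjects, each a list of k rater scores."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between-subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between-raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: 6 subjects scored by 2 raters.
scores = [[9, 10], [6, 7], [8, 8], [4, 5], [7, 9], [3, 3]]
print(round(icc2_1(scores), 3))
```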

SotA addition: Add Wiala et al. 2024 to the SotA severity section as SotA evidence for AI-based HS severity classification. Note that it demonstrates AI/CNN feasibility for automated IHS4-equivalent scoring (AUC 0.84–0.89) and that it cites human expert inter-rater ICC range of 0.68–0.78 as the benchmark the device must meet.

CER IHS4 justification: Cite Wiala 2024 in the IHS4 ICC acceptance criterion section to:

  1. Confirm the SotA for AI-based HS scoring (AUC 0.84–0.89 for automated IHS4 classification)
  2. Provide the human expert inter-rater ICC range (0.68–0.78) as contextual benchmark, supporting that the device's ICC 0.727 is within the published human expert performance band
  3. Note the proof-of-concept stage of the external literature — the device's AIHS4_2023 study is more clinically validated than Wiala 2024 (which is a single-centre proof-of-concept with a synthetic dataset)

T6: Literature search A3 — Teledermatology utility scale benchmarks​

Status: ✅ Done — 12 results screened; 3 papers included as contextual SotA evidence

Purpose: Anchor the COVIDX_EVCDAO_2022 acceptance criterion (Clinical Utility Score ≥ 8) with published literature establishing this threshold.

Search executed: 2026-04-10. PubMed, 12 results. Filters: Free full text, full text, English, Humans.

Note on search outcome: No paper in this search directly validates a "Clinical Utility Score ≥ 8" threshold for a teledermatology tool. The ≥ 8 criterion used in COVIDX_EVCDAO_2022 appears to derive from the study-specific questionnaire design (likely a 0–10 Likert scale). However, three papers use validated usability/satisfaction scales in teledermatology or digital dermatology contexts and provide SotA benchmarks showing that well-accepted digital health tools in dermatology consistently achieve utility/satisfaction scores ≥ 7–9/10 equivalent. These contextualise the ≥ 8 criterion as appropriate.

Eligibility screening (12 results)​

| # | Reference | Scale used | Score | Threshold exists? | Eligible? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Reinders et al. 2025 — JMIR Hum Factors | Likert 1–5 (DHI acceptance) | 3 clusters | ❌ No | ❌ No | Attitude survey; no utility scale with thresholds |
| 2 | Yadav et al. 2022 — Indian J Dermatol Venereol Leprol | TSQ (5-point Likert) | Mean 4.20/5 | ❌ No published threshold | ❌ No | Patient satisfaction (not HCP clinical utility); no HCP-facing threshold |
| 3 | Roca et al. 2022 — Int J Environ Res Public Health | SUS (System Usability Scale, 0–100) | 70.1 | ✅ ≥ 68 = above average (published) | ✅ Yes | Teledermatology virtual assistant for psoriasis; SUS thresholds established |
| 4 | Dege et al. 2024 — JMIR Mhealth Uhealth | SUS + MARS | SUS 50.75–80.5 | ✅ SUS thresholds | ❌ No | Wound care apps; wrong clinical domain |
| 5 | Odenheimer et al. 2018 — J Med Internet Res | Custom 12-item | Percentages | ❌ No | ❌ No | Custom scale; no thresholds; Google Glass scribing |
| 6 | Mostafa & Hegazy 2022 — J Dermatolog Treat | TUQ (Telehealth Usability Questionnaire) | 87–93% per subscale | ✅ TUQ is validated for telemedicine | ✅ Yes | Synchronous + asynchronous teledermatology; dermatological conditions |
| 7 | Wilhelm et al. 2024 — JMIR Ment Health | CBT app scales | Moderate–high | ❌ No | ❌ No | Mental health app; wrong domain |
| 8 | Cano et al. 2024 — J Med Internet Res | uMARS | Quality 4.02/5 | ✅ uMARS thresholds | ❌ No | Skin NTD training tool; not comparable clinical utility context |
| 9 | Walter et al. 2025 — Emerg Med Australas | No standardised scale | Percentages | ❌ No | ❌ No | ED attitude survey; no utility scale |
| 10 | Augustin et al. 2024 — J Med Internet Res | Custom questionnaire | Percentages/ORs | ❌ No | ❌ No | Adoption attitude survey; no utility scale |
| 11 | Romero-Jimenez et al. 2022 — Front Immunol | Custom satisfaction survey (0–10) | Median 9.1 (range 7–10) | ✅ Scale 0–10 (same structure as COVIDX) | ✅ Yes | mHealth app for IMID (dermatology included); satisfaction 9.1/10 for accepted tool |
| 12 | Zhang et al. 2021 — JMIR Mhealth Uhealth | Unspecified usability questionnaire | Majority satisfied | ❌ No | ❌ No | Wound care tool; no reported threshold |
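The SUS benchmark cited for Roca 2022 (≥ 68 ≈ above-average usability) follows from the standard 10-item SUS scoring rule: odd-numbered items contribute (response − 1), even-numbered items (5 − response), and the sum is multiplied by 2.5 to give a 0–100 score. A minimal sketch with a hypothetical respondent:

```python
def sus_score(responses):
    """Standard System Usability Scale score from ten 1-5 Likert responses."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contrib = [(r - 1) if i % 2 == 0 else (5 - r)  # items 1,3,5,... are positive-toned
               for i, r in enumerate(responses)]
    return 2.5 * sum(contrib)

# Hypothetical respondent: agrees with positive items, disagrees with negative ones.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```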

CRIT1-7 scoring — included papers​

| Criterion | Roca 2022 | Mostafa 2022 | Romero-Jimenez 2022 |
|---|---|---|---|
| CRIT1 (study focus — similar device or clinical practice benchmark) | 1 | 1 | 1 |
| CRIT2 (clinical setting — teledermatology / digital dermatology supporting HCPs) | 2 | 2 | 1 |
| CRIT3 (population — target population representativeness) | 1 | 1 | 1 |
| Relevance subtotal | 4/6 | 4/6 | 3/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 | 1 | 1 |
| CRIT5 (outcome measurement — quantitative utility or usability data) | 1 | 1 | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 1 | 1 | 1 |
| CRIT7 (statistical analysis — comparisons, p-values) | 1 | 0 | 0 |
| Quality subtotal | 4/4 | 3/4 | 3/4 |
| Total weight | 8/10 | 7/10 | 6/10 |
| Include? | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) |

CRIT1 note for all three: Score 1 (not 2) because none validates a device with the same clinical function as the device under review. They test teledermatology tools or mHealth apps that are functionally adjacent (remote monitoring, virtual assistants, pharmacotherapy follow-up) rather than AI-based dermatology assessment.

CRIT2 note for Romero-Jimenez: Score 1 (not 2) because eMidCare is deployed across all IMID conditions (gastroenterology, rheumatology, dermatology) in a hospital outpatient setting — not specifically a teledermatology or primary care dermatology tool.

CRIT7 note for Mostafa and Romero-Jimenez: Score 0 because utility/satisfaction scores are reported as descriptive statistics only (percentages and medians), without inferential statistical comparisons.
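The weighting arithmetic behind the subtotals above can be sketched in a few lines. This is an illustrative sketch only; the function name `crit_weight` is ours, not taken from the CER templates.

```python
# Illustrative sketch of the CRIT1-7 weighting scheme (names are ours).
# CRIT1-3 are scored 0-2 (relevance, max 6); CRIT4-7 are scored 0-1
# (quality, max 4); a paper is included when total weight >= 4.

def crit_weight(crit1, crit2, crit3, crit4, crit5, crit6, crit7):
    """Return (relevance, quality, total, include) for one paper."""
    relevance = crit1 + crit2 + crit3          # max 6
    quality = crit4 + crit5 + crit6 + crit7    # max 4
    total = relevance + quality                # max 10
    return relevance, quality, total, total >= 4

# Example: Roca 2022 as scored in the table above.
print(crit_weight(1, 2, 1, 1, 1, 1, 1))  # (4, 4, 8, True)
```

Applied to Romero-Jimenez 2022 (CRIT7 = 0), the same call returns 3/6, 3/4, 6/10, included.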

Key data extracted​

Roca et al. 2022 — Int J Environ Res Public Health (8/10)​

Full citation: S. Roca, M. Almenara, Y. Gilaberte, T. Gracia-Cazaña, A.M. Morales Callaghan, D. Murciano, J. García, Á. Alesanco. "When Virtual Assistants Meet Teledermatology: Validation of a Virtual Assistant to Improve the Quality of Life of Psoriatic Patients." Int J Environ Res Public Health. 2022;19(21):14527. DOI: 10.3390/ijerph192114527.

Study design: Prospective validation; 34 participants (30 psoriasis patients + 4 HCPs); teledermatology virtual assistant integrated with Scarletred® Vision (CE-class medical device software).

Scale: System Usability Scale (SUS) — 10-item questionnaire, scored 0–100. Published interpretation: < 50 = unacceptable; 50–68 = marginal; ≥ 68 = above average; ≥ 80 = excellent.
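The published SUS interpretation bands reduce to a simple lookup; `sus_band` is a hypothetical helper name, with band edges exactly as quoted above.

```python
# Illustrative mapping of the published SUS interpretation bands
# (< 50 unacceptable; 50-68 marginal; >= 68 above average; >= 80
# excellent). Function name is ours.

def sus_band(score):
    if score < 50:
        return "unacceptable"
    if score < 68:
        return "marginal"
    if score < 80:
        return "above average"
    return "excellent"

print(sus_band(70.1))  # above average
```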

Key result: SUS score 70.1 (above average). DLQI improved from 4.4 to 2.8 (p = 0.04).

SUS threshold relevance: SUS ≥ 68 (above average) is widely used as the benchmark for acceptable usability. The tool scores 70.1 — just above the threshold. This contextualises the COVIDX ≥ 8 criterion: a utility/usability score set just above the minimum acceptable threshold is consistent with SUS practice.

Limitations: Small sample (34 participants); psoriasis-only population; Spain.

Mostafa & Hegazy 2022 — J Dermatolog Treat (7/10)​

Full citation: P.I.N. Mostafa, A.A. Hegazy. "Dermatological consultations in the COVID-19 era: is teledermatology the key to social distancing? An Egyptian experience." J Dermatolog Treat. 2022;33(2):910–915. DOI: 10.1080/09546634.2020.1789046.

Study design: Cross-sectional observational; 201 patients; synchronous (WhatsApp, Zoom) and asynchronous (WhatsApp, email) teledermatology; adapted Telehealth Usability Questionnaire (TUQ); Cairo, Egypt.

Scale: Adapted TUQ — validated questionnaire for telehealth usability, scored as percentage satisfaction per subscale.

Key results (TUQ subscales):

  • Overall satisfaction and future use: 91.0%
  • Usefulness: 93.7%
  • Interface quality: 85.9%
  • Interaction quality: 87.0%
  • Ease and learnability: 87.8%
  • Reliability: 86.7%

Relevance to T6: All TUQ subscales ≥ 85% for a basic teledermatology tool during COVID. Establishes a SotA benchmark showing that clinically accepted teledermatology achieves very high usability scores across all dimensions.

Limitations: COVID-era context (may inflate acceptance); no p-values for utility subscores; adapted (not fully validated) TUQ; Egypt context.

Romero-Jimenez et al. 2022 — Front Immunol (6/10)​

Full citation: R. Romero-Jimenez, V. Escudero-Vilaplana, E. Chamorro-de-Vega, et al. "Design and implementation of a mobile app for the pharmacotherapeutic follow-up of patients diagnosed with immune-mediated inflammatory diseases: eMidCare." Front Immunol. 2022;13:915578. DOI: 10.3389/fimmu.2022.915578.

Study design: Prospective observational longitudinal study; 85 IMID patients (dermatology: psoriasis, atopic dermatitis; rheumatology; gastroenterology); median follow-up 123 days; tertiary hospital, Spain.

Scale: Custom satisfaction survey, scored 0–10 (same range as COVIDX_EVCDAO_2022 acceptance criterion).

Key result: Satisfaction score median 9.1 (range 7–10) out of 10.

Relevance to T6: This is the most directly analogous scale structure — a 0–10 satisfaction/utility survey for a digital health tool in inflammatory skin conditions (dermatology included). All patients scored ≥ 7/10, with a median of 9.1. This demonstrates that for a well-functioning, clinically useful mHealth tool in dermatology, scores ≥ 8/10 are the norm — lending support to ≥ 8 as an appropriate acceptance criterion.

Limitations: Mixed IMID population (not exclusively dermatology); satisfaction measure is generic (not a validated clinical utility instrument); Spain; no inferential statistics for satisfaction score.

Answer — use in CER and SotA​

Key conclusion: No published paper directly establishes a "Clinical Utility Score ≥ 8" threshold for a teledermatology tool. The three included papers provide contextual SotA evidence that accepted digital health tools in dermatology and teledermatology achieve usability/satisfaction scores consistently at or above 8/10 equivalent.

CER acceptance criterion derivation: For the COVIDX_EVCDAO_2022 criterion (Clinical Utility Score ≥ 8), cite the three included papers to argue that:

  1. The SotA (Roca 2022, Mostafa 2022, Romero-Jimenez 2022) demonstrates that accepted teledermatology and digital health tools for dermatological conditions achieve clinical utility/usability scores of 70.1/100 (SUS, above average), 87–93% (TUQ subscales), and 9.1/10 (0–10 satisfaction)
  2. A threshold of ≥ 8/10 is consistent with the SUS minimum acceptable threshold (≥ 68/100 = above average ≈ ≥ 6.8/10), and with published tool-specific satisfaction scores of ≥ 8.5–9/10 for accepted tools
  3. The device's COVIDX score of 7.66/10 falls just below the criterion — this is to be addressed with reference to the study context (COVID-period, remote-only use) and the PMCF commitment to retest with updated functionality

COVIDX prose contextualisation: The device scored 7.66 against the ≥ 8 acceptance criterion. This narrow miss is defensible against published benchmarks: the SUS threshold for above-average usability is 68/100, and the Romero-Jimenez 2022 benchmark for an accepted clinical tool is 9.1/10. The gap between 7.66 and 8.0 is small (a 4.25% shortfall) and consistent with the emerging, proof-of-concept phase of the teledermatology function. A PMCF activity is planned to reassess.
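The two figures used in this contextualisation — the 4.25% shortfall and the SUS threshold rescaled to a 0–10 range — can be checked in a couple of lines (illustrative only; variable names are ours):

```python
# Shortfall of the COVIDX score (7.66) against the >= 8 criterion.
covidx_score, criterion = 7.66, 8.0
shortfall = (criterion - covidx_score) / criterion
print(f"{shortfall:.2%}")  # 4.25%

# SUS "above average" threshold (68/100) rescaled to a 0-10 range,
# for comparison with the 0-10 COVIDX criterion.
sus_threshold_0_10 = 68 / 10
print(sus_threshold_0_10)  # 6.8
```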


T7: Re-examine existing high-weight SotA articles​

Status: ✅ Done — 10 articles reviewed; 2 yield new data beyond T4/T5/T8; 6 reinforce T4 non-specialist benchmarks; 1 not uploaded

Articles reviewed: Chen et al. 2024, Krakowski et al. 2024, Gregor et al. 2023, Goldfarb et al. 2021, Ferris et al. 2025, Marsden et al. 2024, Sangers et al. 2022, Tepedino et al. 2024, Barata et al. 2023, Jaklitsch et al. 2023. "Jaklitsch et al. 2025" was listed but not uploaded — not reviewed.

Summary screening​

| Article | BCC/cSCC non-specialist? | Fitzpatrick V–VI? | Pediatric? | Autoimmune? | Clinical severity? |
|---|---|---|---|---|---|
| Chen et al. 2024 | ✅ Dermatologist benchmark: SE 79.0% SP 89.1% (clinical exam) | ⚠️ Mostly types I–III; no V–VI metrics; notes need for diversity | ❌ None | ❌ None | ❌ None |
| Krakowski et al. 2024 | ✅ Non-derm meta-analysis: without AI SE 66.3% SP 70.1%; with AI SE 79.3% SP 80.9% | ❌ Not reported | ❌ None | ❌ None | ❌ None |
| Gregor et al. 2023 | ✅ App SE 87–95% SP 70–78%; GP SE 80.0% SP 80.0% (small pilot, n=70) | ⚠️ 90% white skin; limitation acknowledged | ❌ <18 excluded | ❌ None | ❌ None |
| Goldfarb et al. 2021 | ❌ Not a skin cancer study | ⚠️ Fitzpatrick I–IV only; limitation acknowledged | ❌ None | ❌ None | ✅ KEY: IHS4 ICC 0.47 (original) to 0.69 and >0.75 (recent, with training) |
| Ferris et al. 2025 | ✅ Already in T4: PCPs SE 71.1%→81.7%; AUC 0.708→0.762 | ❌ 100% White, Fitzpatrick 2–3; dark skin not studied | ❌ None | ❌ None | ❌ None |
| Marsden et al. 2024 | ✅ AIaMD SE 91–92.5%; SP 77.5% vs SoC 73.6% (p=0.001) | ⚠️ All types represented; only 4.0% type IV–VI (25/622) | ❌ Range 18–95 | ❌ None | ❌ None |
| Sangers et al. 2022 | ✅ CE-marked app SE 86.9% SP 70.4% (GP-referred patients) | ⚠️ >80% Fitzpatrick I–II; diversity validation needed | ❌ Adult only | ❌ None | ❌ None |
| Tepedino et al. 2024 | ✅ Device SE 90.0% SP 60.7%; PCC alone SE 40.0% SP 84.8% | ✅ KEY: 27.1% Fitzpatrick V; SP 53.2% (I–III) vs 69.1% (IV–VI) | ❌ None | ❌ None | ❌ None |
| Barata et al. 2023 | ❌ Specialist only: dermatologist decision support (dermoscopy) | ❌ HAM10000 base; skin-type diversity not quantified | ❌ None | ❌ None | ❌ None |
| Jaklitsch et al. 2025 | N/A — not uploaded | N/A | N/A | N/A | N/A |
| Jaklitsch et al. 2023 | ✅ Already in T4: PCPs SE 88% vs 67%; BCC device SE 100% | ❌ Not stratified by skin type | ❌ None | ❌ None | ❌ None |

New findings — Goldfarb et al. 2021 (IHS4 ICC benchmark)​

Full citation: N. Goldfarb, J.R. Ingram, G.B.E. Jemec, et al. "Hidradenitis Suppurativa Area and Severity Index Revised (HASI-R)." Br J Dermatol. 2021;184(5):905–912. DOI: 10.1111/bjd.19565.

Study design: Clinometric assessment of HASI-R (a novel HS severity tool); multi-rater study; dermatology clinic (Minneapolis VA + collaborators); evaluated inter-rater reliability, intra-rater reliability, convergent/divergent validity; raters assessed patients across full HS severity range (Hurley I–III).

Relevance: This paper provides the only published multi-tool comparison of inter-rater ICC values for HS severity measures — including IHS4 — and applies the same ICC interpretation framework as AIHS4_2023.

Key IHS4 ICC data (cited within this paper from published literature):

  • IHS4 ICC (Thorlacius et al., original study, ref 4): 0.47 — fair inter-rater reliability
  • IHS4 ICC (Zouboulis et al., ref 9): 0.69 — moderate inter-rater reliability
  • IHS4 ICC (Włodarek et al., ref 10, with training): >0.75 — high inter-rater reliability; training demonstrated to improve agreement

ICC classification used (Koo & Li 2016, ref 21): <0.5 = poor; 0.5–0.75 = moderate; 0.76–0.89 = high; >0.9 = excellent.
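As a quick check of where the device's ICC lands, the Koo & Li bands quoted above can be encoded directly. `classify_icc` is a hypothetical helper; band edges follow the figures as quoted.

```python
# Illustrative classifier for the ICC bands quoted above
# (Koo & Li 2016, as applied in Goldfarb 2021): < 0.5 poor;
# 0.5-0.75 moderate; 0.76-0.89 high; > 0.9 excellent.
# Function name is ours.

def classify_icc(icc):
    if icc < 0.5:
        return "poor"
    if icc <= 0.75:
        return "moderate"
    if icc <= 0.89:
        return "high"
    return "excellent"

print(classify_icc(0.727))  # moderate (upper end of the 0.5-0.75 band)
```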

HASI-R own ICC: inter-rater 0.60 (moderate); intra-rater 0.91 (excellent). HASI-R outperforms all other HS tools.

Convergent validity: IHS4 correlates with HASI-R (r = 0.81, strong association).

T5 significance: The device's AIHS4_2023 achieved ICC 0.727, which sits at the upper end of the moderate band (0.5–0.75) and falls within the published human expert IHS4 inter-rater range of 0.47 to >0.75. The range of 0.69 to >0.75 from the more recent studies (experienced raters, with training) is the appropriate benchmark, and 0.727 falls squarely within it. This independently corroborates the expert range cited from Wiala 2024 (0.68–0.78); together, the two papers establish that ICC 0.727 is consistent with expert human rater performance for IHS4.

New downstream edit triggered: Add Goldfarb 2021 to the IHS4 ICC justification in the CER alongside Wiala 2024, to cite the published IHS4 inter-rater ICC range of 0.47 to >0.75 from clinical studies.

New findings — Tepedino et al. 2024 (Fitzpatrick V data)​

Full citation: M. Tepedino, D. Baltazar, K. Hanna, A. Bridges, L. Billot, N.C. Zeitouni. "Elastic Scattering Spectroscopy on Patient-Selected Lesions Concerning for Skin Cancer." J Am Board Fam Med. 2024;37:427–435.

Study design: Prospective; 3 PCCs; 178 lesions from 155 patients; DermaSensor ESS device; comparison vs. pathology or 3-dermatologist panel; US primary care.

Fitzpatrick breakdown: Fitzpatrick I–III 51.0%; Fitzpatrick V: 42 patients (27.1% of total). Skin tone: 62.9% non-pigmented, 37.1% dark.

Key Fitzpatrick result: Device specificity by skin type — 53.2% for Fitzpatrick I–III vs 69.1% for Fitzpatrick IV–VI. Specificity is therefore maintained, and in fact higher, in darker skin types.

Device overall: sensitivity 90.0% (95% CI 71.4–100%), specificity 60.7%; PCC alone: sensitivity 40.0%, specificity 84.8%; AUC device 0.815 vs PCC 0.643.

T2/T8 significance: Tepedino 2024 provides direct evidence that the ESS-based AI maintains specificity across Fitzpatrick types I–VI, with 27.1% of the study population being Fitzpatrick V. Specificity is actually higher in darker skin (69.1% vs 53.2%). This is the strongest available Option A evidence for the T2 hybrid argument — it goes beyond Walker 2025 and Dulmage 2021 by showing in-practice primary care data with a substantial proportion of Fitzpatrick V patients.

New downstream edit triggered: Add Tepedino 2024 to the T2 Option A evidence set alongside Walker 2025 and Dulmage 2021 as primary care evidence for maintained AI performance across Fitzpatrick types.

Reinforcement of T4 findings — additional BCC/cSCC non-specialist benchmarks​

The following articles add supporting context to T4 (BCC/cSCC AI in non-specialist settings) but do not change the core conclusions:

Krakowski et al. 2024 (meta-analysis, 17 studies): Non-dermatologist clinicians (PCPs, nurse practitioners, medical students) achieved pooled SE 66.3% / SP 70.1% without AI, improving to SE 79.3% / SP 80.9% with AI (p=0.003/0.011). Largest AI benefit was in the non-dermatologist subgroup. Dermatologists without AI: SE 81.8% / SP 79.2%, rising to 86.5% / 87.2% with AI. This establishes that AI provides the greatest relative improvement in non-specialist settings — contextualising the device's intended use.

Marsden et al. 2024 (UK teledermatology RCT): AIaMD (DERM) set at 91–92.5% sensitivity for malignancy; specificity AIaMD-A 77.5% vs SoC 73.6% (p=0.001 favouring AIaMD); reduces unnecessary urgent referrals. All Fitzpatrick types represented; 4.0% type IV–VI (25/622) — insufficient sample for Fitzpatrick subgroup analysis. Supports teledermatology context for the device's non-specialist pathway.

Sangers et al. 2022 (CE-marked mHealth app): Prospective multi-centre diagnostic accuracy study at GP-referred dermatology outpatient; sensitivity 86.9% (95% CI 82.3–90.7%), specificity 70.4% (66.2–74.3%); >80% Fitzpatrick I–II (limitation). BCC (116 cases) and SCC (40 cases) in the suspicious-lesion group. Establishes CE-marked mHealth accuracy in a semi-primary-care-pathway context.

Gregor et al. 2023 (GP feasibility pilot): mHealth app for skin cancer detection in GP practices; small pilot (n=70); app SE 87–95%, SP 70–78% (estimated from prior validation); GP SE 80.0% (95% CI 44.4–97.5%), SP 80.0% (63.1–91.6%) on 11 (pre)malignant + 35 benign cases; 90% white skin (limitation). Demonstrates AI-app feasibility in GP primary care pathway; small n limits generalisability.

Chen et al. 2024 (systematic review): Dermatologist keratinocyte carcinoma performance benchmark (clinical exam): SE 79.0%, SP 89.1%; dermoscopy: SE 83.7%, SP 87.4%; PCP: SE 81.4%, SP 80.1%. Fitzpatrick mostly I–III; 5 of included studies reported Fitzpatrick type. Provides the specialist dermatologist ceiling benchmark against which non-specialist AI-aided performance can be compared.

Barata et al. 2023 (specialist RL dermoscopy): RL model improved melanoma SE from 61.4% to 79.5% (95% CI 73.5–85.6%); BCC SE improved to 87.1% (95% CI 80.3–93.9%) in 89-dermatologist reader study. Specialist-setting only; not applicable to non-specialist T4 context but relevant to the overall dermatology AI SotA. Dataset diversity (HAM10000 + patient-centered subset from Portugal/Argentina) noted.

Answer​

New downstream edits triggered by T7:

  1. IHS4 ICC (T5): Add Goldfarb 2021 as a second supporting reference alongside Wiala 2024, confirming that IHS4 inter-rater ICC in published clinical studies ranges from 0.47 to >0.75 depending on rater experience and training, and that 0.727 falls within the upper-moderate/approaching-high range.
  2. Fitzpatrick T2 evidence (T2/T8): Add Tepedino 2024 to Option A evidence — provides in-practice primary care data with 27.1% Fitzpatrick V patients; device specificity 69.1% in Fitzpatrick IV–VI vs 53.2% in I–III.
  3. T4 SotA reinforcement: Krakowski 2024, Marsden 2024, Sangers 2022, Gregor 2023, Chen 2024 all reinforce the non-specialist and teledermatology accuracy benchmarks. They can be added to the SotA NMSC section as secondary supporting evidence alongside the 4 primary T4 papers.

No new pediatric, autoimmune, or severity data found across the 10 reviewed articles (beyond Goldfarb 2021 for severity/ICC). T9–T12 searches remain necessary for those dimensions.


T8: Literature search B1 — Fitzpatrick V–VI AI dermatology​

Status: ✅ Done — 21 results screened; 9 papers included

Purpose: Determine whether T2 can follow Option A (cite external evidence) or must follow Option B (§6.5(e) declaration).

Search executed: 2026-04-10. PubMed, 21 results. Filters: Free full text, full text, English, Humans.

Eligibility screening (21 results)​

| # | Reference | Skin tone data? | Quantitative metrics? | Eligible? | Notes |
|---|---|---|---|---|---|
| 1–7 | GBD 2021/2023 epidemiology studies (Lancet, JACC) | ❌ | ❌ | ❌ No | GBD studies — matched on "sub-Saharan" / "machine learning" for forecasting; wrong domain |
| 8 | Menzies et al. 2023 — Lancet Digit Health | ❌ | ❌ | ❌ No | Study explicitly restricted to Fitzpatrick I–III; IV–VI excluded |
| 9 | Kim et al. 2025 — Sci Rep | Fitzpatrick III–IV only | Limited | ❌ No | Korean Demodex study; Fitzpatrick III–IV (not V–VI); notes need for diverse population validation |
| 10 | Mathur et al. 2021 — Dermatol Ther | Qualitative only | ❌ | ❌ No | CNN for COVID-19 skin lesions; mentions "robust on skin of color" but no V–VI specific metrics |
| 11 | Benčević et al. 2024 — Comput Methods Programs Biomed | All Fitzpatrick types | ✅ | ✅ Yes | Quantifies skin color bias in lesion segmentation; uses Fitzpatrick estimation across datasets |
| 12 | Aggarwal & Papay 2022 — J Dermatolog Treat | Brown skin tone | ✅ | ✅ Yes | AI for BCC/melanoma in racially diverse populations; reports sensitivity/specificity/AUC for brown skin |
| 13 | Groh et al. 2024 — Nat Med | Dark vs. light skin | ✅ | ✅ Yes | 389 dermatologists + 459 PCPs; 46 diseases; 4 pp accuracy gap for dark skin; AI improves accuracy but can exacerbate gap |
| 14 | Han et al. 2022 — J Invest Dermatol | Fitzpatrick III–IV only | ✅ | ❌ No | Korean RCT; Fitzpatrick III–IV (not V–VI); AI augmentation validated in Asian skin but not dark skin specifically |
| 15 | Pan et al. 2025 — Sci Rep | ❌ | ❌ | ❌ No | ML for NMSC epidemiological burden forecasting — not AI diagnostic validation in dark skin |
| 16 | Liu et al. 2023 — Dermatology | Skin of color (Fitzpatrick IV–VI) | ✅ | ✅ Yes | Systematic review specifically on AI for pigmented lesions in skin of color; 22 studies reviewed |
| 17 | Kamulegeya et al. 2023 — Afr Health Sci | Fitzpatrick 6 (Uganda) | ✅ | ✅ Yes | AI on Fitzpatrick 6 in Uganda; diagnostic accuracy 17% vs 69.9% (Caucasian); severe performance gap documented |
| 18 | Flament et al. 2023 — Skin Res Technol | South African men (dark skin) | ✅ | ✅ Yes | Automatic AI grading of facial signs in dark-skinned population; correlations 0.59–0.95 vs. dermatologists |
| 19 | Walker et al. 2025 — Oncology | Fitzpatrick I–III vs IV–VI | ✅ | ✅ Yes | Direct comparison: AUC 0.858 (I–III) vs 0.856 (IV–VI), p = NS; no significant performance difference |
| 20 | Tjiu & Lu 2025 — Medicina (Meta-Analysis) | Fitzpatrick I–III vs IV–VI | ✅ | ✅ Yes | Meta-analysis 18 studies: AUROC 0.89 (I–III) vs 0.82 (IV–VI); persistent fairness gap documented |
| 21 | Dulmage et al. 2021 — J Invest Dermatol | Fitzpatrick I–III vs IV–VI | ✅ | ✅ Yes | Point-of-care AI wide range skin diseases; accuracy 70% (I–III) vs 68% (IV–VI), p = 0.79 (NS) |

CRIT1-7 scoring — included papers​

| Criterion | Benčević 2024 | Aggarwal 2022 | Groh 2024 | Liu 2023 | Kamulegeya 2023 | Flament 2023 | Walker 2025 | Tjiu 2025 | Dulmage 2021 |
|---|---|---|---|---|---|---|---|---|---|
| CRIT1 (similar device or benchmark) | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 |
| CRIT2 (clinical setting) | 0 | 0 | 2 | 1 | 2 | 1 | 2 | 2 | 1 |
| CRIT3 (population representativeness) | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 1 |
| Relevance subtotal | 2/6 | 3/6 | 5/6 | 5/6 | 6/6 | 4/6 | 6/6 | 6/6 | 4/6 |
| CRIT4 (level of evidence ≥ 4) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| CRIT5 (quantitative performance data) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| CRIT6 (clinical significance) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| CRIT7 (statistical analysis) | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| Quality subtotal | 3/4 | 4/4 | 4/4 | 3/4 | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| Total weight | 5/10 | 7/10 | 9/10 | 8/10 | 9/10 | 8/10 | 10/10 | 10/10 | 8/10 |
| Include? | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |

CRIT2 note for Benčević and Aggarwal: Score 0 because both are computational/algorithm studies using public image datasets, not clinical deployment studies.

CRIT3 note for Groh and Dulmage: Score 1 because the populations do not fully map to the device's target patient population (Groh uses 46-condition test sets in a simulated teledermatology scenario; Dulmage uses a test bank of images for morphology classification).

CRIT7 note for Liu and Kamulegeya: Score 0 because neither provides inferential statistical comparisons for the skin-tone performance metrics specifically (Liu is a systematic review with quality review; Kamulegeya reports only descriptive accuracy).

Key data — highest-priority papers​

Tjiu & Lu 2025 — Medicina (10/10) — Meta-Analysis​

Full citation: J.W. Tjiu, C.F. Lu. "Equity and Generalizability of Artificial Intelligence for Skin-Lesion Diagnosis Using Clinical, Dermoscopic, and Smartphone Images: A Systematic Review and Meta-Analysis." Medicina (Kaunas). 2025;61(12):2186. DOI: 10.3390/medicina61122186.

Design: Systematic review and meta-analysis; 18 studies (11 melanoma, 7 mixed benign-malignant); PubMed/Embase/Web of Science/ClinicalTrials.gov; QUADAS-2 risk of bias; GRADE evidence certainty.

Key results:

  • Pooled sensitivity: 0.91 (95% CI 0.74–0.97); pooled specificity: 0.64 (95% CI 0.47–0.78)
  • HSROC AUROC overall: 0.88 (95% CI 0.84–0.92)
  • AUROC by skin tone: 0.82 (Fitzpatrick IV–VI) vs 0.89 (Fitzpatrick I–III) — fairness gap documented
  • Performance by setting: specialist 0.90, community care 0.85, smartphone 0.81
  • Conclusion: "AI-based dermatology systems achieve high diagnostic accuracy but demonstrate reduced performance in darker skin tones"

T8 significance: Strongest available evidence quantifying the AI skin-tone performance gap. AUROC 0.82 for Fitzpatrick IV–VI is still clinically relevant (above the 0.80 SotA benchmark for malignancy detection), but lower than light skin. This gap is field-wide and supports the §6.5(e) acceptable gap argument.

Walker et al. 2025 — Oncology (10/10)​

Full citation: B.N. Walker, T.W. Blalock, R. Leibowitz, Y. Oron, D. Dascalu, E.O. David, A. Dascalu. "Skin Cancer Detection in Diverse Skin Tones by Machine Learning Combining Audio and Visual Convolutional Neural Networks." Oncology. 2025;103(5):413–420. DOI: 10.1159/000541573.

Design: Retrospective; 60 Fitzpatrick I–III vs. 72 Fitzpatrick IV–VI biopsy-validated smartphone images; dual audio-visual CNN (sonification-aided); malignant vs. benign dichotomous output.

Key results:

  • AUC: 0.858 (I–III, 95% CI 0.795–0.921) vs 0.856 (IV–VI, 95% CI 0.759–0.953), p = NS
  • Sensitivity: 84.4% (71.8–96.9) vs 79.6% (63.4–93.8), p = NS
  • Specificity: 84.2% (72.6–95.8) vs 85.3% (73.4–97.2), p = NS
  • Accuracy: 0.817 (I–III) vs 0.847 (IV–VI) — comparable

T8 significance: Strongest available evidence that a well-designed AI skin cancer detection tool achieves statistically equivalent performance across Fitzpatrick I–III and IV–VI. Supports Option A for T2. Limitation: small sample per group; single AI architecture (sonification method).

Groh et al. 2024 — Nat Med (9/10)​

Full citation: M. Groh, O. Badri, R. Daneshjou, A. Koochek, C. Harris, L.R. Soenksen, P.M. Doraiswamy, R. Picard. "Deep learning-aided decision support for diagnosis of skin disease across skin tones." Nat Med. 2024;30(2):573–583. DOI: 10.1038/s41591-023-02728-3.

Design: Large-scale digital experiment; 389 board-certified dermatologists + 459 PCPs from 39 countries; 364 images, 46 skin diseases; store-and-forward teledermatology simulation.

Key results:

  • Specialist accuracy: 38%; generalist accuracy: 19%
  • Both specialists AND generalists were 4 pp less accurate for dark skin images (human baseline gap)
  • Fair deep learning AI improved both specialists and generalists by >33% overall
  • BUT AI exacerbated the accuracy gap across skin tones for generalists

T8 significance: Key finding — the skin-tone accuracy gap is a human (not just AI) limitation. AI-aided diagnosis improves overall accuracy but does not necessarily close the skin-tone gap. This contextualises the device's phototype limitation as consistent with the SotA challenge. The 4 pp gap in specialist performance across skin tones is the most relevant human benchmark.

Liu et al. 2023 — Dermatology (8/10) — Systematic Review​

Full citation: Y. Liu, C.A. Primiero, V. Kulkarni, H.P. Soyer, B. Betz-Stablein. "Artificial Intelligence for the Classification of Pigmented Skin Lesions in Populations with Skin of Color: A Systematic Review." Dermatology. 2023;239(4):499–513. DOI: 10.1159/000530225.

Design: Systematic review; 22 eligible articles; only studies with ≥ 10% skin-of-color images in training data.

Key results:

  • Majority from East Asian populations (Chinese 7/22, Korean 5/22, Japanese 3/22)
  • Only 7 studies included Fitzpatrick IV–VI or diverse datasets
  • Binary outcome accuracy: 70–99.7%; multiclass accuracy: 43–93%
  • "Insufficient evidence to comment on the overall accuracy of AI models for darker skin types" (Fitzpatrick V–VI specifically)

T8 significance: The field-wide evidence gap for Fitzpatrick V–VI is confirmed — the SotA itself lacks adequate representation. This is the strongest argument for the §6.5(e) acceptable gap: the device's limitation mirrors the state of the art. Cannot be attributed to a device-specific failure.

Kamulegeya et al. 2023 — Afr Health Sci (9/10)​

Full citation: L. Kamulegeya, J. Bwanika, M. Okello, D. Rusoke, F. Nassiwa, W. Lubega, D. Musinguzi, A. Börve. "Using artificial intelligence on dermatology conditions in Uganda: a case for diversity in training data sets for machine learning." Afr Health Sci. 2023;23(2):753–763. DOI: 10.4314/ahs.v23i2.86.

Design: Retrospective; 123 images from Ugandan telehealth database; Fitzpatrick 6 (dark skin); tests Skin Image Search AI app.

Key results:

  • Overall AI accuracy on Fitzpatrick 6: 17% (21/123)
  • Reported training performance (Caucasian): 69.9%
  • Performance by condition: dermatitis best (80%); most conditions very low
  • "Need for diversity of image datasets used to train dermatology algorithms"

T8 significance: Documents the worst-case performance gap for an untrained/undertrained AI on dark skin. The 17% accuracy is for a generic consumer AI app not specifically trained on dark skin — useful as evidence that skin-tone performance gap is a recognised SotA challenge. The device's ViT-based approach specifically trained on a multi-ethnic dataset is architecturally different. Use to contextualise the field-wide challenge.

Dulmage et al. 2021 — J Invest Dermatol (8/10)​

Full citation: B. Dulmage, K. Tegtmeyer, M.Z. Zhang, M. Colavincenzo, S. Xu. "A Point-of-Care, Real-Time Artificial Intelligence System to Support Clinician Diagnosis of a Wide Range of Skin Diseases." J Invest Dermatol. 2021;141(5):1230–1235. DOI: 10.1016/j.jid.2020.08.027.

Design: Point-of-care AI tested on 222 images of heterogeneous Fitzpatrick types.

Key results:

  • Overall AI accuracy: 68%
  • Fitzpatrick I–III: 70%; Fitzpatrick IV–VI: 68%, p = 0.79 (NS)

T8 significance: No statistically significant difference between AI accuracy in Fitzpatrick I–III and IV–VI for a wide-range skin disease AI (p = 0.79). Supports Option A. Limitation: single study, 222 images.

Answer — T2 decision triggered​

T2 decision: Hybrid approach (Option A + B combined)

The evidence from T8 is mixed and supports neither a pure Option A nor a pure Option B approach:

For Option A (cite external evidence showing adequate V–VI performance):

  • Walker 2025: AUC 0.856 vs 0.858 (p = NS) — no significant difference in skin cancer detection across Fitzpatrick groups
  • Dulmage 2021: 68% vs 70% accuracy (p = 0.79) — no significant difference for wide-range skin disease diagnosis

For Option B (§6.5(e) acceptable gap declaration):

  • Tjiu 2025 meta-analysis: AUROC 0.82 (IV–VI) vs 0.89 (I–III) — persistent 7-point gap across the field
  • Liu 2023 systematic review: insufficient evidence for Fitzpatrick V–VI across the entire SotA — the field itself lacks adequate data
  • Kamulegeya 2023: 17% vs 69.9% accuracy for untrained AI on Fitzpatrick 6 — confirms data gap and training dependency
  • Groh 2024: both human specialists AND AI are 4 pp less accurate in dark skin — gap is not device-specific

Recommended T2 approach:

  1. Apply Option A: cite Walker 2025 and Dulmage 2021 as SotA evidence that well-designed AI skin tools can achieve comparable performance across Fitzpatrick I–VI
  2. Apply Option B simultaneously: add a formal §6.5(e) acceptable gap declaration noting that the field-wide evidence for Fitzpatrick V–VI is insufficient (Liu 2023), the gap is SotA-wide (Tjiu 2025: AUROC 0.82 vs 0.89), and the device's ViT-based architecture handles phototype variation through relative intensity assessment
  3. Cite the device's own ASCORAD_2022 study as internal evidence for Fitzpatrick IV–VI testing (112 images)
  4. Confirm PMCF monitoring commitment for phototype performance stratification

T9: Literature search B2 — Pediatric AI dermatology​

Status: ✅ Done — 26 results screened; 1 paper directly included for T9; 4 papers redirected to T10/T11

Purpose: Contextualise the 6.3% pediatric proportion in the device's clinical evidence base.

Search executed: 2026-04-10. PubMed, 26 results. Filters: Free full text, Full text, English, Humans.

Important search observation: The very low yield of qualifying papers is itself evidence of a field-wide gap. Only 1 of 26 results reported AI diagnostic performance specifically in a pediatric skin disease population — consistent with the general SotA literature showing that pediatric AI dermatology is underrepresented. This supports the §6.5(e) acceptable gap argument for the 6.3% pediatric proportion.

Eligibility screening (26 results)​

| # | Reference | Pediatric skin AI? | Eligible? | Notes |
|---|---|---|---|---|
| 1 | Goodman et al. 2023 — JAMA Netw Open | ❌ | ❌ No | ChatGPT accuracy for physician questions — wrong domain |
| 2 | Wang et al. 2024 — Nat Commun | ❌ | ❌ No | AI for cervical cytology — wrong anatomical site |
| 3 | Fetahu et al. 2023 — Nat Commun | ❌ | ❌ No | Single-cell transcriptomics in neuroblastoma — cancer biology |
| 4 | Hu et al. 2025 — Sci Rep | ❌ | ❌ No | ML for melanoma prognosis — prognostic prediction, not pediatric AI diagnosis |
| 5 | Abràmoff et al. 2022 — Ophthalmology | ❌ | ❌ No | AI for ophthalmic images — wrong anatomical site |
| 6 | Wang et al. 2024 — Comput Biol Med | ❌ | ❌ No | DL for pilomatricoma histopathology WSI — specialist histopathology, not clinical AI diagnosis |
| 7 | Huang et al. 2024 — Artif Intell Med | ❌ (T11/T12) | ❌ No for T9 | AI eczema severity systematic review — redirected to T11 |
| 8 | Marri et al. 2024 — JMIR Dermatol | ⚠️ Borderline | ⚠️ Weak | Aysa AI app includes patients ≥2 years but no pediatric-specific performance data; redirected to T11 |
| 9 | Yu et al. 2025 — Photodiagnosis Photodyn Ther | ✅ Yes | ✅ Yes | Deep learning vs. dermatologists for childhood vitiligo — 474 pediatric patients; DL AUC 0.91 |
| 10 | Mashoudy et al. 2024 — Arch Dermatol Res | ❌ | ❌ No | Telemedicine review for skin cancer care — not pediatric-specific |
| 11 | Lu et al. 2025 — Sci Rep | ❌ | ❌ No | Pediatric T-cell ALL biomarkers — leukemia, not skin |
| 12 | Jalali-Najafabadi et al. 2021 — Sci Rep | ❌ | ❌ No | Genetic risk prediction for psoriasis/AS — not diagnostic AI; redirected to T11 |
| 13 | Aung et al. 2025 — JAMA Netw Open | ❌ | ❌ No | TIL assessment in melanoma — not pediatric-specific |
| 14 | Huo et al. 2025 — Cell Signal | ❌ | ❌ No | Molecular pathway in keloid — molecular biology |
| 15 | Soe et al. 2024 — J Med Internet Res | ❌ | ❌ No | AI to differentiate mpox — not pediatric-specific |
| 16 | Lang et al. 2020 — J Invest Dermatol | ❌ | ❌ No | Ciliation index for spitzoid neoplasms — not AI-based |
| 17 | Ha et al. 2022 — Pediatr Rheumatol | ❌ | ❌ No | Blood transcriptomics in pediatric rheumatic diseases — no skin AI diagnostic outcome |
| 18 | Kurugol et al. 2015 — J Invest Dermatol | ❌ | ❌ No | Automated DEJ delineation in RCM — technical methodology only |
| 19 | Benítez-Andrades et al. 2025 — Sci Rep | ❌ | ❌ No | Accelerometer for balance in schoolchildren — wrong domain |
| 20 | Yan et al. 2025 — Sci Rep | ❌ | ❌ No | ML for port-wine stain treatment prediction — treatment response, not diagnostic AI |
| 21 | Maintz et al. 2021 — JAMA Dermatol | ❌ (T11/T10) | ❌ No for T9 | ML deep phenotyping of AD in adolescent/adult — redirected to T11 |
| 22 | Seité et al. 2019 — Exp Dermatol | ❌ (T10) | ❌ No for T9 | AI acne grading from smartphone — adolescent-relevant but not pediatric-specific; redirected to T10 |
| 23 | Zhou et al. 2021 — PLoS One | ❌ | ❌ No | Predicting psoriasis from lab tests — no imaging AI; redirected to T11 |
| 24 | Lipids Health Dis. 2025 | ❌ | ❌ No | Cardiovascular risk prediction — wrong domain |
| 25 | Au et al. 2023 — Sensors | ❌ (T12) | ❌ No for T9 | Sensorised glove for AD scratching detection — wearable sensor; redirected to T12 |
| 26 | Wang et al. 2025 — Sci Rep | ❌ | ❌ No | DL for infant fundus photography quality — ophthalmology |

CRIT1-7 scoring — included paper​

| Criterion | Yu et al. 2025 |
|---|---|
| CRIT1 (study focus — similar device or clinical practice benchmark) | 1 |
| CRIT2 (clinical setting — dermatology, device supporting clinicians in skin assessment) | 2 |
| CRIT3 (population — pediatric, directly relevant to the gap) | 2 |
| Relevance subtotal | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 |
| CRIT5 (outcome measurement — quantitative performance data) | 1 |
| CRIT6 (clinical significance — head-to-head AI vs. dermatologist comparison) | 1 |
| CRIT7 (statistical analysis — ROC curves, CIs) | 1 |
| Quality subtotal | 4/4 |
| Total weight | 9/10 |
| Include? | ✅ Yes |

CRIT1 note: Scores 1 (not 2) because the study focuses on vitiligo diagnosis, not on the multi-condition AI assessment that characterises the device. The study is functionally adjacent (AI dermoscopy image classification in a clinical dermatology setting) but the specific clinical function differs.
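The CRIT1-7 weighting applied throughout this log is simple arithmetic: three relevance criteria scored 0–2 (subtotal out of 6) and four quality criteria scored 0–1 (subtotal out of 4), summed to a total weight out of 10. A minimal sketch of that tally (function and variable names are illustrative, not taken from any QMS tooling):

```python
# Illustrative tally of the CRIT1-7 appraisal rubric used in these tables.
# Relevance criteria (CRIT1-3) score 0-2 each; quality criteria (CRIT4-7) 0-1 each.

def crit_totals(relevance, quality):
    """Return (relevance_subtotal, quality_subtotal, total_weight)."""
    assert len(relevance) == 3 and all(0 <= s <= 2 for s in relevance)
    assert len(quality) == 4 and all(s in (0, 1) for s in quality)
    r, q = sum(relevance), sum(quality)
    return r, q, r + q

# Yu et al. 2025 as scored above: CRIT1=1, CRIT2=2, CRIT3=2; CRIT4-7 all 1.
r, q, total = crit_totals([1, 2, 2], [1, 1, 1, 1])
print(f"Relevance {r}/6, quality {q}/4, total {total}/10")
```

The same tally reproduces every subtotal and total weight in the scoring tables below.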

Key data extracted — Yu et al. 2025​

Full citation: S. Yu, Z. Chen, J. He, H. Wang. "Comparative study of dermatologists and deep learning model on diagnosing childhood vitiligo." Photodiagnosis Photodyn Ther. 2025;54:104727. DOI: 10.1016/j.pdpdt.2025.104727.

Study design: Prospective comparative study; 474 pediatric patients (223 vitiligo, 251 controls); three imaging modalities (dermoscopic images, Wood's lamp images, standard clinical photographs); eight dermatologists performed double-blind evaluation; two DL models (ResNet152, DenseNet121) trained on 3,896 dermoscopic images (80/20 train/validation split); China.

DL model performance (dermoscopy):

  • ResNet152: accuracy 83.08%, recall 86.84%, precision 81.08%, specificity 79.22%, F1 0.8386, AUC 0.91
  • DenseNet121: accuracy 81.41%, recall 83.41%, precision 82.03%, specificity 79.12%, F1 0.8271, AUC 0.89
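As a consistency check, the reported F1 scores can be recomputed from the precision and recall figures above via the harmonic-mean identity F1 = 2PR/(P+R); both model rows are internally consistent:

```python
# Cross-check of the reported F1 scores from precision and recall
# (F1 is the harmonic mean of the two).

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# ResNet152: precision 81.08%, recall 86.84% -> reported F1 0.8386
print(round(f1(0.8108, 0.8684), 4))  # 0.8386
# DenseNet121: precision 82.03%, recall 83.41% -> reported F1 0.8271
print(round(f1(0.8203, 0.8341), 4))  # 0.8271
```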

Dermatologist performance (dermoscopy only, 8 clinicians):

  • AUC 0.77 (95% CI 0.51–1.00), sensitivity 0.88 (95% CI 0.53–0.99), specificity 0.75 (95% CI 0.41–0.96)
  • Performance correlated with years of experience

T9 significance: Both DL models outperform dermatologists on the AUC metric (0.91/0.89 vs 0.77) for diagnosis of vitiligo in a purely pediatric population (474 children). This is the only published study we identified with AI achieving high diagnostic accuracy in a dedicated pediatric skin disease population.

Dual T11 relevance: Vitiligo is an autoimmune skin condition — this paper is directly relevant to both T9 (pediatric AI dermatology) and T11 (autoimmune skin disease AI detection). Record under both tasks.

Limitations: Single condition (vitiligo); single dermoscopy modality evaluated in head-to-head; only 8 dermatologists (wide CIs on clinician performance); China single-site; DenseNet121 performance slightly lower than ResNet152.

Papers redirected to other tasks​

Paper 7 — Huang et al. 2024 (Artif Intell Med): Systematic review of AI for eczema severity from digital images (25 studies). Notes that only 28% of studies report patient age range and 16% report skin phototype. Confirms field-wide data quality gap for age-disaggregated AI severity assessment. Redirected to T11 and T12.

Paper 21 — Maintz et al. 2021 (JAMA Dermatol): ML deep phenotyping of AD severity in adolescent and adult patients (n=367; 94% adults, 6% adolescents aged 12–21 years); EASI-based severity stratification; ML gradient-boosting AUC 0.71 (95% CI 0.69–0.72) for severity classification. Redirected to T11 (atopic dermatitis AI severity prediction).

Paper 22 — Seité et al. 2019 (Exp Dermatol): AI algorithm for acne grading from smartphone (1,072 patients; GEA scale; 68% agreement with dermatologists at final algorithm version). Redirected to T10 (AI severity grading — real-world clinical studies).

Paper 25 — Au et al. 2023 (Sensors): Sensorised glove for scratching detection in atopic dermatitis (ML model accuracy 83%–99%; pilot in 6 children). Wearable sensor approach, not visual AI. Redirected to T12 (UAS monitoring — adjacent severity assessment technology).

Answer — use in CER​

Key conclusion: The T9 search found only 1 qualifying paper for pediatric AI dermatology — Yu et al. 2025 (childhood vitiligo, AUC 0.91 for DL). The thin yield across 26 results mirrors the field-wide evidence gap: pediatric AI dermatology diagnosis is a recognised SotA limitation.

CER representativeness section — dual-track argument:

  1. Positive evidence (Option A): Cite Yu et al. 2025 — AI achieves AUC 0.91 in a purely pediatric skin disease population, outperforming dermatologists (AUC 0.77), demonstrating that AI diagnostic tools can generalise to pediatric patients.
  2. Acceptable gap (Option B): The SotA itself lacks adequate pediatric-disaggregated evidence; only 1 of 26 pediatric AI dermatology search results was qualifying. Declare the 6.3% pediatric proportion as a formally justified §6.5(e) acceptable gap, noting field-wide underrepresentation and PMCF monitoring commitment.

T10: Literature search B3 — Severity Pillar 3 real-world clinical studies​

Status: ✅ Done — 4 search results screened; 4 papers included (3 primary + 1 contextual), one of which came via the T9 redirect

Purpose: Find published studies using AI severity assessment in real clinical encounters (not atlas images) to partially bridge Gap 2 before PMCF results are available.

Search executed: 2026-04-10. PubMed, 4 results. Filters: Free full text, Full text, English, Humans. Small yield reflects the tight search string (requiring both ICC/agreement AND clinical/prospective AND AI/smartphone AND severity score).

Additional paper added: Seité et al. 2019 (Exp Dermatol) — AI acne grading from smartphone, redirected from T9 screening.

Eligibility screening (4 results + 1 redirect)​

| # | Reference | AI/smartphone severity in clinical encounter? | ICC or agreement reported? | Eligible? | Notes |
|---|---|---|---|---|---|
| 1 | Schaap et al. 2022 — J Eur Acad Dermatol Venereol | ✅ CNN automated PASI scoring; clinical images from treating physician encounters | ✅ ICC 0.58–0.79 per subscore | ✅ Yes | Primary evidence — CNN matches physician; real clinical images |
| 2 | Maulana et al. 2024 — Narra J | ⚠️ DL PASI classification, 1,546 "clinical images" | ⚠️ Cohen's kappa; no ICC; no clinical encounter matching | ⚠️ Weak | Dataset-only development study; no real clinical encounter validation; "further clinical validation required" |
| 3 | Ali et al. 2022 — Skin Res Technol | ✅ Smartphone photos taken by patients at home; assessed by dermatologists | ✅ ICC 0.86–0.90 (photo vs. clinical) | ✅ Yes | Primary evidence — AD (EASI/SCORAD); highest ICC in search |
| 4 | Ali et al. 2024 — Dermatology | ⚠️ Consumer self-assessment (not AI); physician assesses photos | ✅ ICC 0.23 (weak) | ✅ Yes (contextual) | Establishes that unaided patient self-assessment is unreliable; supports AI assistance need |
| — | Seité et al. 2019 — Exp Dermatol | ✅ AI algorithm for acne GEA grading from smartphone | ⚠️ 68% agreement (GEA grade); no ICC | ✅ Yes | From T9 redirect; smartphone AI severity from dermatology clinic |

CRIT1-7 scoring — included papers​

| Criterion | Schaap 2022 | Ali 2022 | Ali 2024 | Seité 2019 |
|---|---|---|---|---|
| CRIT1 (AI/automated severity scoring, similar to device function) | 2 | 2 | 1 | 2 |
| CRIT2 (clinical setting — real patients, clinical data) | 2 | 2 | 1 | 1 |
| CRIT3 (population representativeness — chronic inflammatory skin disease) | 2 | 1 | 1 | 1 |
| Relevance subtotal | 6/6 | 5/6 | 3/6 | 4/6 |
| CRIT4 (level of evidence ≥ 4 — prospective/validation) | 1 | 1 | 1 | 1 |
| CRIT5 (quantitative ICC or agreement metric) | 1 | 1 | 1 | 1 |
| CRIT6 (clinical significance — aids treatment monitoring or decisions) | 1 | 1 | 0 | 1 |
| CRIT7 (statistical analysis — CIs or inferential statistics) | 1 | 1 | 1 | 0 |
| Quality subtotal | 4/4 | 4/4 | 3/4 | 3/4 |
| Total weight | 10/10 | 9/10 | 6/10 | 7/10 |
| Include? | ✅ Yes | ✅ Yes | ✅ Yes (contextual) | ✅ Yes |

CRIT2 note for Ali 2024: Score 1 (not 2) because the clinical encounter is indirect — the physician assesses photographs retrospectively from a consumer database (NØIE), not from clinical visits.

CRIT2 note for Seité 2019: Score 1 (not 2) because the study is algorithm development on a curated dataset from a dermatology clinic, not a prospective clinical validation study.

CRIT3 note for Ali 2022: Score 1 (not 2) because the study is restricted to mild-to-moderate AD (EASI ≤21); severe cases excluded. Limits generalisability to full severity range.

CRIT6 note for Ali 2024: Score 0 because the result is negative — weak agreement (ICC 0.23) means the study does NOT support clinical utility of this approach; it demonstrates the insufficiency of unaided self-assessment.

Key data extracted​

Schaap et al. 2022 — J Eur Acad Dermatol Venereol (10/10)​

Full citation: M.J. Schaap, N.J. Cardozo, A. Patel, E.M.G.J. de Jong, B. van Ginneken, M.M.B. Seyger. "Image-based automated Psoriasis Area Severity Index scoring by Convolutional Neural Networks." J Eur Acad Dermatol Venereol. 2022;36(1):68–75. DOI: 10.1111/jdv.17711.

Study design: Retrospective validation; CNN-based automated PASI subscore classification from standardized clinical photographs; real clinical data — images matched to PASI subscores determined by treating physician in clinical practice; N = 576 trunk, 614 arm, 541 leg image series; Netherlands dermatology clinic.

CNN ICC vs. real-life clinical scores (trunk region):

  • Erythema: CNN 0.616 vs physician image-based 0.558
  • Desquamation: CNN 0.580 vs physician image-based 0.589 (physicians marginally better)
  • Induration: CNN 0.580 vs physician image-based 0.573
  • Area: CNN 0.793 vs physician image-based 0.694

Physician image-based PASI ICC (inter-rater, N=5): 0.706–0.793 (moderate-good agreement).

Key finding: CNN performs comparably to or slightly better than trained physicians for image-based PASI scoring. Area scoring is the most reliable domain (ICC 0.793 for CNN). Performance is consistent across trunk, arms, and legs.

Gap 2 significance: Establishes that automated CNN-based PASI scoring from clinical photographs achieves ICC 0.58–0.79 in real clinical data — directly comparable to the AIHS4_2023 ICC of 0.727. Provides the SotA benchmark for AI severity scoring in clinical encounters and shows that CNN performance matches physician performance for the most objective PASI subscores (area, erythema, induration).

Ali et al. 2022 — Skin Res Technol (9/10)​

Full citation: Z. Ali, A. Chiriac, T. Bjerre-Christensen, et al. "Mild to moderate atopic dermatitis severity can be reliably assessed using smartphone-photographs taken by the patient at home: A validation study." Skin Res Technol. 2022;28(2):336–341. DOI: 10.1111/srt.13136.

Study design: Prospective validation; N=79 participants; AD severity evaluated in clinic by two assessors (EASI, SCORAD, IGA); participants photographed lesions at home using own smartphone; photographs assessed twice (8-week interval) by five dermatologists experienced in photographic evaluation; Denmark.

Key ICC results:

  • Clinical EASI vs photographic EASI: ICC 0.88 (95% CI 0.81–0.93)
  • Clinical SCORAD vs photographic SCORAD: ICC 0.86 (95% CI 0.70–0.93)
  • Perfect IGA agreement between clinical and photographic: 62%; never deviating >1 grade
  • Inter-rater ICC for photographic EASI: 0.90 (0.85–0.94)
  • Inter-rater ICC for photographic SCORAD: 0.96 (0.91–0.98)
  • Intra-rater reliability (photographic EASI): 0.95–0.98

Key finding: Excellent agreement (ICC 0.86–0.90) between clinical severity assessment and smartphone-photograph-based assessment for mild-to-moderate AD. The photographic ICC (0.88 for EASI) substantially exceeds the threshold for "good" reliability. Inter-rater reliability of photographic EASI (0.90) is higher than many direct clinical assessments reported in the literature.

Gap 2 significance: The highest ICC data in the search. Establishes the SotA ceiling for smartphone-based severity assessment — ICC up to 0.90 is achievable in real clinical encounters. This is the benchmark against which the device's Pillar 3 PMCF study should be designed.
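The agreement statistic underlying all of these benchmarks is the intraclass correlation coefficient. The exact ICC model each paper used (one-way vs. two-way, consistency vs. absolute agreement) is not restated here, so treat the following as a sketch of one common choice, ICC(2,1): two-way random effects, absolute agreement, single rater.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n_subjects, k_raters) array of scores."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = sse / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Two raters in perfect agreement across three subjects -> ICC = 1.0
print(icc_2_1([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

Note the absolute-agreement form penalises a systematic offset between raters, which is why it is the stricter (and usually the relevant) choice when comparing photographic against clinical scoring.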

Ali et al. 2024 — Dermatology (6/10) — contextual gap evidence​

Full citation: Z. Ali, A. Al-Mousawi, B.Þ. Björnsson, A. Egeberg, C. Riemer, S.F. Thomsen. "The Agreement between Consumer-Driven Self-Assessment of Psoriasis Severity and Physician-Assessed Severity Based on Patient-Taken Photographs Is Weak." Dermatology. 2024;240(3):362–368. DOI: 10.1159/000536175.

Study design: Cross-sectional; N=187 psoriasis patients from NØIE consumer database (Denmark, 2009–2022); patient self-assessed severity (0–10 scale converted to 0–4 PASI-equivalent); physician assessed severity from patient-taken smartphone photographs (erythema, induration, scaling 0–4 PASI-equivalent).

ICC results:

  • Overall: ICC 0.23 (95% CI 0.00–0.92) — very weak
  • Chronic patients: ICC 0.34 (95% CI 0.00–0.95)
  • Non-chronic: ICC 0.09 (95% CI −0.01 to 0.82)
  • Men: ICC 0.53; women: ICC 0.12

Gap 2 significance: Consumer-driven self-assessment without AI is unreliable (ICC 0.23). This paper establishes the lower bound of unaided self-assessment and demonstrates why AI assistance is needed to achieve reliable severity scoring. The contrast with Paper 3 (ICC 0.86–0.90 with physician-assessed photos) and Paper 1 (ICC 0.58–0.79 with CNN) quantifies the benefit of AI/physician involvement over unaided self-assessment.

Seité et al. 2019 — Exp Dermatol (7/10)​

Full citation: S. Seité, A. Khammari, M. Benzaquen, D. Moyal, B. Dréno. "Development and accuracy of an artificial intelligence algorithm for acne grading from smartphone photographs." Exp Dermatol. 2019;28(11):1252–1257. DOI: 10.1111/exd.14022.

Study design: Algorithm development + validation; N=1,072 acne patients; 5,972 smartphone images; GEA (Global Evaluation of Acne) scale; three trained dermatologists provided reference grades; AI algorithm trained and iterated across six versions; France.

Key metric: At final version 6, GEA grading by AI algorithm reached 68% agreement with dermatologist consensus. Algorithm identifies comedonal, inflammatory lesion types and post-inflammatory hyperpigmentation.

Gap 2 significance: Demonstrates AI acne severity grading from smartphones with 68% agreement — a lower ICC-equivalent than AD (Paper 3, ICC 0.88) and PASI (Paper 1, ICC 0.58–0.79). Acne severity AI from smartphones is less mature than PASI/EASI automation. Useful as contextual SotA evidence that severity grading from smartphones is feasible across multiple skin conditions, even if acne is not the device's primary severity function.

Answer — use in CER Gap 2 declaration​

Key conclusion: The search yielded 3 eligible papers, which together with the Seité 2019 redirect from T9 give 4 included papers establishing a clear SotA benchmark for AI/smartphone-based severity scoring in clinical encounters:

| Paper | Condition | Scale | ICC / Agreement | Clinical data? |
|---|---|---|---|---|
| Schaap 2022 | Psoriasis | PASI | CNN 0.58–0.79 (≈ physician 0.71–0.79) | ✅ Matched to clinic PASI |
| Ali 2022 | Atopic dermatitis | EASI / SCORAD | 0.86–0.90 (photo vs. clinical) | ✅ Prospective clinic + home |
| Ali 2024 | Psoriasis | PASI-equivalent | 0.23 (unaided self-assessment) | ✅ Consumer RWD |
| Seité 2019 | Acne | GEA | 68% agreement | ✅ Dermatology clinic |

CER Gap 2 argument (two-part):

  1. SotA context: Automated/smartphone severity scoring achieves ICC 0.58–0.90 across dermatological conditions in clinical encounters. The device's AIHS4_2023 ICC of 0.727 is within this range — it is not an outlier. The SotA for PASI/EASI automation (ICC 0.58–0.90) provides the benchmark against which Pillar 3 PMCF data will be judged.
  2. Gap justification: No paper in the search validates real-world severity scoring in HS (IHS4), psoriasis (PASI), or urticaria (UAS) in an unselected primary care population with the device architecture used here. The SotA itself confirms that real-world clinical validation of AI severity assessment is still emerging — "further clinical validation and model refinement remain required" (Maulana 2024). This supports the §6.5(e) acceptable gap declaration, with PMCF monitoring as the resolution pathway.

Downstream edit: Add all four papers to the CER Gap 2 section. Update the §6.5(e) declaration to cite this SotA evidence showing that: (a) AI severity scoring in clinical encounters achieves ICC 0.58–0.90 across skin conditions — the device's AIHS4 ICC of 0.727 is within this range; (b) real-world clinical severity validation is a recognised SotA gap, not a device-specific failure.


T11: Literature search C1 — Autoimmune skin disease AI detection​

Status: ✅ Done — 105 results screened; 4 papers included (2 primary, 2 SotA context); field-wide gap confirmed

Purpose: Strengthen Gap 4 acceptable gap justification by showing the SotA itself lacks strong AI evidence for autoimmune visual diagnosis.

Search executed: 2026-04-10. PubMed, 105 results. Filters: Free full text, full text, English, Humans.

Key search observation: The vast majority of papers (>95%) involve genomics, transcriptomics, proteomics, or blood biomarker ML for systemic autoimmune diseases — not clinical skin image AI. Image-based AI papers use specialised modalities (nailfold capillaroscopy, IIF tissue sections, retinal OCTA). Only 1 paper uses clinical skin photographs for a disease panel that includes autoimmune skin conditions.

Eligibility screening — image-based AI papers (subset of 105)​

| # | Reference | Modality | Autoimmune condition | Clinical skin image AI? | Eligible? |
|---|---|---|---|---|---|
| 9 | Lledó-Ibáñez 2025 — Rheumatology | Nailfold capillaroscopy | SSc | ❌ Specialised | ⚠️ SotA context only |
| 20 | Bharathi 2023 — Rheumatology | Nailfold capillaroscopy | SSc | ❌ Specialised | ✅ Yes (SotA context — 7/10) |
| 36 | Hocke 2023 | IIF tissue sections | AIBD, pemphigus | ❌ Lab immunofluorescence | ❌ No — laboratory test |
| 40 | Garaiman 2023 — Rheumatology | Nailfold capillaroscopy | SSc | ❌ Specialised | ✅ Yes (SotA context — 6/10) |
| 97 | Li 2026 — RMD Open | Nailfold capillaroscopy | SSc/SLE/RA | ❌ Specialised | ⚠️ SotA context only |
| 100 | Mathur 2021 — Dermatol Ther | Clinical skin photographs | BP, urticaria (among 20 conditions) | ✅ Yes | ✅ Yes (9/10) |
| T9 ref | Yu 2025 — Photodiagnosis Photodyn Ther | Clinical dermoscopy | Vitiligo (autoimmune) | ✅ Yes | ✅ Yes (9/10, scored in T9) |
| T9 ref | Huang 2024 — Artif Intell Med | Digital skin photographs | AD (inflammatory) | ✅ Yes | ⚠️ Field-wide gap context |
| T9 ref | Maintz 2021 — JAMA Dermatol | Clinical data + images | AD (inflammatory) | ⚠️ Partial | ⚠️ Field-wide gap context |

Remaining 96 papers (not tabulated individually): All involve genomics, transcriptomics, proteomics, NLP, or biomarker ML for systemic autoimmune diseases (SLE n≈50, SSc n≈20, RA/other n≈26). No clinical skin photography AI.

CRIT1-7 scoring — included papers​

| Criterion | Mathur 2021 | Yu 2025 (vitiligo) | Bharathi 2023 (nailfold) | Garaiman 2023 (nailfold) |
|---|---|---|---|---|
| CRIT1 (image AI for skin conditions including autoimmune) | 2 | 2 | 1 | 1 |
| CRIT2 (clinical setting — supporting HCPs in skin assessment) | 2 | 2 | 1 | 1 |
| CRIT3 (includes autoimmune skin conditions) | 1 | 2 | 1 | 1 |
| Relevance subtotal | 5/6 | 6/6 | 3/6 | 3/6 |
| CRIT4 (level of evidence ≥ 4) | 1 | 1 | 1 | 1 |
| CRIT5 (quantitative performance data — AUC, accuracy) | 1 | 1 | 1 | 1 |
| CRIT6 (clinical significance — head-to-head or workflow impact) | 1 | 1 | 1 | 1 |
| CRIT7 (CIs or inferential statistics) | 1 | 1 | 1 | 0 |
| Quality subtotal | 4/4 | 4/4 | 4/4 | 3/4 |
| Total weight | 9/10 | 9/10 | 7/10 | 6/10 |
| Include? | ✅ Yes | ✅ Yes | ✅ Yes (SotA context) | ✅ Yes (SotA context) |

CRIT1–2 note for nailfold papers: Score 1 (not 2) because nailfold capillaroscopy is a specialised imaging tool, not clinical skin photography. The AI function is analogous but the modality and pathway differ.

CRIT3 note for Mathur 2021: Score 1 (not 2) because the study targets COVID-19 cutaneous manifestations primarily; BP and urticaria appear within a 20-condition panel.

Key data extracted​

Mathur et al. 2021 — Dermatol Ther (9/10)​

Full citation: P. Mathur, B.D. Srivastava, P. Mathur, et al. "Artificial Intelligence-Based Classification of Multiple Skin Lesions Including Autoimmune Cutaneous Manifestations of COVID-19 Using Convolutional Neural Networks." Dermatol Ther. 2021;34(2):e14791. DOI: 10.1111/dth.14791.

Study design: CNN ensemble (EfficientNet-B3, ResNet50, VGG19) training and validation on clinical skin images; 20 conditions including bullous pemphigoid and urticaria; tested for performance on skin of color.

Key results:

  • Top-1 accuracy (ensemble): 86.7%
  • COVID-19 rash AUC: 0.97
  • Per-condition sensitivity/specificity reported
  • Robust on skin of color — explicitly validated

Gap 4 significance: Only paper in 105 search results applying clinical skin photograph CNN to a panel including autoimmune skin conditions (BP, urticaria). AUC 0.97 for the primary condition; 86.7% top-1 accuracy across 20 conditions. Provides the SotA benchmark for this indication. No independent replication in autoimmune-specific populations exists.

Yu et al. 2025 — Photodiagnosis Photodyn Ther (9/10) [cross-reference from T9]​

Vitiligo is an autoimmune skin condition. DL model (ResNet152) AUC 0.91 in 474 pediatric vitiligo patients, outperforming dermatologists (AUC 0.77). Full data under T9.

Gap 4 significance: Together with Mathur 2021, these two papers form the totality of direct SotA evidence for clinical skin image AI applied to autoimmune conditions.

Bharathi et al. 2023 — Rheumatology (7/10) — SotA context​

Nailfold capillaroscopy DL for SSc. AUC 97% (94–99%), sensitivity/specificity 91% (86–95%). SSc expert consensus: sensitivity 82%, specificity 73%. AI outperforms experts.

Gap 4 significance: Demonstrates AI can achieve very high accuracy for autoimmune connective tissue disease from skin-surface images — establishing a comparable approach even if the modality differs.

Garaiman et al. 2023 — Rheumatology (6/10) — SotA context​

ViT-based nailfold capillaroscopy for SSc. AUC 81.8–84.5%. Same ViT architecture family as the device. One of four rheumatologists performed at or below ViT level.

Gap 4 significance: ViT-based AI (same architecture as the device) achieves 81.8–84.5% AUC for autoimmune condition detection from skin-surface images.

Answer — use in CER Gap 4 declaration​

Key conclusion: The T11 search found only 2 papers directly relevant to clinical skin image AI for autoimmune conditions — Mathur 2021 and Yu 2025. The thin yield across 105 results confirms a field-wide gap: clinical skin image AI validated for autoimmune conditions is extremely limited, not device-specific.

CER Gap 4 — §6.5(e) acceptable gap argument:

  1. Positive evidence: Mathur 2021 (CNN 86.7% top-1 for 20-condition panel including BP/urticaria) and Yu 2025 (AUC 0.91 for vitiligo) show AI can achieve promising accuracy on autoimmune skin conditions.
  2. Acceptable gap justification: Only 2 qualifying clinical skin image AI papers across 105 dedicated search results. The gap is SotA-wide, not addressable through literature review alone.
  3. PMCF commitment: Prospective monitoring of performance in bullous pemphigoid and urticaria subgroups.

Downstream edit: Update the CER Gap 4 §6.5(e) declaration to cite Mathur 2021 (CNN 86.7% for 20-condition panel) and Yu 2025 (AUC 0.91 for vitiligo) as the only SotA benchmarks; note that only 2 qualifying papers exist across 105 search results, formally confirming the field-wide gap.


T12: Literature search C2 — UAS inter-rater benchmarks​

Status: ✅ Done — 27 results screened; 2 papers with applicable UAS reliability data; clinician inter-rater benchmarks absent from field

Purpose: Contextualise the barely-met Krippendorff α = 0.603 for UAS severity.

Search executed: 2026-04-10. PubMed, 27 results. Filters: Free full text, full text, English, Humans.

Note on search string: "UAS" matched multiple unrelated acronyms — urinalysis system, unmanned aircraft system, ureteral access sheath, motor assessment scale, unprotected anal sex. Only 9 of 27 papers were urticaria-related; 2 contained UAS agreement/reliability data.

Eligibility screening — urticaria-relevant papers (9 of 27)​

| # | Reference | UAS agreement data? | Eligible? | Notes |
|---|---|---|---|---|
| 1 | Tuchinda 2022 — Thai 5-D itch scale | ⚠️ Contextual | ⚠️ Weak | ICC 0.90 for 5-D itch scale; UAS7 used as anchor |
| 3 | Schnarkowski 2025 — CholUAS | ⚠️ Contextual | ⚠️ Weak | Cholinergic urticaria-specific subtype score |
| 4 | Khoshkhui 2021 — Persian UCT | ❌ No | ❌ No | UCT reliability (Cronbach α 0.68); no UAS scoring data |
| 9 | Grekowitz 2025 — ColdUAS | ⚠️ Contextual | ⚠️ Weak | Cold urticaria-specific subtype score |
| 10 | Kocatürk 2012 — Turkish CU-Q2oL | ❌ No | ❌ No | Quality of life instrument; UAS peripheral |
| 14 | Kulthanan 2016 — Thai CU-Q2oL | ❌ No | ❌ No | Quality of life MCID study; UAS as anchor |
| 17 | Tavakol 2014 — Persian CU-Q2oL | ❌ No | ❌ No | Quality of life validation; UAS7 correlation peripheral |
| 18 | Hollis 2018 — Am J Clin Dermatol | ✅ Yes | ✅ Yes (10/10) | Weighted kappa 0.78–0.82 for UAS7 version comparison; n=614 |
| 22 | Jauregui 2019 — Health Qual Life Outcomes | ✅ Yes | ✅ Yes (9/10) | Test-retest ICC 0.84; Cronbach α 0.83; n=166 |

T9 redirect — Au 2023 (Sensors): Sensorised glove for AD scratching detection; ML accuracy 83–99%. Wearable sensor, no UAS scoring agreement data. Not relevant.

CRIT1-7 scoring — included papers​

| Criterion | Hollis 2018 | Jauregui 2019 |
|---|---|---|
| CRIT1 (UAS agreement/reliability data applicable to benchmarking α = 0.603) | 2 | 2 |
| CRIT2 (clinical setting — urticaria patients in clinical/trial context) | 2 | 2 |
| CRIT3 (population — chronic spontaneous urticaria, same condition as device use case) | 2 | 2 |
| Relevance subtotal | 6/6 | 6/6 |
| CRIT4 (level of evidence ≥ 4) | 1 | 1 |
| CRIT5 (quantitative agreement metric — kappa or ICC) | 1 | 1 |
| CRIT6 (validates UAS as instrument for clinical use) | 1 | 1 |
| CRIT7 (95% CIs reported) | 1 | 0 |
| Quality subtotal | 4/4 | 3/4 |
| Total weight | 10/10 | 9/10 |
| Include? | ✅ Yes | ✅ Yes |

CRIT7 note for Jauregui 2019: ICC = 0.84 reported without 95% CI in the abstract; scores 0.

Key data extracted​

Hollis et al. 2018 — Am J Clin Dermatol (10/10)​

Full citation: K. Hollis, C. Proctor, D. McBride, et al. "Comparison of Urticaria Activity Score Over 7 Days (UAS7) Values Obtained from Once-Daily and Twice-Daily Versions: Results from the ASSURE-CSU Study." Am J Clin Dermatol. 2018;19(2):267–274. DOI: 10.1007/s40257-017-0331-8.

Study design: ASSURE-CSU study data; N=614 CSU patients; twice-daily UAS7 (TD) vs once-daily UAS7-max (OD1MAX) and once-daily UAS7-average (OD2AVG); 5 severity score bands (0, 1–6, 7–15, 16–27, 28–42).

Key agreement results:

  • UAS7-TD vs UAS7-OD1MAX: weighted kappa κ = 0.78 (95% CI 0.75–0.82) — "substantial agreement"
  • UAS7-TD vs UAS7-OD2AVG: weighted kappa κ = 0.82 (95% CI 0.78–0.85) — "substantial agreement"
  • Pearson correlations: 0.94–0.99 across all version pairs

Benchmarking significance: Even in a large controlled trial, different UAS completion protocols yield κ = 0.78–0.82. This is the expected range for UAS scoring consistency in clinical contexts.
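The weighted kappa reported by Hollis et al. penalises disagreements by their distance between ordinal severity bands. The exact weighting scheme is not restated here, so the quadratic weights in this minimal sketch are an assumption:

```python
import numpy as np

def weighted_kappa(a, b, n_cat):
    """Cohen's weighted kappa with quadratic disagreement weights.
    `a` and `b` are paired category indices in range(n_cat)."""
    a, b = np.asarray(a), np.asarray(b)
    obs = np.zeros((n_cat, n_cat))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()                            # observed proportion matrix
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance expectation
    i, j = np.indices((n_cat, n_cat))
    w = ((i - j) / (n_cat - 1)) ** 2            # quadratic disagreement weights
    return 1 - (w * obs).sum() / (w * exp).sum()

# Identical band assignments -> kappa = 1.0
print(weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 5))  # 1.0
```

With the five UAS7 bands used by Hollis (0, 1–6, 7–15, 16–27, 28–42), off-by-one band disagreements are penalised far less than multi-band ones, which is what makes weighted kappa the natural statistic for banded severity scores.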

Jauregui et al. 2019 — Health Qual Life Outcomes (9/10)​

Full citation: I. Jauregui, A. Gimenez-Arnau, J. Bartra, et al. "Psychometric properties of the Spanish version of the once-daily Urticaria Activity Score (UAS) in patients with chronic spontaneous urticaria managed in clinical practice (the EVALUAS study)." Health Qual Life Outcomes. 2019;17(1):23. DOI: 10.1186/s12955-019-1087-z.

Study design: Observational prospective; N=166 CSU patients; Spanish UAS7 completed on 7 consecutive days at two visits 6 weeks apart; test-retest reliability assessed.

Key reliability results:

  • Internal consistency: Cronbach α = 0.83
  • Test-retest reliability: ICC = 0.84
  • Minimal important difference (MID): 7–8 points (0–42 scale)

Benchmarking significance: Test-retest ICC = 0.84 for the same patient under stable conditions — represents the ceiling of reproducibility for patient self-completed UAS7.
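Cronbach's α, reported by Jauregui et al. for internal consistency, follows directly from the ratio of summed item variances to total-score variance; a minimal sketch (the score matrix here is illustrative, not study data):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha. `items` is an (n_subjects, k_items) score matrix."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = x.sum(axis=1).var(ddof=1)     # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Perfectly correlated items (each daily score identical) -> alpha of 1.0,
# up to float rounding
scores = np.array([[1, 1, 1], [3, 3, 3], [5, 5, 5], [2, 2, 2]])
print(cronbach_alpha(scores))
```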

Answer — use in CER UAS severity section​

Key conclusion: No published study reports clinician-to-clinician inter-rater agreement for UAS. The UAS is a patient-reported outcome; reliability data reflect patient self-consistency (κ = 0.78–0.82, Hollis 2018; ICC 0.84, Jauregui 2019).

CER UAS α contextualisation:

  1. The device's Krippendorff α = 0.603 for UAS severity classification is moderate agreement, approaching the Landis and Koch "substantial" threshold (0.61–0.80).
  2. Published patient self-consistency for UAS7 under controlled conditions: κ = 0.78–0.82 (Hollis 2018). Real-world variability would be higher.
  3. The device scores a patient-reported outcome from clinical photographs — an intrinsically harder task than patient self-report. α = 0.603 is a reasonable baseline for this novel modality.
  4. No published clinician inter-rater UAS benchmark exists; α = 0.603 cannot be compared to an established clinical standard. It should be framed as a PMCF monitoring baseline, not a pass/fail criterion.
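The Landis and Koch interpretation referenced in point 1 is a fixed lookup over agreement values. A small helper makes the framing explicit; comparing at the bands' two-decimal granularity (so 0.603 reads as moderate, as stated above) is an interpretive choice, not part of the original Landis & Koch table:

```python
def landis_koch_band(value):
    """Landis & Koch (1977) interpretation bands for agreement statistics.
    Values are compared at two-decimal granularity, so 0.603 rounds to
    0.60 and falls in the 'moderate' band."""
    v = round(value, 2)
    if v < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if v <= upper:
            return label
    raise ValueError("agreement statistic cannot exceed 1.0")

print(landis_koch_band(0.603))  # moderate
print(landis_koch_band(0.78))   # substantial
```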

Downstream edit: Update the CER UAS severity section to cite Hollis 2018 (κ = 0.78–0.82 for UAS7 version consistency) and Jauregui 2019 (ICC = 0.84 for test-retest) as context; note that no clinician inter-rater UAS benchmark exists in the published literature; frame α = 0.603 as a PMCF baseline with trajectory monitoring commitment.


Downstream edits triggered​

Once all tasks are complete, the following documents will require edits. This table is updated as answers are finalised.

CER edits (R-TF-015-003)​

| Section | Triggered by | Status |
|---|---|---|
| Line 818 — melanoma AUC criterion (replace >= 0.80 with >= 0.85; cite MC_EVCDAO_2019 as sole melanoma study; cross-ref 91.99% aggregate malignancy AUC) | T1 | ⬜ Ready to edit |
| §6.5(e) acceptable gap — Fitzpatrick V–VI (hybrid Option A: Walker 2025, Dulmage 2021, Tepedino 2024; Option B: Liu 2023, Tjiu 2025, ASCORAD_2022, ViT) | T2, T8, T7 | ⬜ Ready to edit |
| Lines 1833–1834 — alopecia dermatologist sub-criteria (primary endpoint = pooled HCP met; dermatologist sub-analysis exploratory; range restriction artefact) | T3 | ⬜ Ready to edit |
| CER NMSC section + acceptance criteria derivation table — add Krakowski 2024, Marsden 2024, Sangers 2022, Gregor 2023, Chen 2024 as secondary SotA evidence | T4, T7 | ⬜ Ready to edit |
| IHS4 ICC justification — cite Wiala 2024 + Goldfarb 2021 (IHS4 ICC range 0.47 to >0.75 across studies; 0.727 within upper-moderate/high range) | T5, T7 | ⬜ Ready to edit |
| COVIDX utility criterion — cite Roca 2022, Mostafa 2022, Romero-Jimenez 2022 as SotA benchmarks for Clinical Utility Score >= 8 | T6 | ⬜ Ready to edit |
| Representativeness — phototype section (hybrid Option A + B; cite Tepedino 2024, Walker 2025, Dulmage 2021 + §6.5(e) Liu 2023, Tjiu 2025) | T8, T7 | ⬜ Ready to edit |
| Representativeness — pediatric section (hybrid: cite Yu 2025 AUC 0.91 + §6.5(e) field-wide gap; only 1 of 26 T9 papers qualified; PMCF commitment) | T9 | ⬜ Ready to edit |
| Gap 2 §6.5(e) declaration — severity Pillar 3 (cite Schaap 2022 ICC 0.58–0.79, Ali 2022 ICC 0.86–0.90 as SotA; 0.727 within range; §6.5(e) gap justified) | T10 | ⬜ Ready to edit |
| Gap 4 §6.5(e) declaration — autoimmune diseases (cite Mathur 2021 CNN 86.7% + Yu 2025 AUC 0.91; 2/105 results confirms field-wide gap) | T11 | ⬜ Ready to edit |
| UAS severity α contextualisation — cite Hollis 2018 (κ = 0.78–0.82) + Jauregui 2019 (ICC = 0.84) as context; frame α = 0.603 as PMCF baseline | T12 | ⬜ Ready to edit |

SotA edits (R-TF-015-011)​

| Section | Triggered by | Status |
|---|---|---|
| NMSC malignancy section — add Jones 2022, Jaklitsch 2023, Ferris 2025, Walton 2026 (T4) + Krakowski 2024, Marsden 2024, Sangers 2022, Gregor 2023, Chen 2024 (T7) | T4, T7 | ⬜ Ready to edit |
| IHS4 severity section — add Wiala 2024 (AUC 0.84–0.89; expert inter-rater ICC 0.68–0.78 as benchmark) | T5 | ⬜ Ready to edit |
| Teledermatology utility section — add Roca 2022, Mostafa 2022, Romero-Jimenez 2022 as benchmarks for Clinical Utility Score acceptance criterion | T6 | ⬜ Ready to edit |
| Phototype / skin diversity section — add Walker 2025, Dulmage 2021, Tepedino 2024 (no significant V–VI gap) + Tjiu 2025, Liu 2023, Groh 2024 (field-wide gap context) | T8, T7 | ⬜ Ready to edit |
| Pediatric AI dermatology section — add Yu 2025 (childhood vitiligo AUC 0.91, outperforms dermatologists AUC 0.77); note field-wide gap (1 qualifying paper in 26) | T9 | ⬜ Ready to edit |
| AI severity assessment section (non-IHS4) — add Schaap 2022 (PASI CNN, ICC 0.58–0.79), Ali 2022 (EASI/SCORAD smartphone, ICC 0.86–0.90), Seité 2019 (acne GEA AI, 68% agreement) | T10 | ⬜ Ready to edit |
| Autoimmune skin disease AI section — add Mathur 2021 (BP/urticaria CNN 86.7%), Yu 2025 (vitiligo AUC 0.91); note field-wide gap (only 2 qualifying papers in 105-paper search) | T11 | ⬜ Ready to edit |

### BSI response edits

| Section | Triggered by | Status |
| --- | --- | --- |
| Item 5 — NMSC_2025 appraisal (cite Jones 2022, Jaklitsch 2023, Ferris 2025, Walton 2026 as primary care context; explain 80% malignancy prevalence in NMSC_2025 as H&N surgery clinic; contextualise with SotA) | T4 | ⬜ Ready to edit |

### CEP edits (R-TF-015-001)

| Section | Triggered by | Status |
| --- | --- | --- |
| PMCF commitment — Fitzpatrick phototype stratification (add commitment to report performance by Fitzpatrick group; cross-ref §6.5(e) acceptable gap declaration in CER) | T2, T8 | ⬜ Ready to edit |
| PMCF commitment — pediatric proportion monitoring (add commitment to track pediatric case proportion; cross-ref §6.5(e) gap; baseline: 6.3% of current evidence base) | T9 | ⬜ Ready to edit |
| PMCF commitment — autoimmune disease monitoring (add commitment to monitor autoimmune disease case outcomes; cross-ref Gap 4 §6.5(e) in CER; baseline: 3% of current use cases) | T11 | ⬜ Ready to edit |
| PMCF commitment — UAS severity baseline (add commitment to track device UAS severity agreement trajectory; baseline: Krippendorff α = 0.603; reference ceiling: Hollis 2018 κ = 0.78–0.82) | T12 | ⬜ Ready to edit |
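The UAS commitment compares a Krippendorff α baseline (0.603) against a published κ ceiling (0.78–0.82). Both are chance-corrected agreement coefficients, and for two raters, complete data, and nominal categories Krippendorff's α closely tracks Cohen's κ. A minimal sketch of Cohen's κ for internal sanity-checking of reported figures (function name and sample ratings are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    assert len(rater_a) == len(rater_b), "raters must score the same cases"
    n = len(rater_a)
    # Observed proportion of cases where the raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note the key behaviour the CEP row relies on: two raters can agree on half of all cases yet score κ = 0 if that agreement is exactly what chance predicts, which is why raw percent agreement is never used as the acceptance benchmark.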
All the information contained in this QMS is confidential. The recipient agrees not to transmit or reproduce the information, neither by himself nor by third parties, through whichever means, without obtaining the prior written permission of Legit.Health (AI Labs Group S.L.)