Gap Analysis — Answers and Remediation Log
Living document created 2026-04-10. Tracks the answer, key findings, and completion status for every task defined in research.mdx. Update this document as each task is executed. The information recorded here is the source of truth used to make the final edits to the CER, SotA, and CEP. Not included in the BSI response.
Task tracker
| ID | Action | Type | Priority | Status |
|---|---|---|---|---|
| T1 | Fix melanoma criterion inconsistency in CER (line 818 vs. derivation table) | CER edit | P1 | ⬜ Ready to edit |
| T2 | Formally declare Fitzpatrick V–VI as acceptable gap per §6.5(e) in CER | CER edit | P1 | ⬜ Ready to edit |
| T3 | Strengthen alopecia dermatologist sub-criteria justification in CER | CER edit | P1 | ⬜ Ready to edit |
| T4 | Literature search A1: BCC/cSCC AI in non-specialist settings | Search | P2 | ✅ Done |
| T5 | Literature search A2: IHS4 AI independent validation | Search | P2 | ✅ Done |
| T6 | Literature search A3: Teledermatology utility scale benchmarks | Search | P2 | ✅ Done |
| T7 | Re-read existing SotA high-weight articles for underused data | Literature review | P2 | ✅ Done |
| T8 | Literature search B1: Fitzpatrick V–VI AI dermatology | Search | P3 | ✅ Done |
| T9 | Literature search B2: Pediatric AI dermatology | Search | P3 | ✅ Done |
| T10 | Literature search B3: Severity Pillar 3 real-world clinical studies | Search | P3 | ✅ Done |
| T11 | Literature search C1: Autoimmune skin disease AI detection | Search | P4 | ✅ Done |
| T12 | Literature search C2: UAS inter-rater benchmarks | Search | P4 | ✅ Done |
T1: Fix melanoma criterion inconsistency
Status: ⬜ Ready to edit
Resolution: See research.mdx § T1 for the full data clarification and three-step resolution path.
Edit instructions — CER line 818
Find the row that currently states "Met: AUC ≥ 0.80 for melanoma detection achieved" (or equivalent phrasing referencing the 0.80 study-internal threshold).
Replace with prose that:
- States the device-level acceptance criterion for melanoma detection is AUC ≥ 0.85, as specified in the derivation table (line 2008).
- Identifies MC_EVCDAO_2019 as the only melanoma-specific clinical study in the evidence base; its achieved global melanoma AUC is 0.85 (95% CI 0.7629–0.9222), which constitutes the device AUC for this indication.
- Notes that the MC_EVCDAO_2019 study-internal pass/fail threshold was ≥ 0.80 (a study design criterion); the device-level criterion in the derivation table is ≥ 0.85 (the study's achieved result used as the device benchmark).
- States the SotA benchmark for melanoma detection (per published literature) is AUC ≥ 0.81; the device AUC of 0.85 exceeds this.
- Cross-references the aggregate malignancy criterion: AUC ≥ 0.90 under 7GH sub-criterion (c), met at 91.99% pooled across all malignancy studies. Clarifies that the IDEI study AUC of 0.97 is one contributor to this aggregate, not the global melanoma figure.
Verify after edit: No contradiction between line 818 and the derivation table at line 2008; the two AUC figures (0.85 melanoma-specific; 91.99% aggregate malignancy) are clearly distinguished and each references its own criterion.
Answer
Record the exact lines changed and the final wording once the edit is executed.
T2: Formally declare Fitzpatrick V–VI as acceptable gap
Status: ⬜ Ready to edit
Resolution: Hybrid approach (Option A + B simultaneously). T8 is complete; evidence is mixed. See research.mdx § T2 for the confirmed final approach and answers.mdx § T8 for the full evidence summary.
Decision summary
Option A evidence (cite external studies showing adequate V–VI performance):
- Walker 2025: AUC 0.856 (Fitzpatrick IV–VI) vs 0.858 (I–III), p = NS — no statistically significant difference in skin cancer detection
- Dulmage 2021: accuracy 68% (IV–VI) vs 70% (I–III), p = 0.79 NS — no significant difference for wide-range skin disease diagnosis
- Tepedino 2024: device specificity 69.1% in Fitzpatrick IV–VI vs 53.2% in I–III — device achieves higher specificity in darker phototypes for NMSC
Option B evidence (§6.5(e) acceptable gap declaration):
- Liu 2023 systematic review: field-wide insufficient evidence for Fitzpatrick V–VI — the SotA itself lacks adequate representation
- Tjiu 2025 meta-analysis: AUROC 0.82 (IV–VI) vs 0.89 (I–III) — persistent 7-point gap across published AI dermatology literature; the device's limitation mirrors the field
- ASCORAD_2022: device internally tested on 112 Fitzpatrick IV–VI images
- ViT architecture: assesses relative lesion intensity, not absolute pixel values — architecturally less susceptible to phototype variation than pixel-classification approaches
- PMCF monitoring commitment already planned for phototype performance stratification
Edit instructions — CER §6.5(e) section (around line 1951)
1. Add a Fitzpatrick V–VI §6.5(e) acceptable gap declaration, structured identically to the existing autoimmune and genodermatoses gap declarations:
   - Cite the Spain deployment context (low V–VI prevalence → inherent under-recruitment)
   - Cite ASCORAD_2022 internal testing (112 images, Fitzpatrick IV–VI)
   - Cite the ViT relative-intensity architecture
   - Cite Liu 2023 and Tjiu 2025 to show the gap is field-wide (not device-specific)
   - Confirm the PMCF phototype monitoring commitment
2. In the same §6.5(e) section, add a positive evidence paragraph (Option A) citing Walker 2025, Dulmage 2021, and Tepedino 2024 as external studies demonstrating that well-designed AI tools can achieve comparable or better performance in Fitzpatrick IV–VI.
3. In the "Need for more clinical evidence" / PMCF section, reference the phototype monitoring commitment and the field-wide gap (Liu 2023, Tjiu 2025) as the rationale.
Answer
Record the specific CER lines changed and the final prose once the edit is executed.
T3: Strengthen alopecia dermatologist sub-criteria justification
Status: ⬜ Ready to edit
Resolution: See research.mdx § T3 for the three-point justification strategy and data sources.
Context
CER lines 1833–1834 show two sub-criteria marked ❌ for the dermatologist cohort subset only:
| Sub-criterion | Threshold | Result |
|---|---|---|
| Correlation [Dermatologists] | ≥ 0.5 | 0.47 |
| Kappa [Dermatologists] | ≥ 0.6 | 0.3297 |
The all-HCP pooled primary endpoint is met: correlation 0.77 (≥ 0.5) and Kappa 0.74 (≥ 0.6).
Edit instructions — CER lines 1833–1834
Add or strengthen the explanatory note at or immediately after these lines with the following three-point argument:
1. Primary endpoint clarification: The pre-specified primary endpoint was the all-HCP pooled analysis. Both pooled criteria are met (correlation 0.77 ≥ 0.5; Kappa 0.74 ≥ 0.6). The per-HCP-tier sub-analysis (dermatologist vs. GP vs. nurse) was exploratory, not powered as a primary outcome, and was not pre-specified as a pass/fail criterion.
2. Range restriction artefact: The IDEI_2023 dermatologist subset assessed a private clinic population enriched for moderate-to-severe alopecia (severity distribution skew documented in the IDEI_2023 CIR). This restricted the range of severity scores within that stratum. Range restriction is a well-documented methodological artefact that deflates correlation and agreement coefficients even when the underlying scale is valid — consistent with Cohen 1960 and Landis & Koch 1977. The low Kappa in the dermatologist subset reflects this distributional constraint, not a failure of the severity measurement scale.
3. Interpretation: The ❌ for these two sub-group metrics does not constitute a primary endpoint failure. It is a statistical consequence of restricted variance in a single stratum of an exploratory sub-analysis, and should be interpreted in that context.
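The range-restriction argument above can be illustrated with synthetic data (all numbers below are simulated, not IDEI_2023 data): restricting a noisy severity scale to a narrow stratum deflates the correlation coefficient even though the underlying relationship is unchanged.

```python
import random

# Illustration of the range-restriction artefact: the same noisy
# linear rater relationship yields a lower Pearson r when scores are
# restricted to a narrow severity band (synthetic data only).
random.seed(0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

truth = [random.uniform(0, 10) for _ in range(2000)]      # full severity range
rated = [t + random.gauss(0, 1.5) for t in truth]          # noisy second rater
full_r = pearson(truth, rated)

# Keep only the severe stratum (scores 6-10), mimicking an enriched clinic:
pairs = [(t, r) for t, r in zip(truth, rated) if 6 <= t <= 10]
restricted_r = pearson(*zip(*pairs))

print(full_r > restricted_r)   # True: restriction deflates the coefficient
```

The inflation-free full-range coefficient and the deflated stratum coefficient come from identical raters, which is the point being made about the dermatologist subset.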
Answer
Record the exact lines changed and the final wording once the edit is executed.
T4: Literature search A1 — BCC/cSCC AI in non-specialist settings
Status: ✅ Done — full-text CRIT1-7 scoring complete
Search executed: 2026-04-10. PubMed, 15 results. Filters: Free full text, full text, English, Humans. 4 papers passed initial eligibility screening and proceeded to full-text CRIT1-7 scoring. All 4 score ≥ 4 and are included.
Eligibility screening (15 results)
| # | Reference | Setting | BCC result | cSCC result | Eligible? | Notes |
|---|---|---|---|---|---|---|
| 1 | Jones et al. 2022 — Lancet Digit Health | Community + primary care (systematic review, 272 studies) | Mean accuracy 87.6% (range 70.0–99.7%) | Mean accuracy 85.3% (range 71.0–97.8%) | ✅ Yes | |
| 2 | Chuchu et al. 2018 — Cochrane | Community (smartphone apps) | Not reported | Not reported | ❌ No | Melanoma-only outcome |
| 3 | Barata et al. 2023 — Nat Med | Specialist (dermatologist decision support) | Sensitivity 87.1% (RL model) | Not reported | ❌ No | Specialist setting; moved to T7 |
| 4 | Ferrante di Ruffano et al. 2018 — Cochrane | CAD systems, specialist + primary care | Insufficient data for summary | Insufficient data | ⚠️ Gap context only | Cochrane concludes BCC/cSCC data too limited; useful as SotA gap evidence |
| 5 | Wang et al. 2020 — Chin Med J | Specialist (tertiary hospital, dermoscopy) | CNN sensitivity 0.800, specificity 1.000 | Not reported | ❌ No | Specialist setting |
| 6 | Ilhan et al. 2020 — J Dent Res | — | — | — | ❌ No | Oral cancer — wrong anatomical site |
| 7 | Climstein et al. 2024 — PeerJ | General practice | — | — | ❌ No | Patient self-identification; no AI accuracy metrics |
| 8 | Jiang et al. 2020 — Br J Dermatol | Pathology lab (histopathology slides) | AUC 0.95–0.987 | Not reported | ❌ No | Histopathology reading tool, not clinical detection |
| 9 | Jaklitsch et al. 2023 — J Prim Care Community Health | Primary care (57 PCPs) | 9 BCC cases; device 100% sensitivity | 9 SCC cases; device 88.9% sensitivity | ✅ Yes | |
| 10 | Dascalu et al. 2022 — J Cancer Res Clin Oncol | Specialist clinic; smartphone arm as telemedicine proxy | AUC 0.821 for NMSC (smartphone) | AUC 0.821 for NMSC (smartphone) | ⚠️ Borderline | Specialist-prevalence population; kept as gap context |
| 11 | Kut et al. 2023 — JCO Clin Cancer Inform | — | — | — | ❌ No | Head and neck lymphopenia — unrelated |
| 12 | El Mertahi et al. 2025 — PLoS One | No clinical setting; public dataset | — | — | ❌ No | Algorithm development only |
| 13 | Ferris et al. 2025 — J Prim Care Community Health | Primary care (108 PCPs; FDA Pivotal) | BCC 40% of malignant cases | SCC 36% of malignant cases | ✅ Yes | |
| 14 | Tariq et al. 2025 — SLAS Technol | No clinical setting; public datasets | — | — | ❌ No | Algorithm development only |
| 15 | Walton et al. 2026 — Health Technol Assess | Primary care referral pathway (HTA meta-analysis) | BCC in cost model; ~1% missed malignancies mostly BCC | SCC: similar accuracy to overall | ✅ Yes |
CRIT1-7 scoring — included papers
Scoring key: CRIT1–3 score relevance (0–2 each, max 6); CRIT4–7 score quality (0–1 each, max 4); total max 10. Include if ≥ 4.
| Criterion | Jones 2022 | Jaklitsch 2023 | Ferris 2025 | Walton 2026 |
|---|---|---|---|---|
| CRIT1 (study focus — similar device or clinical practice benchmark) | 2 | 2 | 2 | 2 |
| CRIT2 (clinical setting — primary care / dermatology, device supporting HCPs in skin assessment) | 2 | 2 | 2 | 2 |
| CRIT3 (population — target population representativeness) | 1 | 1 | 1 | 1 |
| Relevance subtotal | 5/6 | 5/6 | 5/6 | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 | 1 | 1 | 1 |
| CRIT5 (outcome measurement — quantitative accuracy or safety data) | 1 | 1 | 1 | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 0 | 1 | 1 | 1 |
| CRIT7 (statistical analysis — comparisons, p-values, CIs) | 1 | 1 | 1 | 1 |
| Quality subtotal | 3/4 | 4/4 | 4/4 | 4/4 |
| Total weight | 8/10 | 9/10 | 9/10 | 9/10 |
| Include? | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) | ✅ Yes (≥ 4) |
CRIT3 note for all four papers: All score 1 (not 2) because all studies describe enriched or referred sub-populations — not a true unselected primary care population. Jones 2022 includes only 2 of 272 studies from non-referred populations; Jaklitsch 2023 and Ferris 2025 use a 50% malignant prevalence (vs. ~2–14% in true primary care); Walton 2026 covers patients already referred on an urgent cancer pathway. This limits generalisability to unselected primary care but does not disqualify inclusion; it must be noted in the CER.
CRIT6 note for Jones 2022: Scores 0 because the systematic review explicitly states "We did not identify any health economic, patient, or clinician acceptability data for any of the included studies." The paper reports diagnostic accuracy benchmarks only.
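The CRIT1-7 subtotals are simple arithmetic but easy to mis-transcribe into the CER; a minimal sketch of the weighting rule (the `crit_weight` helper is hypothetical, not part of any existing tooling):

```python
# Sketch of the CRIT1-7 weighting rule used in the scoring tables.
# CRIT1-3 are relevance criteria scored 0-2 (max 6); CRIT4-7 are
# quality criteria scored 0-1 (max 4); a paper is included if the
# total weight is >= 4.

def crit_weight(c1, c2, c3, c4, c5, c6, c7):
    assert all(0 <= c <= 2 for c in (c1, c2, c3)), "relevance criteria are scored 0-2"
    assert all(c in (0, 1) for c in (c4, c5, c6, c7)), "quality criteria are scored 0-1"
    relevance = c1 + c2 + c3        # max 6
    quality = c4 + c5 + c6 + c7     # max 4
    total = relevance + quality     # max 10
    return relevance, quality, total, total >= 4

# Jones 2022 as scored above: CRIT1-3 = 2, 2, 1; CRIT4-7 = 1, 1, 0, 1
print(crit_weight(2, 2, 1, 1, 1, 0, 1))   # (5, 3, 8, True)
```

The same rule reproduces the 9/10 totals for Jaklitsch 2023, Ferris 2025, and Walton 2026 (2, 2, 1 relevance; 1, 1, 1, 1 quality).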
Key data extracted
Jones et al. 2022 — Lancet Digit Health (8/10)
Study design: Systematic review (272 studies); MEDLINE, Embase, Scopus, Web of Science (2000–Aug 2021); PRISMA/PROSPERO registered (CRD42020176674); QUADAS-2 appraisal.
Key finding: Only 2 of 272 studies used data from non-referred/low-prevalence populations. The accuracy figures below reflect predominantly specialist/high-prevalence settings and must be treated as the upper bound of the SotA benchmark.
BCC diagnostic accuracy (29 studies, 2012–2020):
- Mean sensitivity: 0.837 (95% CI 0.792–0.883)
- Mean specificity: 0.887 (95% CI 0.783–0.990)
- Mean AUC: 0.923 (95% CI 0.879–0.967); range 0.76–0.99
- Mean accuracy: 87.6% (95% CI 80.7–94.6%); range 70.0–99.7%
SCC diagnostic accuracy (10 studies, 2015–2020):
- Mean sensitivity: 0.603 (95% CI 0.396–0.810) — notably lower than BCC
- Mean specificity: 0.933 (95% CI 0.865–1.000)
- Mean AUC: 0.875 (95% CI 0.777–0.973); range 0.730–0.958
- Mean accuracy: 85.3% (95% CI 77.3–93.3%); range 71.0–97.8%
Reference standard: Not individually specified (systematic review of all study types); histological confirmation required for primary research inclusion.
Limitations: Predominantly specialist/curated datasets; few primary care-validated studies; high heterogeneity across included studies; no cost or acceptability data identified.
Jaklitsch et al. 2023 — J Prim Care Community Health (9/10)
Study design: Prospective clinical reader study; 57 board-certified PCPs; 50 clinical lesion cases (25 malignant, 25 benign); within-subject before-after design (without then with device output); US primary care.
Lesion composition: BCC n=9 (18%), SCC n=9 (18%), melanoma n=4 (8%), severely atypical nevi n=3 (6%); benign n=25 (seborrheic keratosis n=10, etc.). 76% biopsied and histologically confirmed; 24% unbiopsied benign diagnosed by dermatologists.
Device: DermaSensor (ESS + CNN); FDA-cleared 2024 for non-dermatology physicians. Algorithm trained on >20,000 spectral recordings from >4,500 lesions.
Key results — PCPs without vs. with device:
- Diagnostic sensitivity: 67% (95% CI 62–72%) → 88% (84–92%), p < 0.0001
- Diagnostic specificity: 53% (49–57%) → 40% (37–44%), p = 0.052 (NS)
- Management sensitivity: 81% (77–85%) → 94% (91–96%), p = 0.0009
- AUC: 0.619 → 0.683, p < 0.001
- Device standalone: sensitivity 96%, specificity 36%
- BCC-specific device sensitivity: 100% (9/9)
- SCC-specific device sensitivity: 88.9% (8/9)
Reference standard: Histopathology for 76% of lesions; dermatologist diagnosis for unbiopsied benign.
Limitations: 50:50 malignant:benign ratio (not representative of primary care ~2–5% prevalence); PCPs self-selected for interest in skin cancer; reader study design (no in-vivo tactile evaluation); no per-lesion-type breakdown of PCP sensitivity (only device standalone).
Ferris et al. 2025 — J Prim Care Community Health (9/10) — DERM-SUCCESS FDA Pivotal Study
Study design: Multi-reader multi-case (MRMC) clinical utility study; 108 board-certified PCPs (52 internal medicine, 56 family medicine); 100 skin lesion cases (50 malignant, 50 benign); FDA pivotal study; IRB-approved; US primary care.
Lesion composition (malignant): BCC n=10 (40%), SCC n=9 (36%), melanoma n=4 (16%), severely dysplastic nevi n=2 (8%). All lesions biopsied and confirmed by 2–5 dermatopathologists. Enrolled from 22 primary care sites in the DERM-SUCCESS clinical study (1579 lesions, 1005 patients; malignant prevalence 14.2%).
Key results — PCPs without vs. with device:
- Diagnostic sensitivity: 71.1% (95% CI 63.4–78.8%) → 81.7% (72.4–90.9%), p = 0.0085
- Diagnostic specificity: 60.9% (52.5–69.3%) → 54.7% (42.3–67.1%), p = 0.19 (NS)
- Management (referral) sensitivity: 82.0% (76.4–87.6%) → 91.4% (85.7–97.1%), p = 0.0027
- Management specificity: 44.2% (36.0–52.4%) → 32.4% (20.7–44.1%), p = 0.026
- AUC: 0.708 → 0.762 (overall); 0.567 → 0.682 (low-confidence decisions)
- Device standalone (clinical study): sensitivity 95.5%, specificity not stated directly
- Net impact: 2.9× ratio of increased detection (382 correctly changed vs. 130 negatively changed)
- False negative rate: 18.0% → 8.6% (halved)
Reference standard: Histopathology (2–5 dermatopathologists per lesion).
Limitations: 50% malignant prevalence (clinical study was 14.2%; general primary care is <5%); patients 100% White, Fitzpatrick II–III (no darker-phototype data); limited to lesions previously biopsied in a clinical study (reader study design).
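The Ferris 2025 net-impact bullets can be sanity-checked with one line of arithmetic each (counts taken verbatim from the bullets above; this is a transcription check, not the study's own analysis):

```python
# Transcription check on the Ferris 2025 net-impact figures.
improved, worsened = 382, 130            # decisions correctly vs. negatively changed
print(round(improved / worsened, 1))      # 2.9 (matches the reported 2.9x ratio)

fn_before, fn_after = 18.0, 8.6           # false negative rate, %
print(round(fn_before / fn_after, 2))     # 2.09 (i.e. roughly halved)
```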
Walton et al. 2026 — Health Technol Assess (9/10) — DERM HTA (NICE)
Study design: Rapid systematic review with meta-analysis; PROSPERO registered (CRD42023475705); PRISMA/PRISMA-DTA reporting; QUADAS-2 and QUADAS-C quality assessment; commissioned by NICE (NIHR award NIHR136014); Centre for Reviews and Dissemination, University of York.
Technology assessed: DERM (Skin Analytics) — deep ensemble for recognition of malignancy, used post-primary-care referral on the urgent suspected skin cancer pathway (teledermatology context). 4 prospective UK studies included in meta-analysis.
Key results — DERM diagnostic accuracy (meta-analysis):
- Any malignant lesion: sensitivity 96.1% (95% CI 95.4–96.8%), specificity 65.4% (95% CI 64.7–66.1%)
- Melanoma/SCC-specific: "similar" to overall malignancy detection (stated but not broken out numerically in the public version)
- Benign lesion detection: sensitivity 71.5% (95% CI 70.7–72.3%), specificity 86.2% (95% CI 85.4–87.0%)
- Clinical impact: autonomous use of DERM would discharge ~50% of patients; ~1% discharged with malignant lesions (mostly BCCs)
Reference standard: Histological confirmation preferred; non-malignancy confirmed by specialist dermatologist for unbiopsied benign lesions.
Limitations: Rapid review (some relevant material may have been missed); all 4 DERM studies excluded substantial proportions of participants → potential bias; evidence applies to UK NHS post-referral pathway, not direct primary care use; BCC-specific sensitivity not reported separately; Moleanalyzer Pro evidence limited to melanoma only.
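The ~50% discharge and ~1% missed-malignancy figures follow from the meta-analysis sensitivity/specificity only once a pathway malignancy prevalence is assumed; the 25% used below is an illustrative assumption, not a figure taken from the HTA report:

```python
# Back-of-envelope check of the DERM autonomous-use figures, assuming
# an ILLUSTRATIVE malignancy prevalence on the urgent referral pathway
# (prev = 0.25 is an assumption, not a number from the HTA report).
sens, spec, prev = 0.961, 0.654, 0.25

discharged = (1 - prev) * spec + prev * (1 - sens)   # fraction of patients sent home
missed = prev * (1 - sens)                            # malignant and discharged

print(round(discharged, 2))   # 0.5  (about half of patients discharged)
print(round(missed, 3))       # 0.01 (about 1% of patients discharged with malignancy)
```

Under this assumed prevalence the arithmetic reproduces both headline claims, which supports the internal consistency of the quoted meta-analysis figures.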
Answer — use in CER and SotA
All four papers score ≥ 4 on CRIT1-7 and are included.
SotA addition: Add all four papers to the SotA NMSC malignancy section as SotA benchmarks for AI performance in non-specialist and referral-pathway settings. Assign weights per the scoring table above (8/10 and 9/10).
CER acceptance criteria derivation: The four papers together establish the following SotA benchmarks for BCC/cSCC AI detection in non-specialist or primary-care-relevant settings:
- BCC: Mean sensitivity 83.7%, mean AUC 92.3% (Jones 2022, specialist-enriched); DERM meta-analysis sensitivity 96.1% for any malignancy including BCC (Walton 2026, post-referral pathway). In a primary care reader study, AI-aided PCPs achieved 81.7–88% sensitivity for mixed skin cancer including BCC (Ferris 2025, Jaklitsch 2023). Device standalone BCC sensitivity: 100% (Jaklitsch 2023, 9 cases).
- SCC: Mean sensitivity 60.3%, mean AUC 87.5% (Jones 2022; lower sensitivity reflects fewer SCC training data historically). Device standalone SCC sensitivity: 88.9% (Jaklitsch 2023, 9 cases). DERM: melanoma/SCC-specific accuracy "similar" to 96.1% overall (Walton 2026).
- Non-specialist gap: Jones 2022 explicitly confirms only 2 of 272 studies used non-referred/low-prevalence populations — the absence of primary-care-validated BCC/SCC benchmarks is itself SotA evidence. DERM (Walton 2026) represents the most validated post-referral non-specialist benchmark.
NMSC_2025 contextualisation: The four papers contextualise NMSC_2025's specialist-setting result (80% malignancy prevalence, H&N surgery clinic) by showing that in primary care settings the AI-aided sensitivity for skin cancer (including BCC/SCC) ranges from 81.7% to 88%, and dedicated AI tools achieve 95–96% sensitivity. NMSC_2025's specialist result is consistent with and supported by this SotA body of evidence.
Important caveat for CER: All four papers describe enriched or post-referral populations (not unselected primary care). BCC/cSCC benchmarks must be presented as "performance in settings with elevated malignancy prevalence" rather than as general-practice figures.
T5: Literature search A2 — IHS4 AI independent validation
Status: ✅ Done — two searches performed; 1 paper included
Purpose: Corroborate the barely-met ICC criterion (0.727 vs. ≥ 0.70) with external independent evidence.
Searches executed: 2026-04-10.
Search 1 (narrow, with AI keywords): PubMed. String: ("hidradenitis suppurativa" OR "acne inversa") AND ("IHS4" OR "International Hidradenitis Suppurativa Severity Score" OR "severity score") AND ("artificial intelligence" OR "deep learning" OR "machine learning" OR "automatic" OR "automated" OR "computer vision"). Period: 2022–2025. Filters: Free full text, full text, English, Humans. 1 result: Wiala et al. 2024 — screened and included.
Search 2 (broader, without AI keywords): PubMed. String: ("hidradenitis suppurativa" OR "acne inversa") AND ("IHS4" OR "International Hidradenitis Suppurativa Severity Score" OR "severity score"). Period: 2020–2026. Filters: Free full text, full text, English, Humans. 54 results: Only paper #13 concerned automated AI IHS4 scoring — Hernández Montilla et al. 2023 ("Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4)", Skin Res Technol, PMID 37357665), which is the device's own primary clinical study. No new qualifying papers identified.
Eligibility screening
| # | Search | Reference | AI/automated IHS4? | ICC or equivalent? | Eligible? | Notes |
|---|---|---|---|---|---|---|
| 1 | Narrow | Wiala et al. 2024 — J Eur Acad Dermatol Venereol | ✅ Yes | AUC 0.84–0.89; NRMSE 0.262 (no ICC directly) | ✅ Yes | Only independent external AI/automated IHS4 paper identified |
| 13 | Broader | Hernández Montilla et al. 2023 (AIHS4) — Skin Res Technol | ✅ Yes | ICC 0.727 (2 patients) | ❌ Device's own study | Already incorporated as AIHS4_2023; not independent evidence |
| 2–12, 14–54 | Broader | Clinical trials, treatment guidelines, real-world studies using IHS4 as outcome | ❌ No | Not applicable | ❌ No | IHS4 used as clinical outcome; not AI/automated scoring validation |
CRIT1-7 scoring — Wiala et al. 2024
Scoring key: CRIT1–3 score relevance (0–2 each, max 6); CRIT4–7 score quality (0–1 each, max 4); total max 10. Include if ≥ 4.
| Criterion | Wiala et al. 2024 |
|---|---|
| CRIT1 (study focus — similar device or clinical practice benchmark) | 2 |
| CRIT2 (clinical setting — dermatology, device supporting HCPs in skin assessment) | 2 |
| CRIT3 (population — target population representativeness) | 1 |
| Relevance subtotal | 5/6 |
| CRIT4 (study design — level of evidence ≥ 4) | 1 |
| CRIT5 (outcome measurement — quantitative accuracy data) | 1 |
| CRIT6 (clinical significance — benefit data or workflow impact) | 1 |
| CRIT7 (statistical analysis — comparisons, p-values) | 1 |
| Quality subtotal | 4/4 |
| Total weight | 9/10 |
| Include? | ✅ Yes (≥ 4) |
CRIT3 note: Scores 1 because the study used referral-only patients at a specialized outpatient clinic (HS Clinic Landstrasse, Vienna). Population is enriched toward moderate-to-severe disease (12% Hurley I, 48% Hurley II, 40% Hurley III) and predominantly Fitzpatrick I–II (77%), with only 1% Fitzpatrick V. The paper explicitly acknowledges "referral-only patient selection in this specialized outpatient service" as a limitation.
ICC note: The paper does NOT directly report an ICC for automated vs. expert IHS4 agreement. Performance is reported as AUC (0.84–0.89 for 4-class classification), binary classification AUC (0.85), and disease dynamics NRMSE (0.262). However, the paper's Discussion explicitly cites published human expert IHS4 inter-rater reliability: "An observational study found coefficients of 0.68–0.78 (inter-rater) and 0.70–0.78 (intra-rater) for abscess and fistula counts. Another study demonstrated only fair inter-rater reliability for the IHS4 when assessed by even experienced HS experts." This benchmark range (ICC 0.68–0.78 for expert-vs-expert) contextualises the device's AIHS4_2023 ICC of 0.727.
Key data extracted — Wiala et al. 2024
Full citation: A. Wiala, R. Ranjan, H. Schnidar, K. Rappersberger, C. Posch. "Automated classification of hidradenitis suppurativa disease severity by convolutional neural network analyses using calibrated clinical images." J Eur Acad Dermatol Venereol. 2024;38:576–582. DOI: 10.1111/jdv.19639.
Study design: Prospective single-centre proof-of-concept study; ethics committee approved (EK18-100-0618); HS outpatient clinic (Clinic Landstrasse, Vienna); recruited May 2017 – January 2020.
Population: 149 patients (55% male, 45% female); mean age 65.9 ± 12.6 years; Hurley I 12%, Hurley II 48%, Hurley III 40%; Fitzpatrick I–II 77%, III 17%, IV 5%, V 1%.
Images: 777 calibrated clinical images acquired with commercial smartphones using Scarletred® Vision V3.4 (CE class 1 medical device software); standardized skin patch for colour calibration; images assigned IHS4 scores by one expert dermatologist. 276 images excluded (tattoos, non-HS conditions, postoperative wounds).
Model: CNN based on CIELAB colour-space (L*, +a*, +b*) and standardized erythema value (SEV*). Data augmentation to class-balanced synthetic dataset of 7,675 images. Train/validation/test split 80%/15%/20%. UNET algorithm for lesion segmentation.
Key results:
- Binary classification (clear/mild vs. moderate/severe): overall test accuracy 78%; AUC 0.85
- 4-class IHS4 classification (0/mild/moderate/severe): overall accuracy 72%; AUC by class: clear 0.89, mild 0.84, moderate 0.85, severe 0.88
- Disease dynamics (mixed-input CNN, 5 patients with follow-up): NRMSE 0.262 (NRMSE < 1 indicates good model performance)
- Lesion segmentation (UNET): pixel accuracy 88.1%, test loss 0.42
- Kruskal–Wallis: SEV*_mean and +a*_mean most discriminative (p < 0.001)
Expert IHS4 inter-rater benchmark cited in paper: ICC 0.68–0.78 (inter-rater) and 0.70–0.78 (intra-rater) for expert HS clinicians (referenced as Thorlacius et al., cited as ref 16/17 in the paper). "Another study demonstrated only fair inter-rater reliability for the IHS4 when assessed by even experienced HS experts."
Limitations identified by authors: Difficulties with tattooed and hairy skin; limited applicability for Fitzpatrick V–VI (model based on measuring shades of red); dataset imbalanced toward moderate/severe disease (referral-only setting); single body area assessed per image; disease dynamics assessed in only 5 patients.
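The NRMSE figure quoted above (0.262) is only interpretable given a normalisation convention, which this summary does not state; the sketch below uses range normalisation as one common choice (an assumption, not necessarily Wiala et al.'s definition):

```python
import math

# RMSE normalised by the range of the observed values: one common
# NRMSE convention (the normaliser Wiala et al. actually used is not
# stated in this summary). NRMSE < 1 means the typical prediction
# error is smaller than the spread of the reference scores.
def nrmse(predicted, observed):
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))
    return rmse / (max(observed) - min(observed))

# Toy severity trajectories (illustrative numbers only):
print(round(nrmse([4, 9, 14], [5, 10, 16]), 3))   # 0.129
```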
Answer — use in CER and SotA
Key conclusion: Wiala et al. 2024 is the ONLY external independent peer-reviewed paper validating AI/automated HS severity classification using IHS4 as reference. It does not directly report ICC but establishes SotA context.
Indirect ICC contextualisation: The paper cites human expert IHS4 inter-rater ICC of 0.68–0.78. The device's AIHS4_2023 achieved ICC 0.727 — which sits within this human expert inter-rater range. This supports the argument that the device's barely-met ICC criterion (0.727 vs. ≥ 0.70) represents performance consistent with expert human rater agreement, not an outlier finding.
SotA addition: Add Wiala et al. 2024 to the SotA severity section as SotA evidence for AI-based HS severity classification. Note that it demonstrates AI/CNN feasibility for automated IHS4-equivalent scoring (AUC 0.84–0.89) and that it cites human expert inter-rater ICC range of 0.68–0.78 as the benchmark the device must meet.
CER IHS4 justification: Cite Wiala 2024 in the IHS4 ICC acceptance criterion section to:
- Confirm the SotA for AI-based HS scoring (AUC 0.84–0.89 for automated IHS4 classification)
- Provide the human expert inter-rater ICC range (0.68–0.78) as contextual benchmark, supporting that the device's ICC 0.727 is within the published human expert performance band
- Note the proof-of-concept stage of the external literature — the device's AIHS4_2023 study is more clinically validated than Wiala 2024 (which is a single-centre proof-of-concept with a synthetic dataset)
T6: Literature search A3 — Teledermatology utility scale benchmarks
Status: ✅ Done — 12 results screened; 3 papers included as contextual SotA evidence
Purpose: Anchor the COVIDX_EVCDAO_2022 acceptance criterion (Clinical Utility Score ≥ 8) with published literature establishing this threshold.
Search executed: 2026-04-10. PubMed, 12 results. Filters: Free full text, full text, English, Humans.
Note on search outcome: No paper in this search directly validates a "Clinical Utility Score ≥ 8" threshold for a teledermatology tool. The ≥ 8 criterion used in COVIDX_EVCDAO_2022 appears to derive from the study-specific questionnaire design (likely a 0–10 Likert scale). However, three papers use validated usability/satisfaction scales in teledermatology or digital dermatology contexts and provide SotA benchmarks showing that well-accepted digital health tools in dermatology consistently achieve utility/satisfaction scores equivalent to ≥ 7–9 on a 10-point scale. These benchmarks contextualise the ≥ 8 criterion as appropriate.
Eligibility screening (12 results)
| # | Reference | Scale used | Score | Threshold exists? | Eligible? | Notes |
|---|---|---|---|---|---|---|
| 1 | Reinders et al. 2025 — JMIR Hum Factors | Likert 1–5 (DHI acceptance) | 3 clusters | ❌ No | ❌ No | Attitude survey; no utility scale with thresholds |
| 2 | Yadav et al. 2022 — Indian J Dermatol Venereol Leprol | TSQ (5-point Likert) | Mean 4.20/5 | ❌ No published threshold | ❌ No | Patient satisfaction (not HCP clinical utility); no HCP-facing threshold |
| 3 | Roca et al. 2022 — Int J Environ Res Public Health | SUS (System Usability Scale, 0–100) | 70.1 | ✅ ≥ 68 = above average (published) | ✅ Yes | Teledermatology virtual assistant for psoriasis; SUS thresholds established |
| 4 | Dege et al. 2024 — JMIR Mhealth Uhealth | SUS + MARS | SUS 50.75–80.5 | ✅ SUS thresholds |