X-3: Disease categorisation decision

Internal working document

This page documents the team's decision on how to structure clinical evidence by disease category. It resolves the tension identified during the BSI clarification meeting (2026-03-25) and the internal debriefs. This decision is a prerequisite for Items 2b and 3a.

The tension

The device outputs a probability distribution over all visible dermatological ICD-11 classes. It never produces a binary positive/negative for any specific condition. This is the factual description of the device architecture, and it is also the intended use we claim.

However, MDR clinical evaluation requirements demand that evidence demonstrate performance for the device's clinical indications. Nick (BSI) stated explicitly that "the rules do not change" regardless of how the device frames its output. The evaluation must show that the probability distribution is clinically reliable per condition wherever clinical risk demands it.

This creates a tension: structuring evidence by disease category implicitly frames the device as "for diagnosing diseases," which conflicts with the ICD-11 probability distribution architecture. The question is whether and how to reconcile these two framings.

The resolution

The two framings operate at different levels and are not in conflict.

Level	Framing	Where it lives
Device description (what the device does)	ICD-11 probability distribution: the device ranks likelihoods across all visible dermatological ICD-11 categories. The physician sees a prioritised short-list, not a binary diagnosis.	CER § Device description, intended purpose
Clinical evaluation (how we assess the device)	Disease-category evidence: performance is assessed by clinical risk tier, with individual analysis for high-risk conditions and justified pooling for lower-risk categories.	CER § Analysis of clinical data, acceptance criteria, sufficiency justification

The device's architecture and intended use remain as described: a general classifier that outputs probability distributions. The clinical evaluation demonstrates that this output is reliable by assessing performance where it matters most, using a risk-proportionate, tiered evidence structure.
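The device-description framing above can be illustrated with a minimal sketch of how a probability distribution becomes a prioritised short-list. The function name `shortlist`, the cut-off `k`, and the example labels and probabilities are all hypothetical and chosen for illustration; they do not describe the device's actual implementation or output.

```python
from typing import Dict, List, Tuple


def shortlist(probabilities: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most likely categories, highest probability first.

    No category is ever reported as a binary positive or negative:
    the output is only a ranking of likelihoods.
    """
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical output for a single image (illustrative values only).
example = {
    "psoriasis": 0.46,
    "atopic dermatitis": 0.27,
    "tinea corporis": 0.15,
    "melanoma": 0.08,
    "urticaria": 0.04,
}

for label, probability in shortlist(example, k=3):
    print(f"{label}: {probability:.2f}")
```

The clinical evaluation then asks how reliable this ranking is for the conditions where clinical risk is highest, which is what the tiered structure below addresses.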

Regulatory basis for this approach

The following regulatory requirements and guidance documents support, and in some cases mandate, a condition-level or category-level evidence assessment:

Requirement	Source	What it demands
Sensitivity/specificity for major clinical indications	MEDDEV 2.7.1 Rev 4, Annex A7.3	Diagnostic devices must report performance metrics "for major clinical indications" specifically, not only as a pooled aggregate.
Separate VCA per claimed output	MDCG 2020-1, § Valid Clinical Association	"Each specific claimed output (diagnosis, severity grading, disease monitoring) requires separate VCA establishment." The device claims diagnostic support across ICD-11 categories; VCA must be established for representative categories, not only in aggregate.
Risk-based justification for pooling	MDCG 2020-6, Appendix III (evidence hierarchy)	Data pooled across conditions must have a risk-based justification for why pooling is appropriate. Without justification, pooled data is not acceptable for high-risk indications.
Narrow intended purpose if evidence insufficient	MDCG 2020-6, § 6.5(e)	If evidence is insufficient for an indication, the intended purpose must be narrowed, or the gap declared acceptable with PMCF.
Individual breakdown for high-risk conditions	BSI meeting, Nick (2026-03-25)	"For higher-risk conditions (melanoma, malignancies): an individual breakdown of acceptance criteria and evidence. BSI will specifically audit these."
Consider removing under-supported indications	BSI meeting, Nick (2026-03-25)	"Consider whether all indications are equally well-supported by data — and whether some should be removed from the claims if data is insufficient."
CEAR Section E: performance evaluation per indication	MDCG 2020-13, § E.2	The notified body assesses whether "the clinical performance endpoints are appropriate for each indication", implying per-indication scrutiny.

The three-tier evidence structure

Evidence is assessed at three tiers based on the clinical risk of misclassification:

Tier 1: Malignant conditions (individual analysis)

Clinical risk: Highest. Misclassification can delay cancer diagnosis, leading to disease progression and mortality.

Approach: Individual acceptance criteria per condition (melanoma) or per condition group (multiple malignant conditions). Dedicated studies provide primary evidence.

Evidence:

Condition/group	Primary study	Sample	Key metric
Melanoma	MC_EVCDAO_2019 (Cruces + Basurto)	105 patients, 36 melanoma	AUC 0.8482, Top-3 sensitivity 0.9032
Multiple malignant conditions	MC_EVCDAO_2019 + IDEI_2023 + DAO_O + DAO_PH + PH_2024 + SAN_2024	Melanoma, BCC, SCC, actinic keratosis across 6 studies	AUC 0.8983 (MC_EVCDAO), AUC 0.9669 (PH prospective)

Regulatory alignment: This tier satisfies Nick's explicit requirement for individual breakdown of high-risk conditions, MEDDEV A7.3's requirement for per-indication sensitivity/specificity, and MDCG 2020-6's prohibition on unsupported pooling for high-risk indications.
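For readers less familiar with the Top-3 metric in the table above, the sketch below shows one common way to compute a per-condition Top-k sensitivity from ranked outputs. We assume here that Top-3 sensitivity means the fraction of reference-positive cases in which the condition appears among the three highest-ranked categories; the study reports remain the authoritative source for the exact definition. The function name and the case data are hypothetical.

```python
from typing import List, Sequence


def top_k_sensitivity(
    rankings: Sequence[List[str]],
    truths: Sequence[str],
    target: str,
    k: int = 3,
) -> float:
    """Fraction of cases whose reference diagnosis is `target` and for which
    `target` appears among the k highest-ranked categories."""
    positives = [ranking for ranking, truth in zip(rankings, truths) if truth == target]
    if not positives:
        raise ValueError(f"no reference-positive cases for {target!r}")
    hits = sum(1 for ranking in positives if target in ranking[:k])
    return hits / len(positives)


# Hypothetical ranked outputs (most likely first) and reference diagnoses.
rankings = [
    ["melanoma", "naevus", "BCC", "SCC"],
    ["naevus", "melanoma", "seborrhoeic keratosis"],
    ["BCC", "SCC", "naevus", "melanoma"],
    ["SCC", "BCC", "melanoma"],
    ["BCC", "melanoma", "naevus"],
]
truths = ["melanoma", "melanoma", "melanoma", "melanoma", "BCC"]

print(top_k_sensitivity(rankings, truths, target="melanoma", k=3))
```

Computing the metric per target condition, rather than over the pooled case set, is what makes the Tier 1 breakdown individually auditable.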

Tier 2: Rare diseases (grouped analysis)

Clinical risk: Moderate-high. Rare diseases are frequently misdiagnosed; delayed diagnosis leads to prolonged suffering and inappropriate treatment. These conditions require specialist expertise that primary care physicians often lack.

Approach: Grouped analysis with dedicated acceptance criteria. The rare diseases subgroup is explicitly defined in the BI_2024 study protocol.

Evidence:

Conditions in subgroup	Study	Sample	Key metric
GPP, acne conglobata, palmoplantar pustulosis, subcorneal pustular dermatosis, AGEP, pemphigus vulgaris	BI_2024 (Boehringer Ingelheim)	15 HCPs × 100 images = 1,449 evaluations	Rare disease accuracy: +26.77% improvement (25.56% → 57.88%)
Pustular psoriasis, HS	PH_2024 (Puerta de Hierro)	9 PCPs × 30 images	Pustular psoriasis: +299.64% relative improvement; HS: +24.14%

Regulatory alignment: MDCG 2020-6 § 6.5(e) requires either sufficient evidence per indication or narrowing the intended purpose. This subgroup analysis demonstrates that the device significantly improves rare disease diagnosis. It is addressed as a dedicated sub-criterion within benefit 7GH, with its own acceptance criterion (absolute Top-1 accuracy >= 54%).

Tier 3: General conditions (pooled with risk-based justification)

Clinical risk: Lower. For the typical case, misranking within these non-malignant categories leads to delayed or modified treatment rather than acute mortality. In exceptional cases (e.g., untreated bacterial cellulitis progressing to sepsis), delayed treatment of non-malignant conditions can have serious consequences; however, the physician's independent clinical assessment, not the device output alone, determines the management pathway. The device is a decision-support tool, not a stand-alone diagnostic.

Approach: Pooled analysis across conditions with explicit risk-based justification. The pooled studies cover a representative sample of the epidemiological landscape, documented using a 7-category framework.

Risk-based justification for pooling:

Comparable clinical consequence of misclassification. Within non-malignant, non-rare categories, the typical clinical consequence of an incorrect ranking is delayed or modified treatment. While individual exceptions exist (e.g., untreated infectious conditions can occasionally progress to serious complications), the physician's independent clinical assessment, not the device output alone, determines the management pathway, providing a safety net that is absent in standalone diagnostic scenarios. This risk profile is fundamentally different from malignant conditions, where a missed diagnosis directly impacts mortality.
Device architecture supports pooling. The device outputs a probability distribution over all ICD-11 categories simultaneously. It does not make independent per-condition predictions; it ranks likelihoods across the full ICD-11 space. Assessing how well this ranking performs across the general dermatological spectrum is therefore a natural and valid evaluation approach.
Representative sampling across epidemiological categories. The pooled studies include conditions from all major epidemiological categories of dermatological disease (see coverage matrix below), ensuring that the pooled analysis reflects the breadth of conditions encountered in clinical practice rather than being limited to a single disease area. Individual condition prevalence within the pool is not uniform (conditions like acne and dermatitis are more heavily represented than rare infections), but the pool is not restricted to any one category.
Consistent architecture supports the expectation of consistent capability. The uniform algorithm architecture (Vision Transformer) processes all input images through the same feature extraction pipeline regardless of condition, so the processing path itself contains no condition-specific branches. While absolute performance varies by condition (as demonstrated in per-condition results tables), the architecture provides a technical basis for expecting that validated capability on representative conditions extends to other conditions within the same visual feature space. This is supporting evidence for pooling, not a guarantee of uniform performance.
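The three-tier assignment described in this section can be sketched as a simple lookup. The membership sets below are copied from the Tier 1 and Tier 2 evidence tables purely as an illustration; the study protocols and the CER are the authoritative source for which conditions belong to each tier, and the names `Tier` and `assign_tier` are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    INDIVIDUAL = 1  # Tier 1: malignant conditions, individual analysis
    GROUPED = 2     # Tier 2: rare diseases, grouped analysis
    POOLED = 3      # Tier 3: general conditions, pooled with justification


# Illustrative membership, taken from the evidence tables in this section.
TIER_1_CONDITIONS = {"melanoma", "bcc", "scc", "actinic keratosis"}
TIER_2_CONDITIONS = {
    "gpp",
    "acne conglobata",
    "palmoplantar pustulosis",
    "subcorneal pustular dermatosis",
    "agep",
    "pemphigus vulgaris",
}


def assign_tier(condition: str) -> Tier:
    """Map a condition name to its evidence tier; the highest-risk tier wins."""
    name = condition.strip().lower()
    if name in TIER_1_CONDITIONS:
        return Tier.INDIVIDUAL
    if name in TIER_2_CONDITIONS:
        return Tier.GROUPED
    return Tier.POOLED


for condition in ("Melanoma", "GPP", "Acne"):
    print(condition, "->", assign_tier(condition).name)
```

Checking Tier 1 first reflects the risk-proportionate intent: a condition is never pooled if it qualifies for individual analysis.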

Epidemiological framework: 7 categories of dermatological disease

To demonstrate that the clinical evidence portfolio representatively covers the full spectrum of dermatological conditions, we adopt the following epidemiological categorisation, based on the Global Burden of Disease Study (Karimkhani et al., 2017) and related prevalence literature:

Category	Approximate prevalence	Description
Infectious diseases	57%	Fungal (34%), bacterial (23%), viral infections
Other conditions	19%	Acne, alopecia, urticaria, and other common conditions
Inflammatory diseases	15%	Psoriasis, atopic dermatitis, hidradenitis suppurativa, eczema
Malignant and pre-malignant neoplasms	5%	Melanoma, BCC, SCC, actinic keratosis
Autoimmune diseases	3%	Lupus erythematosus, dermatomyositis, bullous diseases
Genodermatoses	1%	Epidermolysis bullosa, ichthyosis
Vascular conditions	1%	Haemangiomas, vascular malformations

Evidence coverage matrix

The following matrix shows which disease categories are represented in each clinical investigation:

Study	Infectious	Other	Inflammatory	Malignant	Autoimmune	Genodermatoses	Vascular
BI_2024	Impetigo, Tinea corporis	Acne (×3 variants)	GPP, dermatitis, psoriasis, HS, AGEP +4	—	Pemphigus vulgaris	—	—
PH_2024	—	Urticaria	Psoriasis (×2), HS	Melanoma, BCC, actinic keratosis	—	—	—
SAN_2024	Herpes, tinea, onychomycosis	Acne, alopecia, urticaria	Dermatitis, psoriasis	Melanoma	—	—	—
IDEI_2023	—	Androgenetic alopecia (96 pts)	—	Melanoma, BCC, SCC	—	—	—
MC_EVCDAO_2019	—	—	—	Melanoma (36), BCC (13), actinic K.	—	—	Angioma (5), haemangioma, angiokeratoma
AIHS4_2025	—	—	HS (severity)	—	—	—	—
COVIDX_2022	Folliculitis, herpes, tinea	Acne (67 pts), alopecia	Psoriasis, AD, HS, eczema, lichen planus, rosacea	Melanoma, BCC, SCC, actinic K.	—	—	Haemangioma (14)
DAO_O_2022	—	Alopecia	Psoriasis (×3), eczema (×3), AD	Melanoma (×4), BCC (×9), actinic K. (27)	Bullous pemphigoid (5)	—	Spider telangiectasis, pyogenic granuloma
DAO_PH_2022	Warts, molluscum, herpes	Urticaria	Psoriasis, AD, HS, lichen planus	BCC, SCC, melanoma	—	—	Angiomas
Coverage	4 studies	7 studies	7 studies	7 studies	2 studies	None	4 studies
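The "Coverage" row can be re-derived from the matrix itself. The sketch below encodes, for each study, the categories in which the matrix lists at least one condition, and counts studies per category. The variable names are ours; the study-to-category assignments are transcribed from the matrix above.

```python
from typing import Dict, Set

CATEGORIES = [
    "Infectious", "Other", "Inflammatory", "Malignant",
    "Autoimmune", "Genodermatoses", "Vascular",
]

# Categories with at least one listed condition, per study (from the matrix).
COVERAGE: Dict[str, Set[str]] = {
    "BI_2024": {"Infectious", "Other", "Inflammatory", "Autoimmune"},
    "PH_2024": {"Other", "Inflammatory", "Malignant"},
    "SAN_2024": {"Infectious", "Other", "Inflammatory", "Malignant"},
    "IDEI_2023": {"Other", "Malignant"},
    "MC_EVCDAO_2019": {"Malignant", "Vascular"},
    "AIHS4_2025": {"Inflammatory"},
    "COVIDX_2022": {"Infectious", "Other", "Inflammatory", "Malignant", "Vascular"},
    "DAO_O_2022": {"Other", "Inflammatory", "Malignant", "Autoimmune", "Vascular"},
    "DAO_PH_2022": {"Infectious", "Other", "Inflammatory", "Malignant", "Vascular"},
}

# Number of studies that include each category.
study_counts = {
    category: sum(1 for covered in COVERAGE.values() if category in covered)
    for category in CATEGORIES
}

for category, count in study_counts.items():
    print(f"{category}: {count} studies")
```

Keeping the matrix in a structured form like this makes it straightforward to re-check the summary row whenever a study is added or reclassified.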

Coverage assessment

Category	Coverage strength	Assessment
Infectious (57%)	Moderate	Present in 4 studies with bacterial (impetigo), fungal (tinea, onychomycosis), and viral (herpes, warts, molluscum) conditions. Per-condition sample sizes are small (2–10 images in MRMC studies), but COVIDX and DAO_PH include infectious conditions in real-world clinical settings. All three infection subtypes (bacterial, fungal, viral) are represented across the portfolio.
Other (19%)	Strong	Represented in 7 studies. Acne is well-represented (67 patients in COVIDX alone, plus multiple MRMC studies). Androgenetic alopecia has dedicated evidence (IDEI_2023 with 96 patients). Urticaria is represented in PH_2024, SAN_2024, and DAO_PH_2022.
Inflammatory (15%)	Strong	Represented in 7 of 9 studies. Psoriasis (multiple subtypes), AD, HS, eczema, lichen planus, rosacea, and AGEP all covered. GPP has dedicated acceptance criteria (BI_2024 primary objective). HS has dedicated severity scoring (AIHS4_2025).
Malignant (5%)	Strong	Dedicated study (MC_EVCDAO: 105 patients, 36 melanoma). Melanoma, BCC, SCC, and actinic keratosis across 7 studies. Individual acceptance criteria established. Tier 1 analysis.
Autoimmune (3%)	Weak	Pemphigus vulgaris (BI_2024, 5 images) and bullous pemphigoid (DAO_O, 5 cases). Note: pemphigus vulgaris is already counted in Tier 2 (rare diseases subgroup), so the autoimmune-specific evidence that is not already accounted for elsewhere is effectively bullous pemphigoid only (5 cases in one study). Declared acceptable gap; see below.
Genodermatoses (1%)	None	No study in the portfolio includes conditions classifiable as genodermatoses. Declared acceptable gap; see below.
Vascular (1%)	Adequate	Angiomas and haemangiomas across 4 studies, with 14 haemangioma patients in COVIDX alone. Unlike autoimmune conditions (3%), vascular coverage is not declared a gap because: (a) more studies and more patients provide evidence, (b) vascular lesions are predominantly benign with low clinical risk of misclassification, and (c) the 1% prevalence requires proportionately less evidence depth than the 3% autoimmune category.
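The share of presentations that falls in categories with direct evidence follows from the prevalence figures in the framework table and the gap status in the assessment above. The sketch below is a simple arithmetic check with our own variable names. Note that the approximate prevalence figures as listed sum to 101%, which we assume is a rounding effect in the source figures, so the covered and gap shares do not add up to exactly 100%.

```python
# Approximate prevalence per category, in percent (from the framework table).
PREVALENCE_PCT = {
    "Infectious": 57,
    "Other": 19,
    "Inflammatory": 15,
    "Malignant": 5,
    "Autoimmune": 3,
    "Genodermatoses": 1,
    "Vascular": 1,
}

# Categories marked "Declared acceptable gap" in the coverage assessment.
DECLARED_GAPS = {"Autoimmune", "Genodermatoses"}

covered_pct = sum(p for c, p in PREVALENCE_PCT.items() if c not in DECLARED_GAPS)
gap_pct = sum(p for c, p in PREVALENCE_PCT.items() if c in DECLARED_GAPS)
covered_categories = len(PREVALENCE_PCT) - len(DECLARED_GAPS)

print(f"Covered: {covered_categories} of {len(PREVALENCE_PCT)} categories, {covered_pct}%")
print(f"Declared gaps: {gap_pct}%")
print(f"Sum of listed figures: {sum(PREVALENCE_PCT.values())}%")
```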

Declared acceptable gaps

Per MDCG 2020-6 § 6.5(e), when evidence is insufficient for an indication, the manufacturer must either narrow the intended purpose or declare the gap acceptable with justification and address it via PMCF.

We declare the following gaps as acceptable and do not narrow the intended purpose:

Gap A: Autoimmune diseases (3% prevalence)

Gap: Two autoimmune conditions appear in the evidence portfolio: pemphigus vulgaris (5 images in BI_2024) and bullous pemphigoid (5 cases in DAO_O_2022). However, pemphigus vulgaris is already accounted for within the Tier 2 rare diseases subgroup analysis, where it contributes to the rare disease sub-criterion of benefit 7GH. The autoimmune-specific evidence that is not already counted elsewhere is therefore limited to bullous pemphigoid (5 cases in a single study). No dedicated study addresses autoimmune conditions as a group.

Why acceptable:

Autoimmune skin conditions represent only 3% of dermatological presentations.
The device's intended use is as a decision-support tool; the physician always makes the final diagnosis. For autoimmune conditions, which typically require serological confirmation beyond visual assessment, the device's role is triage and differential ranking, not definitive diagnosis.
No safety concern: misranking an autoimmune condition does not carry acute mortality risk comparable to malignancy. The typical clinical consequence is delayed referral to specialist care, not acute harm.
Supporting confidence is provided by demonstrated performance on inflammatory and other conditions that share visual features with autoimmune presentations (erythema, scaling, vesiculation), though this is indirect evidence and does not substitute for direct validation.

PMCF activity: Prospective data collection on autoimmune conditions in real-world deployment, with per-condition accuracy tracking.

Gap B: Genodermatoses (1% prevalence)

Gap: No study in the clinical evidence portfolio includes conditions classifiable as genodermatoses (epidermolysis bullosa, ichthyosis, etc.). This category has zero direct representation.

Why acceptable:

Genodermatoses represent approximately 1% of dermatological presentations, the lowest-prevalence category in the epidemiological framework.
These conditions are typically diagnosed through genetic testing, family history, and clinical history rather than image-based assessment alone. The device's role for these conditions is supportive (differential ranking to prompt further investigation), not definitive.
The extreme rarity of these conditions makes prospective study recruitment impractical for pre-market evidence. Requiring dedicated pre-market validation for a 1%-prevalence category would be disproportionate to the clinical risk, given that the physician always makes the final diagnosis.
Post-market monitoring will capture any genodermatoses cases encountered in real-world use, enabling retrospective performance assessment as deployment scales.

PMCF activity: Passive surveillance of genodermatoses cases through PMS/PMCF data collection. Active recruitment is not feasible given the 1% prevalence.
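Because each declared gap must be linked to a specific PMCF activity, the linkage can be kept in a small structured record and checked mechanically. The sketch below is one possible representation; the class and function names are hypothetical, and the activity descriptions are abbreviated from the two gap sections above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeclaredGap:
    gap_id: str
    category: str
    prevalence_pct: int
    pmcf_activity: Optional[str]  # None means no activity has been linked yet


DECLARED_GAPS = [
    DeclaredGap(
        gap_id="A",
        category="Autoimmune diseases",
        prevalence_pct=3,
        pmcf_activity="Prospective data collection with per-condition accuracy tracking",
    ),
    DeclaredGap(
        gap_id="B",
        category="Genodermatoses",
        prevalence_pct=1,
        pmcf_activity="Passive surveillance through PMS/PMCF data collection",
    ),
]


def unlinked_gaps(gaps: List[DeclaredGap]) -> List[str]:
    """Return the IDs of gaps that have no PMCF activity attached."""
    return [gap.gap_id for gap in gaps if not gap.pmcf_activity]


missing = unlinked_gaps(DECLARED_GAPS)
print("All gaps linked to a PMCF activity" if not missing else f"Unlinked gaps: {missing}")
```

An empty result from `unlinked_gaps` is the condition we want to hold before the PMCF plan is submitted.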

Where this decision affects the CER

The following CER sections must be updated to reflect this disease categorisation framework:

1. Data Pooling Methodology (current CER § "Data Pooling Methodology")

Current state: Generic statement that pooling is justified by "clinical comparability and homogeneity" with no risk-based reasoning.

Required update: Add the risk-based justification for pooling (the 4 points from the Tier 3 rationale above). Reference the 7-category epidemiological framework to demonstrate representative sampling. Replace the vague "homogeneity" claim with the explicit argument: comparable clinical consequence of misclassification within non-malignant categories + device architecture supports pooling + representative coverage demonstrated + consistent architecture as supporting (not conclusive) evidence.

2. Clarification on "Multiple conditions" (current CER § "Clarification on Multiple conditions")

Current state: States that "Multiple conditions" reflects "broad, representative inclusion aligned with diverse ICD-11 categories." This is too vague for BSI.

Required update: Replace with the 7-category framework. Show which categories are represented in which studies (the coverage matrix). Explain that "multiple conditions" encompasses conditions from 5 of 7 epidemiological categories (97% of presentations), with the remaining 2 declared as acceptable gaps. This transforms an assertion into demonstrated coverage with honest gap declaration.

3. Indication Coverage (current CER § "Justification of Sufficiency of Clinical Evidence", bullet 4)

Current state: References "anchor conditions": malignancy detection, chronic inflammatory diseases, and rare dermatological conditions. This is the current 3-tier language.

Required update: Expand to reference the 7-category framework and the coverage matrix. Add the declared acceptable gaps (autoimmune, genodermatoses) with justification. Show that the 3 tiers (malignant → individual, rare → grouped, general → pooled) are a deliberate risk-proportionate evidence assessment strategy, not an omission.

4. Acceptance Criteria Derivation from State of the Art (current CER § "Acceptance Criteria Derivation from State of the Art")

Current state: Acceptance criteria are presented by clinical domain (melanoma detection, diagnostic accuracy improvement, etc.), which is partially aligned with the tiered approach but not explicitly linked to the disease categorisation rationale.

Required update: Add introductory text explaining that acceptance criteria follow the 3-tier structure. Tier 1 (malignant) has condition-specific thresholds derived from SotA, addressed as sub-criterion (c) of benefit 7GH. Tier 2 (rare) has grouped thresholds justified by the evidence structure, addressed as sub-criterion (b) of benefit 7GH. Tier 3 (general) uses pooled thresholds justified by the risk-based pooling rationale, addressed as sub-criterion (a) of benefit 7GH. This makes the link between categorisation and acceptance criteria explicit and auditable.

5. Need for more clinical evidence / Gaps (current CER § "Need for more clinical evidence")

Current state: Declares 3 gaps (triage/prioritization, severity assessment, algorithmic stability). These are operational/performance gaps, not coverage gaps.

Required update: Add Gap A (autoimmune) and Gap B (genodermatoses) as declared acceptable evidence coverage gaps, with the justifications documented above. Link each to a specific PMCF activity. This satisfies MDCG 2020-6 § 6.5(e) and BSI's Item 6 requirement that PMCF activities be linked to identified gaps.

6. PMCF Plan

Impact: Two new PMCF activities must be added to address Gap A and Gap B. These feed directly into Item 6 (PMCF plan), which requires each activity to be linked to a specific gap.

Relationship to formal BSI items

BSI item	How X-3 feeds into it
Item 2a (device description)	The ICD-11 probability distribution framing remains the device description. No change to how the device is described, only to how the evidence is structured.
Item 2b (clinical benefits, SotA, acceptance criteria)	Acceptance criteria now follow the 3-tier structure. High-risk conditions get individual criteria. Pooled criteria have explicit risk-based justification. The 7-category framework demonstrates representative coverage.
Item 3a (clinical data analysis)	Clinical data analysis adopts the 3-tier structure. Per-study analyses reference which disease categories are covered. The coverage matrix becomes part of the sufficiency argument.
Item 3b (data sufficiency)	The declared gaps (autoimmune, genodermatoses) are formally documented with justification. Sufficiency is argued positively for 5 of 7 categories and declared acceptable with PMCF for 2.
Item 6 (PMCF plan)	Two new PMCF activities linked to the two declared gaps. This satisfies BSI's requirement that each PMCF activity be linked to a specific identified gap.

Known issue: "Level 1 and 2" evidence claim in CER executive summary

Fixed: 2026-03-28

The CER executive summary previously stated that the studies provide "high-quality clinical data (Level 1 and 2 according to the hierarchy of clinical evidence)." This claim was inconsistent with Nick's BSI meeting statement (MRMC studies are Rank 11 per MDCG 2020-6 Appendix III, not Level 1–2) and with the tiered evidence strategy in the same document.

The claim has been corrected in two CER locations (Quality bullet in executive summary, Conclusion on Sufficiency section). The replacement accurately characterises the portfolio by study type: MC_EVCDAO_2019 as analytical observational (Rank 2–4), real-world deployment studies as Rank 7–8, and MRMC simulated-use studies as Rank 11. The sufficiency argument is grounded in portfolio breadth and risk-proportionate design, not uniform high-level evidence.

A third misleading claim, "a high level of evidence based on the MDCG 2020-6 guidance" in the introductory paragraph (line 27), has also been removed and replaced with an accurate, non-committal description that defers to the tiered evidence section for detail.

Decision status

Decision	Status	Owner
Adopt 3-tier evidence structure (malignant → individual, rare → grouped, general → pooled)	Decided	Team (2026-03-28)
Use 7-category epidemiological framework as pooling justification	Decided	Team (2026-03-28)
Declare autoimmune and genodermatoses as acceptable gaps	Decided	Team (2026-03-28)
Update CER § Tiered evidence assessment strategy (was Data Pooling + Multiple conditions)	Done	Taig
Update CER § Evidence coverage by disease category (new section)	Done	Taig
Update CER § Indication Coverage (executive summary)	Done	Taig
Update CER § Coverage of Indications and Conditions (sufficiency justification)	Done	Taig
Update CER § Acceptance Criteria Derivation (introductory paragraph)	Done	Taig
Update CER § Need for more clinical evidence (gaps 4 & 5 added)	Done	Taig
Fix CER "Level 1 and 2" evidence claim (exec summary + sufficiency section)	Done	Taig
Fix ICD-11 2D41 misclassified as malignant (moved to uncertain/pre-malignant)	Done	Taig
Fix 96% → 97% arithmetic across CER and X-3	Done	Taig
Remove false "binary malignancy indicator" output claim from CER	Done	Taig
Add PMCF activities for gaps A & B to PMCF Plan	Done	Jordi (Item 6)
