Per-Epidemiological-Group Performance — Analysis Notes

Internal analysis record for the 2026-04-20 R-TF-028-* change-control entry. The regulatory-language prose for the audit-visible report lives in r-tf-028-006-aiml-release-report.mdx under "Per-Epidemiological-Group Performance"; this file is the internal write-up, kept for traceability and for the next reviewer who needs to reproduce or extend the work. It lives in docs/bsi-non-conformities/… and does not ship to BSI.

Question

Does the ICD classifier (current model version v27.5.1) perform comparably across epidemiological groups — in particular on the two sub-indications where Pillar 3 clinical evidence is thin (autoimmune dermatoses, genodermatoses)?

Dataset

Held-out test set from the standing V&V pipeline for the classifier:

  • 36,321 test images, 346 ICD-11 categories
  • Cropped-image variant (manually-annotated bounding boxes, clinically-relevant signal)
  • Per-class probability distribution per image (six-decimal precision), aggregated binary indicators, entropy, ground-truth label

This is the same test set used for the aggregate Top-1/Top-3/Top-5 accuracy and the six binary-indicator AUCs reported in R-TF-028-006. No new data was collected for this sub-analysis; it is additive post-processing of existing V&V outputs under change control.

Method

  1. Binary class-to-group membership matrix G ∈ {0,1}^(346 × 7), curated by the internal clinical team. It has seven column slots (autoimmune, genodermatosis, inflammatory, malignant_or_pigmented, infectious, benign_neoplastic, other). As of 2026-04-20 only the autoimmune and genodermatosis columns are populated; the remaining five columns are schema-reserved. The clinical rationale for each class assignment is recorded in class-group-map.md, including the post-methodological-review tightening that removed five borderline autoimmune classes (SJS/TEN, EM, necrobiosis lipoidica, Sweet syndrome, pyoderma gangrenosum) and five borderline genodermatosis classes (café-au-lait spot, lymphatic malformation, three developmental vascular malformations). Multi-group membership is permitted, so a row of G may contain more than one 1.
  2. Aggregated group probability per image: P_grouped = P · G (mirrors the existing aggregation used for binary-indicator post-processing).
  3. For each group j:
    • Class-level Top-K accuracy (K ∈ {1, 3, 5}): on images whose ground-truth class belongs to group j, the image is a hit at Top-K if any of its top-K class indices (argsort of P descending) also belongs to group j.
    • Group-level Top-1 accuracy: argsort the seven aggregated group probabilities per image; hit if group j is the top-1 group.
    • AUC: roc_auc_score(y = 1[label ∈ group j], score = P_grouped[:, j]) — one-vs-rest, same form as the existing malignancy AUC.
  4. Accuracy 95 % CI: Wilson score interval.
  5. AUC 95 % CI: percentile bootstrap, 1000 draws, seed 42 (sklearn roc_auc_score on each resample).
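
Steps 1 to 3 (without the AUC) can be sketched as follows. This is a minimal illustration and not the contents of compute_group_metrics.py: the function name `group_metrics`, its signature, and the toy arrays are assumptions introduced here, and the toy data is not drawn from the test set.

```python
import numpy as np

def group_metrics(P, G, labels, j, ks=(1, 3, 5)):
    """Class-level Top-K and group-level Top-1 for group column j.

    P      : (n_images, n_classes) per-class probabilities
    G      : (n_classes, n_groups) binary class-to-group membership
    labels : (n_images,) ground-truth class index per image
    """
    P_grouped = P @ G                          # step 2: aggregated group probability
    in_group = G[labels, j].astype(bool)       # images whose true class is in group j
    order = np.argsort(P, axis=1)[:, ::-1]     # class indices, most probable first
    metrics = {"n": int(in_group.sum())}
    for k in ks:
        # Hit if any of the top-k predicted classes belongs to group j.
        hit = G[order[:, :k], j].astype(bool).any(axis=1)
        metrics[f"class_top{k}"] = float(hit[in_group].mean())
    # Group-level Top-1: group j must have the largest aggregated probability.
    top_group = P_grouped.argmax(axis=1)
    metrics["group_top1"] = float((top_group[in_group] == j).mean())
    return metrics

# Toy data (hypothetical): 4 classes, 2 groups, 3 images.
G = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.1, 0.2, 0.4, 0.3],
              [0.1, 0.1, 0.7, 0.1]])
labels = np.array([0, 1, 2])
toy = group_metrics(P, G, labels, j=0, ks=(1, 3))
```

In the toy data two images belong to group 0. The second one is a class-level Top-1 miss but a Top-3 hit, and a group-level Top-1 miss because the aggregated score of group 1 (0.4) exceeds that of group 0 (0.3).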

Group-level Top-3 and Top-5 are trivially 1.000 when only two of seven groups are populated; they are not reported in the audit-visible tables until at least four groups are populated (scheduled for the next R-TF-028-* change-control update). They remain in per-group-metrics.csv for schema completeness.
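
The triviality of group-level Top-3 can be seen on synthetic scores. The sketch below assumes the two populated columns occupy slots 0 and 1 and that their aggregated probabilities are strictly positive; both are assumptions made for illustration, and the random values are not model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_groups = 1000, 7

# Only two of the seven group columns are populated; the rest stay at zero.
P_grouped = np.zeros((n_images, n_groups))
P_grouped[:, :2] = rng.uniform(0.001, 0.5, size=(n_images, 2))

# Top-3 groups per image, most probable first.
top3 = np.argsort(P_grouped, axis=1)[:, ::-1][:, :3]

# Share of images for which each populated group appears in the Top-3.
top3_hit_rate = [float((top3 == j).any(axis=1).mean()) for j in (0, 1)]
```

Because every unpopulated column scores exactly zero, both populated groups always rank in the first two positions, so the Top-3 (and Top-5) hit rate carries no information until more columns are filled.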

Sanity anchors (reproduce the published aggregates before trusting new numbers)

| Anchor | Recomputed | Published | Abs. Δ |
| --- | --- | --- | --- |
| Aggregate Top-1 | 0.6605 | 0.658 | 0.0025 |
| Aggregate Top-3 | 0.8243 | 0.821 | 0.0033 |
| Aggregate Top-5 | 0.8676 | 0.864 | 0.0036 |
| Binary AUC: Malignant | 0.9182 | 0.918 | 0.0002 |
| Binary AUC: Pre-malignant | 0.8769 | 0.878 | 0.0011 |
| Binary AUC: Associated with Malignancy | 0.8654 | 0.863 | 0.0024 |
| Binary AUC: Pigmented Lesion | 0.9582 | 0.959 | 0.0008 |
| Binary AUC: Urgent Referral | 0.9044 | 0.900 | 0.0044 |
| Binary AUC: High-Priority Referral | 0.8887 | 0.888 | 0.0007 |

All anchors within the ±0.01 tolerance set before the run. Additionally, the argsort implementation was verified on a random 1 % subsample: the model's pred column (official Top-1) reproduces exactly from np.argsort(P, axis=1)[:, -1] (0 mismatches out of 363). The pipeline is therefore trusted for the per-group derivations below.
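
The tolerance check on the anchors can be reproduced from the table values alone. The sketch below copies the recomputed and published figures from the sanity-anchor table; the dictionary keys and the helper name `check_anchors` are illustrative and do not come from the pipeline.

```python
# (recomputed, published) pairs copied from the sanity-anchor table.
ANCHORS = {
    "Aggregate Top-1": (0.6605, 0.658),
    "Aggregate Top-3": (0.8243, 0.821),
    "Aggregate Top-5": (0.8676, 0.864),
    "AUC Malignant": (0.9182, 0.918),
    "AUC Pre-malignant": (0.8769, 0.878),
    "AUC Associated with Malignancy": (0.8654, 0.863),
    "AUC Pigmented Lesion": (0.9582, 0.959),
    "AUC Urgent Referral": (0.9044, 0.900),
    "AUC High-Priority Referral": (0.8887, 0.888),
}
TOLERANCE = 0.01  # tolerance set before the run

def check_anchors(anchors, tol=TOLERANCE):
    """Return absolute differences and the subset that exceeds the tolerance."""
    deltas = {name: abs(rec - pub) for name, (rec, pub) in anchors.items()}
    failures = {name: d for name, d in deltas.items() if d > tol}
    return deltas, failures

deltas, failures = check_anchors(ANCHORS)
worst = max(deltas, key=deltas.get)  # anchor with the largest deviation
```

An empty `failures` dictionary corresponds to the statement that all anchors fall within tolerance; `worst` identifies the anchor closest to the limit.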

Pre-specification of the success criterion

The AUC ≥ 0.80 threshold used below is the pre-specified binary-indicator acceptance criterion recorded in R-TF-028-002 AI Development Plan and already applied to the six binary-indicator AUCs in R-TF-028-006. It is inherited unchanged for the per-epidemiological-group analysis — not a post-hoc threshold set after observing results.

Results — autoimmune dermatoses (tightened mapping)

  • Classes mapped into the group: 38
  • Test images with ground truth in the group: N = 2,040

| Metric | Point | 95 % CI |
| --- | --- | --- |
| Class-level Top-1 accuracy | 0.626 | 0.605 – 0.647 |
| Class-level Top-3 accuracy | 0.820 | 0.803 – 0.836 |
| Class-level Top-5 accuracy | 0.891 | 0.876 – 0.903 |
| Group-level Top-1 accuracy | 0.960 | 0.951 – 0.968 |
| AUC | 0.948 | 0.941 – 0.954 |

Interpretation: autoimmune group AUC 0.948 discriminates autoimmune from non-autoimmune dermatoses above the pre-specified ≥ 0.80 threshold and sits within the range of the six binary-indicator AUCs on the same test set (0.863 – 0.959). Class-level Top-5 (0.891) is above the aggregate Top-5 (0.864). Group-level Top-1 (0.960) indicates that when an image is truly autoimmune, the autoimmune group score is the largest of the two populated group scores in the overwhelming majority of cases.

Results — genodermatoses (tightened mapping)

  • Classes mapped into the group: 31
  • Test images with ground truth in the group: N = 391

| Metric | Point | 95 % CI |
| --- | --- | --- |
| Class-level Top-1 accuracy | 0.450 | 0.402 – 0.500 |
| Class-level Top-3 accuracy | 0.680 | 0.633 – 0.725 |
| Class-level Top-5 accuracy | 0.772 | 0.728 – 0.811 |
| Group-level Top-1 accuracy | 0.652 | 0.604 – 0.698 |
| AUC | 0.905 | 0.886 – 0.924 |

Interpretation: genodermatosis group AUC 0.905 discriminates genodermatoses from non-genodermatoses above the pre-specified ≥ 0.80 threshold and within the range of binary-indicator AUCs on the same test set. Class-level Top-K sits below the aggregate envelope (Top-1 0.450 vs aggregate 0.658; Top-3 0.680 vs aggregate 0.821; Top-5 0.772 vs aggregate 0.864). This is expected for a low-prevalence long-tail group where fine-grained class disambiguation is intrinsically hardest. The device's clinical output is the prioritised Top-5 differential view, and the AUC and group-level Top-1 are the quantities that map to how the classifier output contributes to clinical decision support; on those quantities the analytical-performance robustness claim is supported.
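
The Wilson score intervals reported for the accuracies can be spot-checked with the standard library. One assumption is flagged loudly: the exact hit counts are not given in this file, so the sketch back-calculates them as round(point × N) from the rounded point estimates. The helper name `wilson_interval` is illustrative.

```python
from math import sqrt

def wilson_interval(hits, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z for a 95 % interval)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# (label, reported point, N, reported CI) taken from the two results tables.
REPORTED = [
    ("autoimmune class-level Top-1", 0.626, 2040, (0.605, 0.647)),
    ("genodermatosis class-level Top-1", 0.450, 391, (0.402, 0.500)),
]

checks = {}
for label, point, n, (rep_lo, rep_hi) in REPORTED:
    hits = round(point * n)  # ASSUMPTION: hit count inferred from the rounded point estimate
    lo, hi = wilson_interval(hits, n)
    # Agreement within rounding of the three-decimal reported bounds.
    checks[label] = abs(lo - rep_lo) < 0.001 and abs(hi - rep_hi) < 0.001
```

Under that assumption both recomputed intervals agree with the reported bounds to within rounding, and the roughly five-fold difference in N shows up directly as the wider genodermatosis interval.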

Pillar discipline

This sub-analysis is Pillar 2 (technical/analytical performance at the classifier API level) per MDCG 2020-1 §4.3. It does not involve clinicians interpreting device outputs and does not satisfy Pillar 3. The triangulation in task-3b5 uses this Pillar 2 evidence as one of five ingredients; it is additive to the existing Pillar 3 §4.4 evidence from the pivotal investigations, MAN_2025 (supporting MRMC, ../task-3b4-mrmc-dark-phototypes/), and the Rank-4 legacy RWE from R-TF-015-012 (../completed-tasks/task-3b2-3b3-legacy-rwe-study/). Do not let the prose drift toward substituting Pillar 2 for Pillar 3.

Wide CIs and small-N discipline

The genodermatosis group is small (N = 391) compared with the autoimmune group (N = 2,040), and its intervals are correspondingly wider (AUC 95 % CI 0.886 – 0.924 vs 0.941 – 0.954). Neither AUC CI crosses the 0.80 success criterion: both lower bounds lie above it. Both groups therefore pass the same acceptance criterion as the six binary indicators, and both AUC point estimates are within the range of the binary-indicator AUCs on the same test set. Disclose wide CIs honestly in the report; do not narrow the interval by changing the bootstrap seed or draw count.
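
A percentile bootstrap with a fixed seed and draw count can be sketched as below. Several parts are assumptions rather than the pipeline's implementation: the AUC is computed with a rank-based (Mann-Whitney) formula instead of sklearn's roc_auc_score so that the example needs only NumPy, the random generator is `numpy.random.default_rng`, resamples containing a single class are skipped, and the data is synthetic.

```python
import numpy as np

def auc_rank(y, score):
    """ROC AUC via the Mann-Whitney U statistic, using average ranks for ties."""
    y = np.asarray(y, dtype=bool)
    score = np.asarray(score, dtype=float)
    _, inverse, counts = np.unique(score, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    avg_rank = cum - (counts - 1) / 2.0        # average 1-based rank per unique score
    ranks = avg_rank[inverse]
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    u = ranks[y].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

def bootstrap_auc_ci(y, score, n_draws=1000, seed=42, alpha=0.05):
    """Point AUC plus percentile-bootstrap interval; same seed gives same interval."""
    y = np.asarray(y, dtype=bool)
    score = np.asarray(score, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    while len(stats) < n_draws:
        idx = rng.integers(0, n, size=n)       # resample images with replacement
        if y[idx].all() or not y[idx].any():   # ASSUMPTION: skip one-class resamples
            continue
        stats.append(auc_rank(y[idx], score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auc_rank(y, score), float(lo), float(hi)

# Synthetic one-vs-rest example (hypothetical data, not the test set).
demo_rng = np.random.default_rng(0)
y_demo = np.zeros(300, dtype=bool)
y_demo[:40] = True
score_demo = demo_rng.normal(size=300) + 1.5 * y_demo
point, ci_lo, ci_hi = bootstrap_auc_ci(y_demo, score_demo)
```

Fixing `seed` and `n_draws` before looking at the results is what makes the interval reproducible; re-running with other values until the interval looks narrower would defeat the purpose of the disclosure rule stated above.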

Mapping-tightening delta

First-draft mapping: 42 autoimmune / 36 genodermatosis. After methodological review (recorded in class-group-map.md):

  • Removed from autoimmune: Stevens-Johnson / TEN, erythema multiforme, necrobiosis lipoidica, Sweet syndrome, pyoderma gangrenosum. Rationale: immune-mediated but not classical autoimmune; see class-group-map.md exclusions section.
  • Removed from genodermatosis: café-au-lait spot, lymphatic malformation, developmental capillary vascular malformations (LC50), developmental venous malformations (LC51), cutaneous vascular malformation (EF2Z). Rationale: non-specific sign (CAL) or developmental/congenital rather than inherited genodermatosis.

Effect on metrics: autoimmune AUC 0.946 → 0.948, genodermatosis AUC 0.899 → 0.905. Both rose slightly, consistent with the tightening having removed edge-case classes whose images are harder to classify.

Propagation

  1. R-TF-028-006 AI Release Report — new subsection "Per-Epidemiological-Group Performance (sub-analysis)" after the Binary Indicators Performance Summary. Tables as above, plus a change-control entry referencing the clinical-review-triggered improvement. Aggregate metrics unchanged.
  2. task-3b5 Ingredient 2 — populated with the same numbers and the interpretive sentence; framed as additive Pillar 2 contribution to the triangulation, not a substitute for Pillar 3.
  3. Class-to-group mapping matrix to be handed to ML engineering lead for integration into the default V&V pipeline going forward.

Reproducibility

Deterministic run (compute_group_metrics.py, argsort + matrix product + sklearn roc_auc_score). Bootstrap seed fixed. Run summary persisted in run-summary.json. The script, class map, and mapping matrix together re-derive every number in this file.

All the information contained in this QMS is confidential. The recipient agrees not to transmit or reproduce the information, neither by himself nor by third parties, through whichever means, without obtaining the prior written permission of Legit.Health (AI Labs Group S.L.)