Per-Epidemiological-Group Performance — Analysis Notes

Internal analysis record for the 2026-04-20 R-TF-028-* change-control entry. The regulatory-language prose for the audit-visible report lives in r-tf-028-006-aiml-release-report.mdx under "Per-Epidemiological-Group Performance"; this file is the internal write-up, kept for traceability and for the next reviewer who needs to reproduce or extend the work. It lives in docs/bsi-non-conformities/… and does not ship to BSI.

Question

Does the ICD classifier (current model version v27.5.1) perform comparably across epidemiological groups — in particular on the two sub-indications where Pillar 3 clinical evidence is thin (autoimmune dermatoses, genodermatoses)?

Dataset

Held-out test set from the standing V&V pipeline for the classifier:

36,321 test images, 346 ICD-11 categories
Cropped-image variant (manually-annotated bounding boxes, clinically-relevant signal)
Per-class probability distribution per image (six-decimal precision), aggregated binary indicators, entropy, ground-truth label

This is the same test set used for the aggregate Top-1/Top-3/Top-5 accuracy and the six binary-indicator AUCs reported in R-TF-028-006. No new data was collected for this sub-analysis; it is additive post-processing of existing V&V outputs under change control.

Method

Binary class-to-group mapping G ∈ {0,1}^(346 × 7), curated by the internal clinical team, with one row per ICD-11 class and one column per group. The seven column slots are autoimmune, genodermatosis, inflammatory, malignant_or_pigmented, infectious, benign_neoplastic and other. As of 2026-04-20 only the autoimmune and genodermatosis columns are populated; the remaining five columns are schema-reserved. Multi-group membership is permitted, so a row may carry more than one 1. The clinical rationale for each class assignment is recorded in class-group-map.md, including the post-methodological-review tightening that removed five borderline autoimmune classes (SJS/TEN, EM, necrobiosis lipoidica, Sweet syndrome, pyoderma gangrenosum) and five borderline genodermatosis classes (café-au-lait spot, lymphatic malformation, three developmental vascular malformations).
Aggregated group probability per image: P_grouped = P · G (mirrors the existing aggregation used for binary-indicator post-processing).
For each group j:
- Class-level Top-K accuracy (K ∈ {1, 3, 5}): on images whose ground-truth class belongs to group j, the image is a hit at Top-K if any of its top-K class indices (argsort of P descending) also belongs to group j.
- Group-level Top-1 accuracy: argsort the seven aggregated group probabilities per image; hit if group j is the top-1 group.
- AUC: roc_auc_score(y = 1[label ∈ group j], score = P_grouped[:, j]) — one-vs-rest, same form as the existing malignancy AUC.
Accuracy 95 % CI: Wilson score interval.
AUC 95 % CI: percentile bootstrap, 1000 draws, seed 42 (sklearn roc_auc_score on each resample).
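The aggregation and the two accuracy definitions above can be sketched on a toy problem as follows. This is a minimal illustration, not the contents of compute_group_metrics.py: the function name `group_accuracy`, the 4-class / 3-group toy matrices and the choice K ∈ {1, 3} are assumptions made for the example, and tie-breaking is left at NumPy's `argsort` default.

```python
import numpy as np

def group_accuracy(P, G, labels, j, ks=(1, 3, 5)):
    """Class-level Top-K and group-level Top-1 accuracy for group column j.

    P      : (n_images, n_classes) per-class probabilities
    G      : (n_classes, n_groups) binary class-to-group map
    labels : (n_images,) ground-truth class indices
    """
    G = np.asarray(G)
    P_grouped = P @ G                      # aggregated group probability per image
    in_group = G[labels, j] == 1           # images whose true class belongs to group j
    order = np.argsort(P, axis=1)          # ascending, so the top-K are the last K columns
    out = {"n": int(in_group.sum())}
    for k in ks:
        # Hit if any of the top-K predicted classes is mapped to group j.
        hit = (G[order[:, -k:], j] == 1).any(axis=1)
        out[f"class_top{k}"] = float(hit[in_group].mean())
    top_group = np.argsort(P_grouped, axis=1)[:, -1]
    out["group_top1"] = float((top_group[in_group] == j).mean())
    return out

# Toy data (hypothetical): 4 classes, 3 group slots, third slot reserved (all zeros).
G = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 0]])
P = np.array([[0.50, 0.25, 0.15, 0.10],
              [0.10, 0.15, 0.30, 0.45],
              [0.04, 0.06, 0.80, 0.10]])
labels = np.array([1, 0, 2])

metrics = group_accuracy(P, G, labels, j=0, ks=(1, 3))
print(metrics)
```

In the toy data, two images belong to group 0. The second one misses at K = 1 (its top class is unmapped) and hits at K = 3, and its aggregated score for group 0 (0.25) loses to group 1 (0.30), so both class-level Top-1 and group-level Top-1 come out at 0.5 while class-level Top-3 is 1.0.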

Group-level Top-3 and Top-5 are trivially 1.000 when only two of seven groups are populated; they are not reported in the audit-visible tables until at least four groups are populated (scheduled for the next R-TF-028-* change-control update). They remain in per-group-metrics.csv for schema completeness.
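The triviality of group-level Top-3 with two populated columns can be checked on synthetic scores. The sketch below assumes every image has a strictly positive score in both populated columns; an image whose group score is exactly zero would tie with the reserved columns, and the outcome would then depend on tie-breaking.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_groups = 1000, 7

# Synthetic group scores: columns 0 and 1 populated, columns 2..6 reserved (all zero).
P_grouped = np.zeros((n_images, n_groups))
P_grouped[:, :2] = rng.uniform(0.01, 0.5, size=(n_images, 2))

# Indices of the three largest group scores per image.
top3 = np.argsort(P_grouped, axis=1)[:, -3:]

# Fraction of images for which each populated group appears in the top three.
top3_rate = {j: float((top3 == j).any(axis=1).mean()) for j in (0, 1)}
print(top3_rate)  # {0: 1.0, 1: 1.0}
```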

Sanity anchors (reproduce the published aggregates before trusting new numbers)

Anchor	Recomputed	Published	|Δ|
Aggregate Top-1	0.6605	0.658	0.0025
Aggregate Top-3	0.8243	0.821	0.0033
Aggregate Top-5	0.8676	0.864	0.0036
Binary AUC — Malignant	0.9182	0.918	0.0002
Binary AUC — Pre-malignant	0.8769	0.878	0.0011
Binary AUC — Associated with Malignancy	0.8654	0.863	0.0024
Binary AUC — Pigmented Lesion	0.9582	0.959	0.0008
Binary AUC — Urgent Referral	0.9044	0.900	0.0044
Binary AUC — High-Priority Referral	0.8887	0.888	0.0007

All anchors within the ±0.01 tolerance set before the run. Additionally, the argsort implementation was verified on a random 1 % subsample: the model's pred column (official Top-1) reproduces exactly from np.argsort(P, axis=1)[:, -1] (0 mismatches out of 363). The pipeline is therefore trusted for the per-group derivations below.
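The two checks described above (anchor tolerance and argsort reproduction of the Top-1 column) could look roughly like this. The dictionary keys, the helper name `count_top1_mismatches` and the random stand-in for the 1 % subsample are assumptions for illustration; only the numeric anchor values and the 0.01 tolerance come from this file.

```python
import numpy as np

TOLERANCE = 0.01  # tolerance fixed before the run

# (recomputed, published) pairs copied from the anchor table; key names are hypothetical.
anchors = {
    "top1": (0.6605, 0.658),
    "top3": (0.8243, 0.821),
    "top5": (0.8676, 0.864),
    "auc_malignant": (0.9182, 0.918),
    "auc_pre_malignant": (0.8769, 0.878),
    "auc_associated_with_malignancy": (0.8654, 0.863),
    "auc_pigmented_lesion": (0.9582, 0.959),
    "auc_urgent_referral": (0.9044, 0.900),
    "auc_high_priority_referral": (0.8887, 0.888),
}
deltas = {name: abs(recomputed - published)
          for name, (recomputed, published) in anchors.items()}
out_of_tolerance = [name for name, d in deltas.items() if d > TOLERANCE]

def count_top1_mismatches(P, pred):
    """Rows where the argsort-derived Top-1 differs from the stored pred column."""
    return int((np.argsort(P, axis=1)[:, -1] != pred).sum())

# Synthetic stand-in for the subsample: 363 images, 346 classes, rows normalised to 1.
rng = np.random.default_rng(0)
P = rng.random((363, 346))
P /= P.sum(axis=1, keepdims=True)
pred = P.argmax(axis=1)

print(out_of_tolerance, round(max(deltas.values()), 4), count_top1_mismatches(P, pred))
```

The largest anchor deviation is the Urgent Referral AUC at 0.0044, which leaves every anchor inside the tolerance.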

Pre-specification of the success criterion

The AUC ≥ 0.80 threshold used below is the pre-specified binary-indicator acceptance criterion recorded in R-TF-028-002 AI Development Plan and already applied to the six binary-indicator AUCs in R-TF-028-006. It is inherited unchanged for the per-epidemiological-group analysis — not a post-hoc threshold set after observing results.

Results — autoimmune dermatoses (tightened mapping)

Classes mapped into the group: 38
Test images with ground truth in the group: N = 2,040

Metric	Point	95 % CI
Class-level Top-1 accuracy	0.626	0.605 – 0.647
Class-level Top-3 accuracy	0.820	0.803 – 0.836
Class-level Top-5 accuracy	0.891	0.876 – 0.903
Group-level Top-1 accuracy	0.960	0.951 – 0.968
AUC	0.948	0.941 – 0.954

Interpretation: autoimmune group AUC 0.948 discriminates autoimmune from non-autoimmune dermatoses above the pre-specified ≥ 0.80 threshold and sits within the range of the six binary-indicator AUCs on the same test set (0.863 – 0.959). Class-level Top-5 (0.891) is above the aggregate Top-5 (0.864). Group-level Top-1 (0.960) indicates that when an image is truly autoimmune, the autoimmune group score is the largest of the two populated group scores in the overwhelming majority of cases.
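As a cross-check on the accuracy intervals in the table above, the Wilson score interval can be recomputed from the reported point estimate and N. This sketch works from the rounded proportion (0.626) rather than the underlying hit count, which is not recorded in this file, so agreement is only expected to the third decimal; the function name `wilson_interval` is an assumption.

```python
from math import sqrt

def wilson_interval(p_hat, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 for 95 %)."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Autoimmune class-level Top-1: point 0.626 on N = 2,040 images.
lo, hi = wilson_interval(0.626, 2040)
print(f"{lo:.3f} - {hi:.3f}")  # 0.605 - 0.647
```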

Results — genodermatoses (tightened mapping)

Classes mapped into the group: 31
Test images with ground truth in the group: N = 391

Metric	Point	95 % CI
Class-level Top-1 accuracy	0.450	0.402 – 0.500
Class-level Top-3 accuracy	0.680	0.633 – 0.725
Class-level Top-5 accuracy	0.772	0.728 – 0.811
Group-level Top-1 accuracy	0.652	0.604 – 0.698
AUC	0.905	0.886 – 0.924

Interpretation: genodermatosis group AUC 0.905 discriminates genodermatoses from non-genodermatoses above the pre-specified ≥ 0.80 threshold and within the range of binary-indicator AUCs on the same test set. Class-level Top-K sits below the aggregate envelope (Top-1 0.450 vs aggregate 0.658; Top-3 0.680 vs aggregate 0.821; Top-5 0.772 vs aggregate 0.864). This is expected for a low-prevalence long-tail group where fine-grained class disambiguation is intrinsically hardest. The device's clinical output is the prioritised Top-5 differential view, and the AUC and group-level Top-1 are the quantities that map to how the classifier output contributes to clinical decision support; on those quantities the analytical-performance robustness claim is supported.

Pillar discipline

This sub-analysis is Pillar 2 (technical/analytical performance at the classifier API level) per MDCG 2020-1 §4.3. It does not involve clinicians interpreting device outputs and does not satisfy Pillar 3. The triangulation in task-3b5 uses this Pillar 2 evidence as one of five ingredients; it is additive to the existing Pillar 3 §4.4 evidence from the pivotal investigations, MAN_2025 (supporting MRMC, ../task-3b4-mrmc-dark-phototypes/), and the Rank-4 legacy RWE from R-TF-015-012 (../completed-tasks/task-3b2-3b3-legacy-rwe-study/). Do not let the prose drift toward substituting Pillar 2 for Pillar 3.

Wide CIs and small-N discipline

Genodermatoses N = 391 and autoimmune N = 2,040. Neither AUC CI crosses the 0.80 success criterion: the lower bounds are 0.886 (genodermatoses) and 0.941 (autoimmune). Both groups therefore pass the same acceptance criterion as the six binary indicators, and both AUC point estimates are within the range of the binary-indicator AUCs on the same test set. Disclose wide CIs honestly in the report; do not narrow the interval by changing the bootstrap seed or draw count.
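The AUC interval procedure (percentile bootstrap, 1000 draws, seed 42) can be sketched as below. Several things here are assumptions rather than a record of the actual run: `rank_auc` is a NumPy rank-based (Mann-Whitney) stand-in for sklearn's roc_auc_score so that the block has no sklearn dependency, `np.random.default_rng` may not be the generator the script uses, single-class resamples are simply redrawn, and the data are synthetic.

```python
import numpy as np

def rank_auc(y_true, score):
    """One-vs-rest AUC via average ranks (ties share a rank)."""
    y_true = np.asarray(y_true, dtype=bool).ravel()
    score = np.asarray(score, dtype=float).ravel()
    _, inverse, counts = np.unique(score, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0   # 1-based average rank per distinct score
    ranks = avg_rank[inverse]
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    return float((ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))

def bootstrap_auc_ci(y_true, score, n_draws=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the AUC, resampling images with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=bool)
    score = np.asarray(score, dtype=float)
    n = y_true.size
    aucs = []
    while len(aucs) < n_draws:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].all() or not y_true[idx].any():
            continue  # assumption: redraw resamples that contain a single class
        aucs.append(rank_auc(y_true[idx], score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Synthetic example: roughly 10 % positives, scores shifted upward for positives.
data_rng = np.random.default_rng(0)
y = data_rng.random(400) < 0.1
s = data_rng.normal(loc=y.astype(float), scale=1.0)

point = rank_auc(y, s)
ci = bootstrap_auc_ci(y, s)
print(round(point, 3), tuple(round(v, 3) for v in ci))
```

With the seed and draw count fixed as defaults, repeated calls return the same interval, which is what makes the reported CI reproducible and makes any change to either parameter visible in review.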

Mapping-tightening delta

First-draft mapping: 42 autoimmune / 36 genodermatosis. After methodological review (recorded in class-group-map.md):

Removed from autoimmune: Stevens-Johnson / TEN, erythema multiforme, necrobiosis lipoidica, Sweet syndrome, pyoderma gangrenosum. Rationale: immune-mediated but not classical autoimmune; see class-group-map.md exclusions section.
Removed from genodermatosis: café-au-lait spot, lymphatic malformation, developmental capillary vascular malformations (LC50), developmental venous malformations (LC51), cutaneous vascular malformation (EF2Z). Rationale: non-specific sign (CAL) or developmental/congenital rather than inherited genodermatosis.

Effect on metrics: autoimmune AUC 0.946 → 0.948, genodermatosis AUC 0.899 → 0.905. Both moved up slightly (by 0.002 and 0.006 respectively), consistent with the tightening having removed edge cases that are harder to classify from the image.

Propagation

R-TF-028-006 AI Release Report — new subsection "Per-Epidemiological-Group Performance (sub-analysis)" after the Binary Indicators Performance Summary. Tables as above, plus a change-control entry referencing the clinical-review-triggered improvement. Aggregate metrics unchanged.
task-3b5 Ingredient 2 — populated with the same numbers and the interpretive sentence; framed as additive Pillar 2 contribution to the triangulation, not a substitute for Pillar 3.
Class-to-group mapping matrix to be handed to ML engineering lead for integration into the default V&V pipeline going forward.

Reproducibility

Deterministic run (compute_group_metrics.py, argsort + matrix product + sklearn roc_auc_score). Bootstrap seed fixed. Run summary persisted in run-summary.json. The script, class map, and mapping matrix together re-derive every number in this file.
