MRMC cross-study comparison — BI_2024 · PH_2024 · SAN_2024 · MAN_2025

Scope: internal, audit-invisible. All four MRMC investigations that sit in the Plus technical file compared against the same endpoint template. Figures pulled from signed CIRs and, for MAN_2025, computed live from the locked de-identified dataset at apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/clinical/Investigation/man-2025/data/.

Last refresh: 2026-04-20 (MAN_2025 data lock 2026-04-17; BI/PH/SAN values from their signed CIRs).

Primary endpoint template (all four studies): paired top-1 diagnostic accuracy, HCP unaided vs HCP aided by the device, pre-specified acceptance criterion ≥10 pp absolute improvement, McNemar two-sided p<0.05.

Study design at a glance

Study	Purpose	CIP location	CIR location	Readers	Cases / image set	Paired obs	Design
BI_2024	Rare dermatological diseases	`Investigation/bi-2024/r-tf-015-005.mdx`	`Investigation/bi-2024/r-tf-015-006.mdx`	15 (11 PCP + 4 derm)	~100 curated images, rare-disease focus	1,449	Self-controlled MRMC, paired unaided→aided, remote web platform
PH_2024	Pigmented skin lesions / photographic	`Investigation/ph-2024/r-tf-015-005.mdx`	`Investigation/ph-2024/r-tf-015-006.mdx`	9	30 image sets, 8 diagnostic classes	~270	Self-controlled MRMC, paired unaided→aided
SAN_2024	General dermatology, mixed conditions	`Investigation/san-2024/r-tf-015-005.mdx`	`Investigation/san-2024/r-tf-015-006.mdx`	16 (10 PCP + 6 derm)	29 images × 16 readers (401 completed)	401	Prospective observational MRMC, remote web platform
MAN_2025	Fitzpatrick V–VI phototypes only (PMCF)	`Investigation/man-2025/r-tf-015-004.mdx`	`Investigation/man-2025/r-tf-015-006.mdx`	16 (primary) / 19 enrolled	149 curated atlas images (all FP V–VI)	2,376	Three-stage MRMC (unaided → aided → referral), self-controlled

Primary endpoint — paired top-1 accuracy improvement (pooled primary cohort)

Study	Unaided	Aided	Δ (pp)	McNemar p	Acceptance	Status
BI_2024	47.94%	63.06%	+15.12	<0.001	≥10 pp	PASS
PH_2024	63.70%	81.85%	+18.15	<0.001	≥10 pp	PASS
SAN_2024	68.08%	88.78%	+20.70	<0.0001	≥10 pp	PASS
MAN_2025	41.79%	65.07%	+23.27	≈1.0×10⁻¹⁰⁸ (χ² cc = 490.67)	≥10 pp	PASS

All four studies clear the ≥10 pp pre-specified bar. MAN_2025 has the largest improvement (+23.27 pp) and the lowest unaided baseline.

MAN_2025 computation provenance

The MAN_2025 numbers above are computed directly from the locked dataset in this repo:

# One-shot computation run from repo root on 2026-04-20
import json
base = 'apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/clinical/Investigation/man-2025/data'
subs = json.load(open(f'{base}/submissions.json'))     # 8542 rows, categories: diagnosis / assisted-diagnosis / referral
meta = json.load(open(f'{base}/meta.json'))            # qualifiedReaders list (16 in primary cohort)
cases = json.load(open(f'{base}/cases.json'))          # 149 cases

# Pair (readerCode, caseId) submissions for category ∈ {diagnosis, assisted-diagnosis}
# Compare each answer to case.correctCondition (exact-string)
# Result: 2376 paired obs · unaided 993/2376 (41.79%) · aided 1546/2376 (65.07%) · Δ +23.27 pp
# McNemar discordant pairs: b=34 (correct→incorrect) · c=587 (incorrect→correct) · χ² cc = 490.67

These numbers match what <Man2025PrimaryOutcomeTable /> and <Man2025AcceptanceCriteriaResultsTable /> render in the CIR at build time. The analytics live in apps/qms/src/components/Man2025/analytics.ts (deterministic, no network, unit-testable). The thresholds (≥10 pp, etc.) live once in packages/ui/src/components/PerformanceClaimsAndClinicalBenefits/performanceClaims.ts under studyId: "MAN_2025" rows (M2N, M2A, M2R). See apps/qms/src/components/Man2025/CLAUDE.md for the pipeline.

Secondary endpoints

Sensitivity / specificity (HCP decision, unaided → aided)

Study	Sensitivity	Δ	Specificity	Δ	Notes
BI_2024	52.61% → 71.04%	+18.43	56.45% → 75.83%	+19.38	Pooled across primary cohort
PH_2024	68.55% → 83.15%	+14.60	78.01% → 89.91%	+11.90	Pooled across 9 readers
SAN_2024	not separately reported — top-1 accuracy is the confirmatory endpoint	—	—	—	See SAN_2024 CIR §Secondary endpoints
MAN_2025	malignant-case accuracy 56.88% → 78.75% (n=160 paired obs on 10 malignant cases)	+21.88	—	—	Descriptive only — see §Stage 3 referral for referral-sensitivity

Specialty strata — paired top-1 accuracy, broken out by reader specialty

The four studies were designed with different reader mixes, so cross-study specialty comparison is not a like-for-like contrast, but the shape of the result is consistent: the lower the unaided baseline, the larger the Δ from device assistance, across every specialty and every study.

Dermatologists (attending + residents, where applicable)

Study	n readers	Paired obs	Unaided	Aided	Δ (pp)	Threshold	Status
BI_2024	4 (att.)	400	57.25%	65.65%	+8.39	≥5 pp	Directional, supportive (n=4 under-powered)
PH_2024	0	—	—	—	—	—	No dermatologists in PH_2024 panel
SAN_2024	6 (att.)	91	76.47%	86.93%	+10.50	≥5 pp	PASS
MAN_2025	9 (3 att. + 6 res.)	1,334	47.90%	64.24%	+16.34	— (pooled is the endpoint)	Exploratory; derm baseline is low because FP V–VI images are harder even for specialists

Primary care / general practitioners (physicians)

Study	n readers	Paired obs	Unaided	Aided	Δ (pp)	Threshold	Status
BI_2024	11	1,049	44.71%	61.71%	+17.00	≥10 pp	PASS
PH_2024	9	~270	63.70%	81.85%	+18.15	≥10 pp	PASS (PH_2024 is PCP-only by design)
SAN_2024	10	310	62.90%	89.92%	+27.00	≥10 pp	PASS
MAN_2025	4 (1 att. + 3 res.)	596	36.58%	64.93%	+28.36	— (pooled is the endpoint)	Exploratory; 4 primary-care readers in MAN_2025

Nursing (MAN_2025 only — CIP-eligible category)

Study	n readers	Paired obs	Unaided	Aided	Δ (pp)
MAN_2025	3	446	30.49%	67.71%	+37.22

Nursing was not admitted as a reader category in BI_2024, PH_2024 or SAN_2024 (those CIPs pre-date the CIP correction that explicitly admits licensed nurses with skin/wound scope). MAN_2025 is the first study in the programme to include them.

MAN_2025 by qualification tier (fully-qualified attendings + senior nurses vs MIR residents)

Tier	n readers	Paired obs	Unaided	Aided	Δ (pp)
Fully-qualified-target (attendings + senior nurses)	7	1,039	38.50%	69.59%	+31.09
Resident-target (MIR residents)	9	1,337	44.35%	61.56%	+17.20

Counter-intuitive at first glance: residents have a higher unaided baseline than attendings in MAN_2025. That's because the "fully-qualified" bucket mixes attending dermatologists, attending primary-care, and three licensed nurses with dermatology/wound scope — the nurse readers drag the unaided baseline down, and the device then pulls them up more than it pulls up the MIR residents. The CIR's sensitivity row reports this explicitly; the regulatory endpoint remains the pooled primary-cohort Δ of +23.27 pp.

What the per-specialty patterns say

Specialists converge aided. Dermatologists end up in a 64–87% aided-accuracy band across all three studies that recruited them. BI_2024 dermatologist aided (65.65%) and MAN_2025 dermatologist aided (64.24%) sit almost on top of each other despite the cohorts being entirely different — in BI it's rare pustular dermatoses on mostly European skin; in MAN it's common conditions on FP V–VI skin. The device plateau is similar.
PCPs benefit most in absolute terms. +17.00 (BI), +18.15 (PH), +27.00 (SAN), +28.36 (MAN) — every PCP stratum clears ≥10 pp with room to spare. MAN_2025 has the largest PCP Δ because the PCP baseline on FP V–VI is very low (36.58%).
Nursing is the strongest Δ of any stratum in the programme (+37.22 pp). That is directly regulatorily important: it supports the CEP claim that the device delivers value for the full intended-user population, not just physicians, and it supports Celine's Pillar-3 indirect-benefit causal chain (less-expert reader → more device benefit → better patient decision).
Cross-study specialty comparisons must be read carefully. The image sets differ; the conditions differ; reader experience distributions differ. You can compare shapes (derms always lowest Δ, PCPs always larger, nursing largest where present) but you cannot compare levels (SAN's 89.92% aided PCP accuracy on easy derm cases does not mean SAN's PCPs are better than MAN's PCPs; the cases are easier).

Data provenance for the MAN_2025 per-specialty rows

Computed from the locked dataset by bucketing qualified readers by readers.json → specialty:

Specialty (onboarding form)	Qualified readers in MAN_2025
`dermatology`	R-01, R-04, R-05, R-07, R-08, R-09, R-10, R-12, R-19 (n=9)
`general` (primary care)	R-02, R-13, R-14, R-18 (n=4)
`nursing`	R-06, R-16, R-17 (n=3)

Then the same (diagnosis, assisted-diagnosis) paired comparison as the primary endpoint, filtered to each specialty bucket. The code for this lives in apps/qms/src/components/Man2025/analytics.ts (filterReaders + computePaired); see apps/qms/src/components/Man2025/CLAUDE.md for cohort-API semantics.

Stage 3 referral (MAN_2025 only — exploratory descriptive)

Metric	Result
Malignant cases referred for specialist review	150/160 = 93.75%
Benign cases correctly NOT referred	875/2220 = 39.41%
Device-level ROC AUC on malignancy (atlas truth)	0.878 (10 malig / 139 benign)

The stage 3 referral readout is descriptive only. 10 malignant cases is insufficient for a confirmatory malignancy-accuracy or malignancy-referral-sensitivity claim; that is delegated to the NMSC dedicated investigation and to the PMCF Plan (R-TF-007-002).

Usability / workflow (SAN_2024 and PH_2024)

Study	No-referral rate	Remote-consult feasibility	Utility score
BI_2024	not reported	not reported	not reported
PH_2024	48.89%	60.74%	—
SAN_2024	58.1%	55.11%	8.0/10 usability · 7.3/10 diagnostic utility
MAN_2025	referral data reported as Stage 3 descriptive (above)	—	—

Cross-study patterns worth calling out

All four studies pass the ≥10 pp bar. Across four independent cohorts and four different reader panels (total 56 HCPs, 4,500+ paired observations), the device's core diagnostic-accuracy claim holds. This is the argument the CER uses in §Clinical performance — confirmatory MRMC programme.
Baseline accuracy tracks cohort difficulty exactly as expected.
- SAN_2024 (general derm, mostly FP I–III): highest unaided baseline 68.08%
- PH_2024 (pigmented lesions, experienced readers): 63.70%
- BI_2024 (rare diseases, harder diagnoses even for specialists): 47.94%
- MAN_2025 (FP V–VI, under-represented phototypes, harder for HCPs AND harder for the device): 41.79%
The harder the cohort, the larger the lift from the device. This is the main regulatory story: MAN_2025 has the lowest baseline and the largest Δ (+23.27 pp). BI_2024 rare-disease pooled stratum shows +32.32 pp. SAN_2024 PCP stratum shows +27.00 pp. The device delivers the most incremental value where the clinician is least confident. This supports the Pillar-3 clinical-performance claim and the indirect-benefit causal chain (see celine-clinical-consultant agent).
Aided performance converges across cohorts. Aided top-1 accuracy sits in a 63–89% band regardless of cohort difficulty. The spread narrows dramatically post-device (unaided range 41.79–68.08% → aided range 63.06–88.78%). The device compresses reader variance — a secondary argument that can be deployed in PMCF reasoning if reader-variance claims become relevant.
Statistical power is not the constraint in any of these. Even PH_2024 with n=9 readers produces p<0.001 on a self-controlled design. MAN_2025's 2,376 paired observations (χ² cc = 490.67, p ≈ 1×10⁻¹⁰⁸) make it by far the most statistically robust of the four.
BSI's "MRMC is not clinical data" stance (Nick, clarification meeting) does not hurt us. The four MRMC studies are framed as Rank 11 Pillar 3 simulated-use performance evidence — supporting, not primary. Primary real-world evidence is delivered by the legacy-device RWE study (task-3b2-3b3-legacy-rwe-study/). The MRMC programme shows controlled-environment performance; the RWE study shows real-world performance. Together they cover both dimensions (see §Evidence-hierarchy positioning in CLAUDE.md).

Caveats worth documenting

BI_2024 and PH_2024 per-pathology breakdowns are exploratory, not confirmatory. No multiple-testing correction was pre-specified; only aggregate pooled endpoints support the CE claim.
SAN_2024 does NOT support Fitzpatrick V–VI claims (Fitzpatrick V = 1 image = 3.6%, Fitzpatrick VI = 0 images in its set). MAN_2025 exists specifically to close this gap. SAN_2024's limitations section makes this explicit.
MAN_2025's malignancy readout is descriptive only. 10 malignant cases (7 melanoma, 3 BCC) are not enough for confirmatory malignancy claims. The NMSC study is the confirmatory source.
Cross-study Δ comparisons are directionally meaningful, not like-for-like. The image sets differ by design; cohort baselines differ; reader panels differ in specialty mix. The common axis is the device and the ≥10 pp threshold — not the absolute numbers. Do not draw inferences like "MAN_2025 is 1.5× better than BI_2024"; that is meaningless given the differences in case difficulty.
MAN_2025 uses public-atlas images, not trial-enrolled patients. The CIP is explicit that the study subjects are the READERS, not the individuals in the images. ISO 14155 Annex E ethics-non-applicability determination is documented in R-TF-015-010 MAN_2025 instance.

Data / script provenance

Artefact	Location
MAN_2025 raw dataset	`apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/clinical/Investigation/man-2025/data/`
MAN_2025 extract scripts	`apps/qms/scripts/fetch-man2025-sheets.mjs` · `apps/qms/scripts/build-man2025-dataset.py`
MAN_2025 analytics module	`apps/qms/src/components/Man2025/analytics.ts`
MAN_2025 renderers	`apps/qms/src/components/Man2025/{PrimaryOutcomeTable,AcceptanceCriteriaResultsTable,ReaderDemographicsTable}.tsx`
Shared acceptance-criteria SoT	`packages/ui/src/components/PerformanceClaimsAndClinicalBenefits/performanceClaims.ts` (rows M2N, M2A, M2R)
BI_2024 / PH_2024 / SAN_2024 CIRs	signed documents under `Investigation/{bi,ph,san}-2024/r-tf-015-006.mdx`
CER cross-reference	`legit-health-plus-version-1-1-0-0/.../Evaluation/R-TF-015-003-Clinical-Evaluation-Report.mdx` §Clinical performance
Statistical-summary CER appendix	`legit-health-plus-version-1-1-0-0/.../Evaluation/r-tf-015-013-statistical-summary.mdx`

How to refresh MAN_2025 numbers

# Pull latest sheet snapshot (service-account creds required)
node apps/qms/scripts/fetch-man2025-sheets.mjs

# Rebuild the de-identified dataset (writes data/*.json into the CIR folder)
python apps/qms/scripts/build-man2025-dataset.py

# Type-check and render
cd apps/qms && npx tsc --noEmit -p .
npm run start    # QMS on localhost:3000 — MAN_2025 CIR tables render from the JSON

Both the CIP's <AcceptanceCriteriaTable studyCode="MAN_2025" /> and the CIR's <Man2025AcceptanceCriteriaResultsTable /> pick up new thresholds / observed values automatically on the next build. No hand-editing of numerical tables anywhere in the MDX.

task-3b2-3b3-legacy-rwe-study/ — the real-world-evidence study (primary clinical data under Nick's hierarchy) that pairs with this MRMC programme.
task-3b13-man-2025-cep-cip-completeness/ — downstream CEP-row pull-through for MAN_2025.
task-3b14-ifu-integration-requirements-verification/ — integrator-responsibility mandate wording (Celine's agent check).

Study design at a glance​

Primary endpoint — paired top-1 accuracy improvement (pooled primary cohort)​

MAN_2025 computation provenance​

Secondary endpoints​

Sensitivity / specificity (HCP decision, unaided → aided)​

Specialty strata — paired top-1 accuracy, broken out by reader specialty​

Dermatologists (attending + residents, where applicable)​

Primary care / general practitioners (physicians)​

Nursing (MAN_2025 only — CIP-eligible category)​

MAN_2025 by qualification tier (fully-qualified attendings + senior nurses vs MIR residents)​

What the per-specialty patterns say​

Data provenance for the MAN_2025 per-specialty rows​

Stage 3 referral (MAN_2025 only — exploratory descriptive)​

Usability / workflow (SAN_2024 and PH_2024)​

Cross-study patterns worth calling out​

Caveats worth documenting​

Data / script provenance​

How to refresh MAN_2025 numbers​

Related internal workspaces​