MRMC cross-study comparison — BI_2024 · PH_2024 · SAN_2024 · MAN_2025
Scope: internal, audit-invisible. All four MRMC investigations that sit in the Plus technical file compared against the same endpoint template. Figures pulled from signed CIRs and, for MAN_2025, computed live from the locked de-identified dataset at
apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/clinical/Investigation/man-2025/data/.Last refresh: 2026-04-20 (MAN_2025 data lock 2026-04-17; BI/PH/SAN values from their signed CIRs).
Primary endpoint template (all four studies): paired top-1 diagnostic accuracy, HCP unaided vs HCP aided by the device, pre-specified acceptance criterion ≥10 pp absolute improvement, McNemar two-sided p<0.05.
Study design at a glance
| Study | Purpose | CIP location | CIR location | Readers | Cases / image set | Paired obs | Design |
|---|---|---|---|---|---|---|---|
| BI_2024 | Rare dermatological diseases | Investigation/bi-2024/r-tf-015-005.mdx | Investigation/bi-2024/r-tf-015-006.mdx | 15 (11 PCP + 4 derm) | ~100 curated images, rare-disease focus | 1,449 | Self-controlled MRMC, paired unaided→aided, remote web platform |
| PH_2024 | Pigmented skin lesions / photographic | Investigation/ph-2024/r-tf-015-005.mdx | Investigation/ph-2024/r-tf-015-006.mdx | 9 | 30 image sets, 8 diagnostic classes | ~270 | Self-controlled MRMC, paired unaided→aided |
| SAN_2024 | General dermatology, mixed conditions | Investigation/san-2024/r-tf-015-005.mdx | Investigation/san-2024/r-tf-015-006.mdx | 16 (10 PCP + 6 derm) | 29 images × 16 readers (401 completed) | 401 | Prospective observational MRMC, remote web platform |
| MAN_2025 | Fitzpatrick V–VI phototypes only (PMCF) | Investigation/man-2025/r-tf-015-004.mdx | Investigation/man-2025/r-tf-015-006.mdx | 16 (primary) / 19 enrolled | 149 curated atlas images (all FP V–VI) | 2,376 | Three-stage MRMC (unaided → aided → referral), self-controlled |
Primary endpoint — paired top-1 accuracy improvement (pooled primary cohort)
| Study | Unaided | Aided | Δ (pp) | McNemar p | Acceptance | Status |
|---|---|---|---|---|---|---|
| BI_2024 | 47.94% | 63.06% | +15.12 | <0.001 | ≥10 pp | PASS |
| PH_2024 | 63.70% | 81.85% | +18.15 | <0.001 | ≥10 pp | PASS |
| SAN_2024 | 68.08% | 88.78% | +20.70 | <0.0001 | ≥10 pp | PASS |
| MAN_2025 | 41.79% | 65.07% | +23.27 | ≈1.0×10⁻¹⁰⁸ (χ² cc = 490.67) | ≥10 pp | PASS |
All four studies clear the ≥10 pp pre-specified bar. MAN_2025 has the largest improvement (+23.27 pp) and the lowest unaided baseline.
MAN_2025 computation provenance
The MAN_2025 numbers above are computed directly from the locked dataset in this repo:
# One-shot computation run from repo root on 2026-04-20
import json
base = 'apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/clinical/Investigation/man-2025/data'
subs = json.load(open(f'{base}/submissions.json')) # 8542 rows, categories: diagnosis / assisted-diagnosis / referral
meta = json.load(open(f'{base}/meta.json')) # qualifiedReaders list (16 in primary cohort)
cases = json.load(open(f'{base}/cases.json')) # 149 cases
# Pair (readerCode, caseId) submissions for category ∈ {diagnosis, assisted-diagnosis}
# Compare each answer to case.correctCondition (exact-string)
# Result: 2376 paired obs · unaided 993/2376 (41.79%) · aided 1546/2376 (65.07%) · Δ +23.27 pp
# McNemar discordant pairs: b=34 (correct→incorrect) · c=587 (incorrect→correct) · χ² cc = 490.67
These numbers match what <Man2025PrimaryOutcomeTable /> and <Man2025AcceptanceCriteriaResultsTable /> render in the CIR at build time. The analytics live in apps/qms/src/components/Man2025/analytics.ts (deterministic, no network, unit-testable). The thresholds (≥10 pp, etc.) live once in packages/ui/src/components/PerformanceClaimsAndClinicalBenefits/performanceClaims.ts under studyId: "MAN_2025" rows (M2N, M2A, M2R). See apps/qms/src/components/Man2025/CLAUDE.md for the pipeline.
Secondary endpoints
Sensitivity / specificity (HCP decision, unaided → aided)
| Study | Sensitivity | Δ | Specificity | Δ | Notes |
|---|---|---|---|---|---|
| BI_2024 | 52.61% → 71.04% | +18.43 | 56.45% → 75.83% | +19.38 | Pooled across primary cohort |
| PH_2024 | 68.55% → 83.15% | +14.60 | 78.01% → 89.91% | +11.90 | Pooled across 9 readers |
| SAN_2024 | not separately reported — top-1 accuracy is the confirmatory endpoint | — | — | — | See SAN_2024 CIR §Secondary endpoints |
| MAN_2025 | malignant-case accuracy 56.88% → 78.75% (n=160 paired obs on 10 malignant cases) | +21.88 | — | — | Descriptive only — see §Stage 3 referral for referral-sensitivity |
Specialty strata — paired top-1 accuracy, broken out by reader specialty
The four studies were designed with different reader mixes, so cross-study specialty comparison is not a like-for-like contrast, but the shape of the result is consistent: the lower the unaided baseline, the larger the Δ from device assistance, across every specialty and every study.
Dermatologists (attending + residents, where applicable)
| Study | n readers | Paired obs | Unaided | Aided | Δ (pp) | Threshold | Status |
|---|---|---|---|---|---|---|---|
| BI_2024 | 4 (att.) | 400 | 57.25% | 65.65% | +8.39 | ≥5 pp | Directional, supportive (n=4 under-powered) |
| PH_2024 | 0 | — | — | — | — | — | No dermatologists in PH_2024 panel |
| SAN_2024 | 6 (att.) | 91 | 76.47% | 86.93% | +10.50 | ≥5 pp | PASS |
| MAN_2025 | 9 (3 att. + 6 res.) | 1,334 | 47.90% | 64.24% | +16.34 | — (pooled is the endpoint) | Exploratory; derm baseline is low because FP V–VI images are harder even for specialists |
Primary care / general practitioners (physicians)
| Study | n readers | Paired obs | Unaided | Aided | Δ (pp) | Threshold | Status |
|---|---|---|---|---|---|---|---|
| BI_2024 | 11 | 1,049 | 44.71% | 61.71% | +17.00 | ≥10 pp | PASS |
| PH_2024 | 9 | ~270 | 63.70% | 81.85% | +18.15 | ≥10 pp | PASS (PH_2024 is PCP-only by design) |
| SAN_2024 | 10 | 310 | 62.90% | 89.92% | +27.00 | ≥10 pp | PASS |
| MAN_2025 | 4 (1 att. + 3 res.) | 596 | 36.58% | 64.93% | +28.36 | — (pooled is the endpoint) | Exploratory; 4 primary-care readers in MAN_2025 |
Nursing (MAN_2025 only — CIP-eligible category)
| Study | n readers | Paired obs | Unaided | Aided | Δ (pp) |
|---|---|---|---|---|---|
| MAN_2025 | 3 | 446 | 30.49% | 67.71% | +37.22 |
Nursing was not admitted as a reader category in BI_2024, PH_2024 or SAN_2024 (those CIPs pre-date the CIP correction that explicitly admits licensed nurses with skin/wound scope). MAN_2025 is the first study in the programme to include them.
MAN_2025 by qualification tier (fully-qualified attendings + senior nurses vs MIR residents)
| Tier | n readers | Paired obs | Unaided | Aided | Δ (pp) |
|---|---|---|---|---|---|
| Fully-qualified-target (attendings + senior nurses) | 7 | 1,039 | 38.50% | 69.59% | +31.09 |
| Resident-target (MIR residents) | 9 | 1,337 | 44.35% | 61.56% | +17.20 |
Counter-intuitive at first glance: residents have a higher unaided baseline than attendings in MAN_2025. That's because the "fully-qualified" bucket mixes attending dermatologists, attending primary-care, and three licensed nurses with dermatology/wound scope — the nurse readers drag the unaided baseline down, and the device then pulls them up more than it pulls up the MIR residents. The CIR's sensitivity row reports this explicitly; the regulatory endpoint remains the pooled primary-cohort Δ of +23.27 pp.
What the per-specialty patterns say
-
Specialists converge aided. Dermatologists end up in a 64–87% aided-accuracy band across all three studies that recruited them. BI_2024 dermatologist aided (65.65%) and MAN_2025 dermatologist aided (64.24%) sit almost on top of each other despite the cohorts being entirely different — in BI it's rare pustular dermatoses on mostly European skin; in MAN it's common conditions on FP V–VI skin. The device plateau is similar.
-
PCPs benefit most in absolute terms. +17.00 (BI), +18.15 (PH), +27.00 (SAN), +28.36 (MAN) — every PCP stratum clears ≥10 pp with room to spare. MAN_2025 has the largest PCP Δ because the PCP baseline on FP V–VI is very low (36.58%).
-
Nursing is the strongest Δ of any stratum in the programme (+37.22 pp). That is directly regulatorily important: it supports the CEP claim that the device delivers value for the full intended-user population, not just physicians, and it supports Celine's Pillar-3 indirect-benefit causal chain (less-expert reader → more device benefit → better patient decision).
-
Cross-study specialty comparisons must be read carefully. The image sets differ; the conditions differ; reader experience distributions differ. You can compare shapes (derms always lowest Δ, PCPs always larger, nursing largest where present) but you cannot compare levels (SAN's 89.92% aided PCP accuracy on easy derm cases does not mean SAN's PCPs are better than MAN's PCPs; the cases are easier).
Data provenance for the MAN_2025 per-specialty rows
Computed from the locked dataset by bucketing qualified readers by readers.json → specialty:
| Specialty (onboarding form) | Qualified readers in MAN_2025 |
|---|---|
dermatology | R-01, R-04, R-05, R-07, R-08, R-09, R-10, R-12, R-19 (n=9) |
general (primary care) | R-02, R-13, R-14, R-18 (n=4) |
nursing | R-06, R-16, R-17 (n=3) |
Then the same (diagnosis, assisted-diagnosis) paired comparison as the primary endpoint, filtered to each specialty bucket. The code for this lives in apps/qms/src/components/Man2025/analytics.ts (filterReaders + computePaired); see apps/qms/src/components/Man2025/CLAUDE.md for cohort-API semantics.
Stage 3 referral (MAN_2025 only — exploratory descriptive)
| Metric | Result |
|---|---|
| Malignant cases referred for specialist review | 150/160 = 93.75% |
| Benign cases correctly NOT referred | 875/2220 = 39.41% |
| Device-level ROC AUC on malignancy (atlas truth) | 0.878 (10 malig / 139 benign) |
The stage 3 referral readout is descriptive only. 10 malignant cases is insufficient for a confirmatory malignancy-accuracy or malignancy-referral-sensitivity claim; that is delegated to the NMSC dedicated investigation and to the PMCF Plan (R-TF-007-002).
Usability / workflow (SAN_2024 and PH_2024)
| Study | No-referral rate | Remote-consult feasibility | Utility score |
|---|---|---|---|
| BI_2024 | not reported | not reported | not reported |
| PH_2024 | 48.89% | 60.74% | — |
| SAN_2024 | 58.1% | 55.11% | 8.0/10 usability · 7.3/10 diagnostic utility |
| MAN_2025 | referral data reported as Stage 3 descriptive (above) | — | — |
Cross-study patterns worth calling out
-
All four studies pass the ≥10 pp bar. Across four independent cohorts and four different reader panels (total 56 HCPs, 4,500+ paired observations), the device's core diagnostic-accuracy claim holds. This is the argument the CER uses in §Clinical performance — confirmatory MRMC programme.
-
Baseline accuracy tracks cohort difficulty exactly as expected.
- SAN_2024 (general derm, mostly FP I–III): highest unaided baseline 68.08%
- PH_2024 (pigmented lesions, experienced readers): 63.70%
- BI_2024 (rare diseases, harder diagnoses even for specialists): 47.94%
- MAN_2025 (FP V–VI, under-represented phototypes, harder for HCPs AND harder for the device): 41.79%
-
The harder the cohort, the larger the lift from the device. This is the main regulatory story: MAN_2025 has the lowest baseline and the largest Δ (+23.27 pp). BI_2024 rare-disease pooled stratum shows +32.32 pp. SAN_2024 PCP stratum shows +27.00 pp. The device delivers the most incremental value where the clinician is least confident. This supports the Pillar-3 clinical-performance claim and the indirect-benefit causal chain (see
celine-clinical-consultantagent). -
Aided performance converges across cohorts. Aided top-1 accuracy sits in a 63–89% band regardless of cohort difficulty. The spread narrows dramatically post-device (unaided range 41.79–68.08% → aided range 63.06–88.78%). The device compresses reader variance — a secondary argument that can be deployed in PMCF reasoning if reader-variance claims become relevant.
-
Statistical power is not the constraint in any of these. Even PH_2024 with n=9 readers produces p<0.001 on a self-controlled design. MAN_2025's 2,376 paired observations (χ² cc = 490.67, p ≈ 1×10⁻¹⁰⁸) make it by far the most statistically robust of the four.
-
BSI's "MRMC is not clinical data" stance (Nick, clarification meeting) does not hurt us. The four MRMC studies are framed as Rank 11 Pillar 3 simulated-use performance evidence — supporting, not primary. Primary real-world evidence is delivered by the legacy-device RWE study (
task-3b2-3b3-legacy-rwe-study/). The MRMC programme shows controlled-environment performance; the RWE study shows real-world performance. Together they cover both dimensions (see §Evidence-hierarchy positioning inCLAUDE.md).
Caveats worth documenting
- BI_2024 and PH_2024 per-pathology breakdowns are exploratory, not confirmatory. No multiple-testing correction was pre-specified; only aggregate pooled endpoints support the CE claim.
- SAN_2024 does NOT support Fitzpatrick V–VI claims (Fitzpatrick V = 1 image = 3.6%, Fitzpatrick VI = 0 images in its set). MAN_2025 exists specifically to close this gap. SAN_2024's limitations section makes this explicit.
- MAN_2025's malignancy readout is descriptive only. 10 malignant cases (7 melanoma, 3 BCC) are not enough for confirmatory malignancy claims. The NMSC study is the confirmatory source.
- Cross-study Δ comparisons are directionally meaningful, not like-for-like. The image sets differ by design; cohort baselines differ; reader panels differ in specialty mix. The common axis is the device and the ≥10 pp threshold — not the absolute numbers. Do not draw inferences like "MAN_2025 is 1.5× better than BI_2024"; that is meaningless given the differences in case difficulty.
- MAN_2025 uses public-atlas images, not trial-enrolled patients. The CIP is explicit that the study subjects are the READERS, not the individuals in the images. ISO 14155 Annex E ethics-non-applicability determination is documented in
R-TF-015-010MAN_2025 instance.
Data / script provenance
| Artefact | Location |
|---|---|
| MAN_2025 raw dataset | apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/clinical/Investigation/man-2025/data/ |
| MAN_2025 extract scripts | apps/qms/scripts/fetch-man2025-sheets.mjs · apps/qms/scripts/build-man2025-dataset.py |
| MAN_2025 analytics module | apps/qms/src/components/Man2025/analytics.ts |
| MAN_2025 renderers | apps/qms/src/components/Man2025/{PrimaryOutcomeTable,AcceptanceCriteriaResultsTable,ReaderDemographicsTable}.tsx |
| Shared acceptance-criteria SoT | packages/ui/src/components/PerformanceClaimsAndClinicalBenefits/performanceClaims.ts (rows M2N, M2A, M2R) |
| BI_2024 / PH_2024 / SAN_2024 CIRs | signed documents under Investigation/{bi,ph,san}-2024/r-tf-015-006.mdx |
| CER cross-reference | legit-health-plus-version-1-1-0-0/.../Evaluation/R-TF-015-003-Clinical-Evaluation-Report.mdx §Clinical performance |
| Statistical-summary CER appendix | legit-health-plus-version-1-1-0-0/.../Evaluation/r-tf-015-013-statistical-summary.mdx |
How to refresh MAN_2025 numbers
# Pull latest sheet snapshot (service-account creds required)
node apps/qms/scripts/fetch-man2025-sheets.mjs
# Rebuild the de-identified dataset (writes data/*.json into the CIR folder)
python apps/qms/scripts/build-man2025-dataset.py
# Type-check and render
cd apps/qms && npx tsc --noEmit -p .
npm run start # QMS on localhost:3000 — MAN_2025 CIR tables render from the JSON
Both the CIP's <AcceptanceCriteriaTable studyCode="MAN_2025" /> and the CIR's <Man2025AcceptanceCriteriaResultsTable /> pick up new thresholds / observed values automatically on the next build. No hand-editing of numerical tables anywhere in the MDX.
Related internal workspaces
- task-3b2-3b3-legacy-rwe-study/ — the real-world-evidence study (primary clinical data under Nick's hierarchy) that pairs with this MRMC programme.
- task-3b13-man-2025-cep-cip-completeness/ — downstream CEP-row pull-through for MAN_2025.
- task-3b14-ifu-integration-requirements-verification/ — integrator-responsibility mandate wording (Celine's agent check).