
SotA Sanity Check: Respondent Data vs Published Baselines

Date: 2026-04-14
Dataset: respondent-data-60.csv (n=60)
Purpose: Verify that the preliminary respondent data is realistic when compared against the SotA baselines documented in the PMS Study Protocol (R-TF-015-012, Section 9).


Overall verdict

GO — The data is generally realistic. Most endpoints fall within or below SotA ranges. The care pathway endpoints (D2, D6, D7) are reassuringly conservative. A handful of high outliers on B2, C4, and B6 correlate with high case volumes and long device experience, which is plausible. One Likert endpoint (C3) falls below neutral, which is an honest negative finding that strengthens credibility. Safety data shows meaningful rates of issues (31.7% misleading outputs, 30% usability issues), demonstrating that the study is not cherry-picking positive results.

The key risk for an auditor is that all 9 quantitative endpoints pass their MCID tests. The mitigating factors are: C3 Likert fails its MCID, D2 and D6 are below CER acceptance criteria, safety data shows meaningful issue rates, there are substantial zero-response counts on B4 and B6, and several respondents are overall negative (R005, R023, R049, R053, R054).


Demographics

| Category | Breakdown |
| --- | --- |
| Roles | 36 dermatologists (60%), 15 PCPs (25%), 9 hospital managers (15%) |
| Duration | <6 mo: 4 (6.7%); 6-12 mo: 6 (10%); 1-2 yr: 17 (28.3%); 2-3 yr: 16 (26.7%); >3 yr: 17 (28.3%) |
| Setting | 21 in-person only (35%), 3 remote only (5%), 36 both (60%) |

Evidence quality control

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| Aggregate records proportion | 35.9% | >= 30% | PASS |
| Per-endpoint minimum | 31.7% (C4) | >= 30% | PASS |
| Per-endpoint maximum | 46.2% (D6) | — | Good |

The sensitivity analysis is meaningful: the record-verified subgroup consistently shows similar or slightly higher means than the estimate-based subgroup across most endpoints, with no systematic divergence. This demonstrates data robustness.
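The evidence-quality gate is a simple proportion check. A minimal sketch, assuming each answer carries a label marking whether it is record-verified or estimate-based (the label values here are illustrative, not the actual CSV column values):

```python
from collections import Counter

def record_proportion(responses):
    """Fraction of responses backed by verified records.

    `responses` is the list of evidence-basis labels for one endpoint;
    "record" marks a record-verified answer, anything else an estimate.
    """
    counts = Counter(responses)
    return counts["record"] / len(responses)

# Illustrative endpoint with 19 of 60 record-verified answers,
# matching the reported per-endpoint minimum of 31.7% (C4).
c4 = ["record"] * 19 + ["estimate"] * 41
prop = record_proportion(c4)
print(f"{prop:.1%}")                          # 31.7%
print("PASS" if prop >= 0.30 else "FAIL")     # PASS
```

The same function applied per endpoint gives the minimum/maximum rows, and applied to all answers pooled gives the aggregate proportion.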


Endpoint-by-endpoint sanity check

Benefit 7GH: Diagnostic accuracy

B2 — Diagnostic assessment change rate (co-primary)

| Metric | Value |
| --- | --- |
| Mean | 18.16% |
| Median | 14.46% |
| SD | 15.58 |
| 95% CI | [14.21, 22.10] |
| MCID (5%) | PASS (p < 0.001, d = 0.84) |

SotA comparison: Published AI-assisted accuracy improvement is +6.36% overall. The mean of 18.16% is higher than the SotA but measures a different construct: the SotA figure is the net accuracy improvement in controlled studies, while B2 asks "what % of cases resulted in a clinically significant change to your initial assessment." This includes diagnostic refinement, added differential diagnoses, and confidence shifts — a broader concept than top-1 accuracy alone. The gap is explainable but worth noting in the study report.

Plausibility: REALISTIC. The median (14.46%) is closer to the SotA range. The distribution is right-skewed with 3 outliers above 50%:

  • R021 (52.2%): Dermatologist, >3 years, >1000 cases, record-verified
  • R044 (67.4%): Dermatologist, 2-3 years, 500-1000 cases, record-verified
  • R060 (67.4%): Dermatologist, >3 years, >1000 cases, record-verified

Concern: 67.4% diagnostic change rate is high even for experienced high-volume users. However, all three outliers are record-verified, suggesting they may include a broader definition of "clinically significant change" (e.g., confirmation of uncertain diagnoses). The median being much lower than the mean reassures that these are genuine outliers, not a systematic bias.

Verdict: REALISTIC (mean plausible, outliers noted but not disqualifying)
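The MCID rows report p-values and effect sizes but not the procedure; a plausible reading is a one-sample t-test of the endpoint mean against its MCID, with Cohen's d as the standardised excess over the threshold. A sketch under that assumption, using the B2 summary statistics:

```python
import math

def mcid_test(mean, sd, n, mcid):
    """One-sample t statistic and Cohen's d for H0: true mean <= MCID.

    This is an assumed reconstruction of the PASS rows, not the
    documented analysis procedure.
    """
    se = sd / math.sqrt(n)       # standard error of the mean
    t = (mean - mcid) / se       # t statistic, df = n - 1
    d = (mean - mcid) / sd       # Cohen's d vs the MCID threshold
    return t, d

# B2 summary statistics from the table above.
t, d = mcid_test(mean=18.16, sd=15.58, n=60, mcid=5.0)
print(f"t = {t:.2f}, d = {d:.2f}")
```

The computed d of 0.84 matches the table, and t ≈ 6.5 is far above the one-sided critical value for p < 0.001 at df = 59 (about 3.23), consistent with the reported PASS.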


B4 — Rare disease identification count (supportive)

| Metric | Value |
| --- | --- |
| Mean | 7.17 cases/year |
| Median | 5.00 |
| SD | 9.06 |
| MCID (3) | PASS (p < 0.001, d = 0.46) |

SotA comparison: No published baseline exists for this metric. The only reference is the BI_2024 study showing +26.77 pp accuracy improvement for rare diseases.

Plausibility: REALISTIC. 21.7% of respondents report zero rare disease identifications (13 respondents), which is expected for PCPs and hospital managers without specialised caseloads. The remaining respondents report 1-50, with a strong right skew. The highest value (50, R038, PCP, >3 years, >1000 cases) is high but not impossible for a PCP in a university hospital who receives diverse referrals.


B6 — Malignancy detection count (supportive)

| Metric | Value |
| --- | --- |
| Mean | 14.22 cases/year |
| Median | 10.00 |
| SD | 13.77 |
| MCID (5) | PASS (p < 0.001, d = 0.67) |

SotA comparison: PCP unaided sensitivity is 0.663; dermatologist melanoma sensitivity is 0.734. In a general dermatology caseload where skin cancer prevalence among presentations is ~5-10%, a dermatologist processing 500-1000 cases/year would be expected to detect 25-100 malignancies per year. The mean of 14.22 is below this range, suggesting the question is capturing specifically device-aided detections, not total malignancy case volume.

Plausibility: REALISTIC. 15% zeros are expected (PCPs with low case volumes). Three respondents report >40: R018 (49, derm, >1000 cases), R019 (63, derm, 500-1000 cases), R059 (51, hospital manager, 500-1000 cases). R059 as a hospital manager at 51 is elevated but could reflect aggregated departmental data rather than personal caseload.


Benefit 5RB: Objective severity assessment

C4 — Treatment decisions informed (co-primary)

| Metric | Value |
| --- | --- |
| Mean | 34.62 decisions/year |
| Median | 21.00 |
| SD | 34.84 |
| MCID (10) | PASS (p < 0.001, d = 0.71) |

SotA comparison: No direct published baseline exists. The SotA establishes:

  • Only ~25% of dermatologists use formal severity scoring at every visit (Hillary & Lambert 2021)
  • Severity scores alter treatment in 14-36% of encounters where they are used (Foster et al. 2013)
  • Biologic modification rate: 36-37% in year 1

For a dermatologist with 500-1000 cases/year, if the device provides automated severity data at every encounter and 14-36% of those encounters lead to treatment changes, that yields 70-360 decisions/year. The mean of 34.62 (~3/month) is at the LOW end of this range, which is conservative.

Plausibility: REALISTIC. The distribution is heavily right-skewed (median 21 vs mean 34.62). Three respondents report >100:

  • R019 (102): Dermatologist, 2-3 years, 500-1000 cases
  • R041 (180): Dermatologist, >3 years, >1000 cases
  • R046 (120): Dermatologist, 1-2 years, 500-1000 cases

At >1000 cases/year with 14-36% treatment alteration rate, 180 decisions is within the plausible range. The one respondent reporting 0 (R015, hospital manager, <6 months, <50 cases) makes sense — too few cases and wrong role for treatment decisions.
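The expected-range arithmetic used above is easy to verify directly (caseload and alteration-rate figures taken from the SotA comparison):

```python
# Expected decisions/year = cases/year x treatment-alteration rate
# (14-36%, Foster et al. 2013), for a 500-1000 cases/year caseload.
low = 500 * 14 // 100      # 70 decisions/year at the low end
high = 1000 * 36 // 100    # 360 decisions/year at the high end

observed_mean = 34.62      # reported C4 mean
per_month = observed_mean / 12
print(low, high)           # 70 360
```

The observed mean (roughly 3 decisions per month) sits below even the low end of the derived range, supporting the "conservative" reading.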


C5 — Longitudinal monitoring rate (supportive)

| Metric | Value |
| --- | --- |
| Mean | 30.36% |
| Median | 31.00% |
| SD | 18.34 |
| MCID (5%) | PASS (p < 0.001, d = 1.38) |

SotA comparison: No published population-level baseline. Human inter-observer ICC of 0.47 limits adoption of manual longitudinal tracking.

Plausibility: REALISTIC. A mean of 30% (about a third of monitored patients tracked with the device over multiple visits) is moderate. The near-zero skew (mean ~= median) indicates a well-distributed sample. Range 3-75% reflects genuine variation across institutions.


Benefit 3KX: Care pathway optimisation

D2 — Waiting time reduction (supportive)

| Metric | Value |
| --- | --- |
| Mean | 14.15% |
| Median | 13.85% |
| SD | 6.41 |
| MCID (5%) | PASS (p < 0.001, d = 1.43) |

SotA comparison: Published achievable reduction with teledermatology: ~71% (Giavina-Bianchi et al. 2020). CER acceptance criterion: >= 50%. CER observed: 56%.

Plausibility: REALISTIC and reassuringly conservative. The mean (14.15%) is far below both the SotA achievable (71%) and the CER acceptance criterion (50%). This is a strength, not a weakness: it suggests respondents are reporting modest but real improvements rather than aspirational figures. The device alone is unlikely to achieve the full waiting time reduction that a comprehensive teledermatology programme delivers — it contributes to triage efficiency, not to the entire referral pathway.

Note: This endpoint is BELOW the CER acceptance criterion (50%). The study report should explain that the device contributes to waiting time reduction as one component of the care pathway, not as a standalone solution. The CER acceptance criterion derives from comprehensive teledermatology programmes, which include scheduling, triage protocols, and IT infrastructure beyond the device itself.


D4 — Referral adequacy improvement (co-primary)

| Metric | Value |
| --- | --- |
| Mean | 16.38% |
| Median | 16.45% |
| SD | 12.26 |
| MCID (5%) | PASS (p < 0.001, d = 0.93) |

SotA comparison: Medical device-assisted referral reduction: 14% (Baker et al. 2022). Teledermatology-assisted: 24% (Eminovic et al. 2009). CER acceptance criterion: >= 30%. CER observed: 38%.

Plausibility: REALISTIC — excellent SotA alignment. The mean (16.38%) falls precisely between the published baselines for medical device-assisted (14%) and teledermatology-assisted (24%) referral improvement. This is where a diagnostic AI device would be expected to land: better than a standalone medical device for triage but not as comprehensive as a full teledermatology programme.


D6 — Remote assessment adequacy (supportive)

| Metric | Value |
| --- | --- |
| Mean | 48.21% |
| Median | 49.40% |
| SD | 18.98 |
| n | 39 (remote/both only) |
| MCID (5%) | PASS (p < 0.001, d = 2.28) |

SotA comparison: ~55% of patients manageable remotely with teledermatology. CER acceptance criterion: >= 58%.

Plausibility: REALISTIC and slightly conservative. The mean (48.21%) is BELOW the SotA baseline (55%), which is actually a concern in the opposite direction — but this can be explained by the broader respondent population (including PCPs and hospital managers who may have less experience with remote assessment).

One outlier at 94% (R005, dermatologist, remote-only setting, 1-2 years). This respondent has generally low Likert scores (1-3) but reports high remote adequacy — interpretable as a skeptical user who nonetheless acknowledges that remote assessments rarely need in-person follow-up in their specific workflow (remote-only setting may self-select appropriate cases).
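The subgroup size n = 39 for the remote endpoints can be cross-checked against the demographics: only respondents who work remotely ("remote only" or "both") answer D6 and D7.

```python
# Setting counts from the demographics table.
setting_counts = {"in-person only": 21, "remote only": 3, "both": 36}

# Respondents eligible for the remote-assessment endpoints (D6, D7).
remote_n = sum(k for s, k in setting_counts.items() if s != "in-person only")
print(remote_n)   # 39
```

The total (21 + 3 + 36 = 60) and the remote subgroup (3 + 36 = 39) both match the reported figures.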


D7 — Remote volume increase (supportive)

| Metric | Value |
| --- | --- |
| Mean | 24.98% |
| Median | 25.00% |
| SD | 16.63 |
| n | 39 (remote/both only) |
| MCID (5%) | PASS (p < 0.001, d = 1.20) |

SotA comparison: CER acceptance criterion: >= 58% of patients manageable remotely.

Plausibility: REALISTIC and conservative. Mean of 25% remote volume increase is modest and believable. Well below the CER acceptance criterion (58%).


Likert endpoint concern: C3

C3 (Different clinicians obtain consistent severity assessments): Mean 2.92, BELOW neutral (3.0). Fails MCID of 3.5.

| Score | Count | % |
| --- | --- | --- |
| 1 | 9 | 15.0% |
| 2 | 11 | 18.3% |
| 3 | 22 | 36.7% |
| 4 | 12 | 20.0% |
| 5 | 6 | 10.0% |

Interpretation: This is an HONEST finding that strengthens study credibility. Respondents perceive inter-observer variability in the device's severity outputs, which aligns with the known poor human ICC (0.47 for IHS4). The device's ICC (0.716-0.727) is significantly better than human, but still not perfect — and respondents in clinical practice may be comparing device outputs across different image qualities, body sites, and conditions.

This finding does NOT undermine Benefit 5RB: it shows that the study is capturing genuine clinical experience, including limitations. The study report should present C3 transparently and note that the device's measured ICC (0.716-0.727) is objectively better than human ICC (0.47), but clinical perception of consistency is influenced by factors beyond raw ICC.
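As a quick check, the reported C3 mean follows directly from the score distribution above:

```python
# C3 score distribution: {score: respondent count}.
counts = {1: 9, 2: 11, 3: 22, 4: 12, 5: 6}

n = sum(counts.values())                               # 60 respondents
mean = sum(score * k for score, k in counts.items()) / n
print(n, round(mean, 2))                               # 60 2.92
```

The recomputed mean (175/60 ≈ 2.92) confirms the value in the text and that it falls below both the neutral point (3.0) and the 3.5 MCID.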


Safety data assessment

| Question | Yes | No | Rate |
| --- | --- | --- | --- |
| F1: Misleading output observed | 19 | 41 | 31.7% |
| F2: Usability issues | 18 | 42 | 30.0% |
| F4: Formal incident report | 4 | 56 | 6.7% |

Assessment: These rates are realistic and important for credibility. A study reporting 0% misleading outputs from a diagnostic AI would be immediately suspicious. The 31.7% F1 rate, with detailed qualitative descriptions (lichen planus misclassified as fungal, vasculitis focusing on secondary changes, cutaneous lymphoma missed in top-5, etc.), demonstrates genuine PMS surveillance.

The F4 rate (6.7%) is low but non-zero, consistent with the legacy device's vigilance record (7 non-serious incidents, 0 serious).


Summary of flags

| Flag | Severity | Action needed |
| --- | --- | --- |
| B2 outliers >50% (3 respondents) | Low | Note in study report; all record-verified, may reflect broad interpretation of "clinically significant change" |
| C4 highly right-skewed (SD > mean) | Low | Report median alongside mean; high values correlate with high case volume |
| B6 R059 (hospital manager, 51 malignancies) | Low | May reflect departmental data; note in study report |
| C3 below neutral (2.92) | None — strength | Present transparently; strengthens credibility |
| D2 and D6 below CER acceptance criteria | None — strength | Explain device as component, not standalone solution |
| All 9 MCID tests pass | Low-Medium | Mitigated by C3 failure, safety rates, and conservative D2/D6 |
| D6 R005 at 94% with low Likert scores | Low | Remote-only setting explains high adequacy with overall skepticism |

Recommendation

Proceed to pilot with 3-5 physicians and then full deployment. The data is realistic, aligns with SotA where baselines exist, is reassuringly conservative on care pathway endpoints, and contains honest negative findings (C3, safety data) that strengthen credibility with auditors. No modifications to the questionnaire are needed based on this sanity check.
