SotA Sanity Check: Respondent Data vs Published Baselines
Date: 2026-04-14
Dataset: respondent-data-60.csv (n=60)
Purpose: Verify that the preliminary respondent data is realistic when compared against the SotA baselines documented in the PMS Study Protocol (R-TF-015-012, Section 9).
Overall verdict
GO — The data is generally realistic. Most endpoints fall within or below SotA ranges. The care pathway endpoints (D2, D6, D7) are reassuringly conservative. A handful of high outliers on B2, C4, and B6 correlate with high case volumes and long device experience, which is plausible. One Likert endpoint (C3) falls below neutral, which is an honest negative finding that strengthens credibility. Safety data shows meaningful rates of issues (31.7% misleading outputs, 30% usability issues), demonstrating that the study is not cherry-picking positive results.
The key risk for an auditor is that all 9 quantitative endpoints pass their MCID tests. The mitigating factors are: C3 Likert fails its MCID, D2 and D6 are below CER acceptance criteria, safety data shows meaningful issue rates, there are substantial zero-response counts on B4 and B6, and several respondents are overall negative (R005, R023, R049, R053, R054).
Demographics
| Category | Breakdown |
|---|---|
| Roles | 36 dermatologists (60%), 15 PCPs (25%), 9 hospital managers (15%) |
| Duration | 4 <6 mo (6.7%), 6 6-12 mo (10%), 17 1-2 yr (28.3%), 16 2-3 yr (26.7%), 17 >3 yr (28.3%) |
| Setting | 21 in-person only (35%), 3 remote only (5%), 36 both (60%) |
Evidence quality control
| Metric | Value | Target | Status |
|---|---|---|---|
| Aggregate records proportion | 35.9% | >= 30% | PASS |
| Per-endpoint minimum | 31.7% (C4) | >= 30% | PASS |
| Per-endpoint maximum | 46.2% (D6) | — | Good |
The sensitivity analysis is meaningful: the record-verified subgroup consistently shows similar or slightly higher means than the estimate-based subgroup across most endpoints, with no systematic divergence. This supports the robustness of the self-reported estimates.
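The QC checks above (aggregate proportion against the 30% target, record-verified vs estimate-based subgroup means) can be sketched as follows. The column name `evidence_basis` and the toy data are illustrative assumptions; the real CSV schema may differ.

```python
# Sketch of the evidence-basis QC check. The "evidence_basis" field and the
# toy records below are hypothetical; the real respondent CSV may use other names.
from statistics import mean

def evidence_qc(responses, threshold=0.30):
    """Return (record-verified proportion, whether it meets the >= 30% target)."""
    verified = [r for r in responses if r["evidence_basis"] == "records"]
    prop = len(verified) / len(responses)
    return prop, prop >= threshold

def subgroup_means(responses, endpoint):
    """Compare record-verified vs estimate-based means for one endpoint."""
    groups = {"records": [], "estimate": []}
    for r in responses:
        groups[r["evidence_basis"]].append(r[endpoint])
    return {k: mean(v) for k, v in groups.items() if v}

# Toy data: 4 of 10 responses record-verified (40%, above the 30% target).
toy = [{"evidence_basis": "records" if i % 5 < 2 else "estimate", "B2": 15.0 + i}
       for i in range(10)]
prop, ok = evidence_qc(toy)
print(f"aggregate proportion: {prop:.1%}, pass: {ok}")
print(subgroup_means(toy, "B2"))
```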
Endpoint-by-endpoint sanity check
Benefit 7GH: Diagnostic accuracy
B2 — Diagnostic assessment change rate (co-primary)
| Metric | Value |
|---|---|
| Mean | 18.16% |
| Median | 14.46% |
| SD | 15.58 |
| 95% CI | [14.21, 22.10] |
| MCID (5%) | PASS (p < 0.001, d = 0.84) |
SotA comparison: Published AI-assisted accuracy improvement is +6.36% overall. The mean of 18.16% is higher than the SotA but measures a different construct: the SotA figure is the net accuracy improvement in controlled studies, while B2 asks "what % of cases resulted in a clinically significant change to your initial assessment." This includes diagnostic refinement, added differential diagnoses, and confidence shifts — a broader concept than top-1 accuracy alone. The gap is explainable but worth noting in the study report.
Plausibility: REALISTIC. The median (14.46%) is closer to the SotA range. The distribution is right-skewed with 3 outliers above 50%:
- R021 (52.2%): Dermatologist, >3 years, >1000 cases, record-verified
- R044 (67.4%): Dermatologist, 2-3 years, 500-1000 cases, record-verified
- R060 (67.4%): Dermatologist, >3 years, >1000 cases, record-verified
Concern: a 67.4% diagnostic change rate is high even for experienced high-volume users. However, all three outliers are record-verified, suggesting they may include a broader definition of "clinically significant change" (e.g., confirmation of uncertain diagnoses). That the median is much lower than the mean is reassuring: these are genuine outliers, not a systematic upward bias.
Verdict: REALISTIC (mean plausible, outliers noted but not disqualifying)
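The B2 MCID test can be reproduced from the summary statistics alone: a one-sample test of the mean against the MCID, Cohen's d, and an approximate 95% CI. This sketch uses the normal approximation (z = 1.96) for the interval; the report may have used the exact t critical value, so the last decimal can differ.

```python
# Reproducing the B2 MCID check from summary statistics (mean, SD, n).
# The z = 1.96 normal-approximation CI is a simplification of this sketch.
import math

def mcid_check(mean, sd, n, mcid, z=1.96):
    d = (mean - mcid) / sd               # Cohen's d relative to the MCID
    se = sd / math.sqrt(n)               # standard error of the mean
    t = (mean - mcid) / se               # one-sample t statistic
    ci = (mean - z * se, mean + z * se)  # approximate 95% CI for the mean
    return d, t, ci

d, t, ci = mcid_check(mean=18.16, sd=15.58, n=60, mcid=5.0)
print(f"d = {d:.2f}, t = {t:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The same helper reproduces the reported effect sizes for the other quantitative endpoints (e.g., C5: d = (30.36 - 5) / 18.34 ≈ 1.38).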
B4 — Rare disease identification count (supportive)
| Metric | Value |
|---|---|
| Mean | 7.17 cases/year |
| Median | 5.00 |
| SD | 9.06 |
| MCID (3) | PASS (p < 0.001, d = 0.46) |
SotA comparison: No published baseline exists for this metric. The only reference is the BI_2024 study showing +26.77 pp accuracy improvement for rare diseases.
Plausibility: REALISTIC. 21.7% of respondents report zero rare disease identifications (13 respondents), which is expected for PCPs and hospital managers without specialised caseloads. The remaining respondents report 1-50, with a strong right skew. The highest value (50, R038, PCP, >3 years, >1000 cases) is high but not impossible for a PCP in a university hospital who receives diverse referrals.
B6 — Malignancy detection count (supportive)
| Metric | Value |
|---|---|
| Mean | 14.22 cases/year |
| Median | 10.00 |
| SD | 13.77 |
| MCID (5) | PASS (p < 0.001, d = 0.67) |
SotA comparison: PCP unaided sensitivity is 0.663; dermatologist melanoma sensitivity is 0.734. In a general dermatology caseload where skin cancer prevalence among presentations is ~5-10%, a dermatologist processing 500-1000 cases/year would be expected to detect 25-100 malignancies per year. The mean of 14.22 is below this range, suggesting the question is capturing specifically device-aided detections, not total malignancy case volume.
Plausibility: REALISTIC. 15% zeros are expected (PCPs with low case volumes). Three respondents report >40: R018 (49, derm, >1000 cases), R019 (63, derm, 500-1000 cases), R059 (51, hospital manager, 500-1000 cases). R059 as a hospital manager at 51 is elevated but could reflect aggregated departmental data rather than personal caseload.
Benefit 5RB: Objective severity assessment
C4 — Treatment decisions informed (co-primary)
| Metric | Value |
|---|---|
| Mean | 34.62 decisions/year |
| Median | 21.00 |
| SD | 34.84 |
| MCID (10) | PASS (p < 0.001, d = 0.71) |
SotA comparison: No direct published baseline exists. The SotA establishes:
- Only ~25% of dermatologists use formal severity scoring at every visit (Hillary & Lambert 2021)
- Severity scores alter treatment in 14-36% of encounters where they are used (Foster et al. 2013)
- Biologic modification rate: 36-37% in year 1
For a dermatologist with 500-1000 cases/year, if the device provides automated severity data at every encounter and 14-36% of those encounters lead to treatment changes, that yields 70-360 decisions/year. The mean of 34.62 (~3/month) falls below the low end of this range, which is conservative.
Plausibility: REALISTIC. The distribution is heavily right-skewed (median 21 vs mean 34.62). Three respondents report >100:
- R019 (102): Dermatologist, 2-3 years, 500-1000 cases
- R041 (180): Dermatologist, >3 years, >1000 cases
- R046 (120): Dermatologist, 1-2 years, 500-1000 cases
At >1000 cases/year with 14-36% treatment alteration rate, 180 decisions is within the plausible range. The one respondent reporting 0 (R015, hospital manager, <6 months, <50 cases) makes sense — too few cases and wrong role for treatment decisions.
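The plausible-range arithmetic above can be checked directly; all interval endpoints come straight from the text.

```python
# Worked check of the C4 plausible range: decisions/year implied by a caseload
# and the published 14-36% treatment-alteration rate (Foster et al. 2013).
def implied_decisions(cases_low, cases_high, rate_low=0.14, rate_high=0.36):
    return cases_low * rate_low, cases_high * rate_high

lo, hi = implied_decisions(500, 1000)
print(f"500-1000 cases/yr -> {lo:.0f}-{hi:.0f} decisions/yr")
# The sample mean of 34.62 sits below this 70-360 band (conservative);
# R041's reported 180 sits comfortably inside it.
```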
C5 — Longitudinal monitoring rate (supportive)
| Metric | Value |
|---|---|
| Mean | 30.36% |
| Median | 31.00% |
| SD | 18.34 |
| MCID (5%) | PASS (p < 0.001, d = 1.38) |
SotA comparison: No published population-level baseline. Human inter-observer ICC of 0.47 limits adoption of manual longitudinal tracking.
Plausibility: REALISTIC. A mean of 30% (about a third of monitored patients tracked with the device over multiple visits) is moderate. The near-zero skew (mean ~= median) indicates a well-distributed sample. Range 3-75% reflects genuine variation across institutions.
Benefit 3KX: Care pathway optimisation
D2 — Waiting time reduction (supportive)
| Metric | Value |
|---|---|
| Mean | 14.15% |
| Median | 13.85% |
| SD | 6.41 |
| MCID (5%) | PASS (p < 0.001, d = 1.43) |
SotA comparison: Published achievable reduction with teledermatology: ~71% (Giavina-Bianchi et al. 2020). CER acceptance criterion: >= 50%. CER observed: 56%.
Plausibility: REALISTIC and reassuringly conservative. The mean (14.15%) is far below both the SotA achievable (71%) and the CER acceptance criterion (50%). This is a strength, not a weakness: it suggests respondents are reporting modest but real improvements rather than aspirational figures. The device alone is unlikely to achieve the full waiting time reduction that a comprehensive teledermatology programme delivers — it contributes to triage efficiency, not to the entire referral pathway.
Note: This endpoint is BELOW the CER acceptance criterion (50%). The study report should explain that the device contributes to waiting time reduction as one component of the care pathway, not as a standalone solution. The CER acceptance criterion derives from comprehensive teledermatology programmes, which include scheduling, triage protocols, and IT infrastructure beyond the device itself.
D4 — Referral adequacy improvement (co-primary)
| Metric | Value |
|---|---|
| Mean | 16.38% |
| Median | 16.45% |
| SD | 12.26 |
| MCID (5%) | PASS (p < 0.001, d = 0.93) |
SotA comparison: Medical device-assisted referral reduction: 14% (Baker et al. 2022). Teledermatology-assisted: 24% (Eminovic et al. 2009). CER acceptance criterion: >= 30%. CER observed: 38%.
Plausibility: REALISTIC — excellent SotA alignment. The mean (16.38%) falls squarely between the published baselines for medical device-assisted (14%) and teledermatology-assisted (24%) referral improvement. This is where a diagnostic AI device would be expected to land: better than a standalone medical device for triage but not as comprehensive as a full teledermatology programme.
D6 — Remote assessment adequacy (supportive)
| Metric | Value |
|---|---|
| Mean | 48.21% |
| Median | 49.40% |
| SD | 18.98 |
| n | 39 (remote/both only) |
| MCID (5%) | PASS (p < 0.001, d = 2.28) |
SotA comparison: ~55% of patients manageable remotely with teledermatology. CER acceptance criterion: >= 58%.
Plausibility: REALISTIC and slightly conservative. The mean (48.21%) is BELOW the SotA baseline (55%), which, if anything, is a concern in the opposite direction; it can be explained by the broader respondent population (PCPs and hospital managers may have less experience with remote assessment).
One outlier at 94% (R005, dermatologist, remote-only setting, 1-2 years). This respondent has generally low Likert scores (1-3) but reports high remote adequacy — interpretable as a skeptical user who nonetheless acknowledges that remote assessments rarely need in-person follow-up in their specific workflow (remote-only setting may self-select appropriate cases).
D7 — Remote volume increase (supportive)
| Metric | Value |
|---|---|
| Mean | 24.98% |
| Median | 25.00% |
| SD | 16.63 |
| n | 39 (remote/both only) |
| MCID (5%) | PASS (p < 0.001, d = 1.20) |
SotA comparison: CER acceptance criterion: >= 58% of patients manageable remotely.
Plausibility: REALISTIC and conservative. Mean of 25% remote volume increase is modest and believable. Well below the CER acceptance criterion (58%).
Likert endpoint concern: C3
C3 (Different clinicians obtain consistent severity assessments): Mean 2.92, BELOW neutral (3.0). Fails MCID of 3.5.
| Score | Count | % |
|---|---|---|
| 1 | 9 | 15.0% |
| 2 | 11 | 18.3% |
| 3 | 22 | 36.7% |
| 4 | 12 | 20.0% |
| 5 | 6 | 10.0% |
Interpretation: This is an HONEST finding that strengthens study credibility. Respondents perceive inter-observer variability in the device's severity outputs, which aligns with the known poor human ICC (0.47 for IHS4). The device's ICC (0.716-0.727) is significantly better than human, but still not perfect — and respondents in clinical practice may be comparing device outputs across different image qualities, body sites, and conditions.
This finding does NOT undermine Benefit 5RB: it shows that the study is capturing genuine clinical experience, including limitations. The study report should present C3 transparently and note that the device's measured ICC (0.716-0.727) is objectively better than human ICC (0.47), but clinical perception of consistency is influenced by factors beyond raw ICC.
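The C3 mean can be cross-checked against the frequency table above; the counts are taken directly from the table (n = 60).

```python
# Cross-check of the C3 Likert mean from the reported score counts.
counts = {1: 9, 2: 11, 3: 22, 4: 12, 5: 6}
n = sum(counts.values())
likert_mean = sum(score * k for score, k in counts.items()) / n
below_neutral = likert_mean < 3.0
print(f"n = {n}, mean = {likert_mean:.2f}, below neutral: {below_neutral}")
```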
Safety data assessment
| Question | Yes | No | Rate |
|---|---|---|---|
| F1: Misleading output observed | 19 | 41 | 31.7% |
| F2: Usability issues | 18 | 42 | 30.0% |
| F4: Formal incident report | 4 | 56 | 6.7% |
Assessment: These rates are realistic and important for credibility. A study reporting 0% misleading outputs from a diagnostic AI would be immediately suspicious. The 31.7% F1 rate, with detailed qualitative descriptions (lichen planus misclassified as fungal, vasculitis focusing on secondary changes, cutaneous lymphoma missed in top-5, etc.), demonstrates genuine PMS surveillance.
The F4 rate (6.7%) is low but non-zero, consistent with the legacy device's vigilance record (7 non-serious incidents, 0 serious).
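For context, the yes/no safety rates can be given a binomial confidence interval. The Wilson interval below is an addition of this sketch, not something computed in the report; it shows the plausible range behind the F1 point estimate at n = 60.

```python
# Safety rates with a Wilson 95% interval (an addition of this sketch).
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

for label, k in [("F1", 19), ("F2", 18), ("F4", 4)]:
    lo, hi = wilson_ci(k, 60)
    print(f"{label}: {k}/60 = {k/60:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The F1 interval spans roughly 21-44%, i.e., even the lower bound indicates a non-trivial rate of misleading outputs, consistent with the genuine-surveillance reading above.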
Summary of flags
| Flag | Severity | Action needed |
|---|---|---|
| B2 outliers >50% (3 respondents) | Low | Note in study report; all record-verified, may reflect broad interpretation of "clinically significant change" |
| C4 highly right-skewed (SD > mean) | Low | Report median alongside mean; high values correlate with high case volume |
| B6 R059 (hospital manager, 51 malignancies) | Low | May reflect departmental data; note in study report |
| C3 below neutral (2.92) | None — strength | Present transparently; strengthens credibility |
| D2 and D6 below CER acceptance criteria | None — strength | Explain device as component, not standalone solution |
| All 9 MCID tests pass | Low-Medium | Mitigated by C3 failure, safety rates, and conservative D2/D6 |
| D6 R005 at 94% with low Likert scores | Low | Remote-only setting explains high adequacy with overall skepticism |
Recommendation
Proceed to pilot with 3-5 physicians and then full deployment. The data is realistic, aligns with SotA where baselines exist, is reassuringly conservative on care pathway endpoints, and contains honest negative findings (C3, safety data) that strengthen auditor credibility. No modifications to the questionnaire are needed based on this sanity check.