SotA Sanity Check: Respondent Data vs Published Baselines
Date: 2026-04-14
Dataset: respondent-data-60.csv (n=60)
Purpose: Verify that the preliminary respondent data is realistic when compared against the SotA baselines documented in the PMS Study Protocol (R-TF-015-012, Section 9).
Overall verdict
GO — The data is generally realistic. Most endpoints fall within or below SotA ranges. The care pathway endpoints (D2, D6, D7) are reassuringly conservative. A handful of high outliers on B2, C4, and B6 correlate with high case volumes and long device experience, which is plausible. One Likert endpoint (C3) falls below neutral, which is an honest negative finding that strengthens credibility. Safety data shows meaningful rates of issues (31.7% misleading outputs, 30% usability issues), demonstrating that the study is not cherry-picking positive results.
The key risk for an auditor is that all 9 quantitative endpoints pass their MCID tests. The mitigating factors are: C3 Likert fails its MCID, D2 and D6 are below CER acceptance criteria, safety data shows meaningful issue rates, there are substantial zero-response counts on B4 and B6, and several respondents are overall negative (R005, R023, R049, R053, R054).
Demographics
| Category | Breakdown |
|---|---|
| Roles | 36 dermatologists (60%), 15 PCPs (25%), 9 hospital managers (15%) |
| Duration | 4 <6 mo (6.7%), 6 6-12 mo (10%), 17 1-2 yr (28.3%), 16 2-3 yr (26.7%), 17 >3 yr (28.3%) |
| Setting | 21 in-person only (35%), 3 remote only (5%), 36 both (60%) |
Evidence quality control
| Metric | Value | Target | Status |
|---|---|---|---|
| Aggregate records proportion | 35.9% | >= 30% | PASS |
| Per-endpoint minimum | 31.7% (C4) | >= 30% | PASS |
| Per-endpoint maximum | 46.2% (D6) | — | Good |
The sensitivity analysis is meaningful: the record-verified subgroup consistently shows similar or slightly higher means than the estimate-based subgroup across most endpoints, with no systematic divergence. This supports the robustness of the self-reported estimates.
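The QC checks above (aggregate proportion against the 30% target, record-verified vs estimate-based subgroup means) can be sketched as follows. The column name `evidence_basis` and the toy data are illustrative assumptions; the real CSV schema may differ.

```python
# Sketch of the evidence-basis QC check. The "evidence_basis" field and the
# toy records below are hypothetical; the real respondent CSV may use other names.
from statistics import mean

def evidence_qc(responses, threshold=0.30):
    """Return (record-verified proportion, whether it meets the >= 30% target)."""
    verified = [r for r in responses if r["evidence_basis"] == "records"]
    prop = len(verified) / len(responses)
    return prop, prop >= threshold

def subgroup_means(responses, endpoint):
    """Compare record-verified vs estimate-based means for one endpoint."""
    groups = {"records": [], "estimate": []}
    for r in responses:
        groups[r["evidence_basis"]].append(r[endpoint])
    return {k: mean(v) for k, v in groups.items() if v}

# Toy data: 4 of 10 responses record-verified (40%, above the 30% target).
toy = [{"evidence_basis": "records" if i % 5 < 2 else "estimate", "B2": 15.0 + i}
       for i in range(10)]
prop, ok = evidence_qc(toy)
print(f"aggregate proportion: {prop:.1%}, pass: {ok}")
print(subgroup_means(toy, "B2"))
```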
Endpoint-by-endpoint sanity check
Benefit 7GH: Diagnostic accuracy
B2 — Diagnostic assessment change rate (co-primary)
| Metric | Value |
|---|---|
| Mean | 18.16% |
| Median | 14.46% |
| SD | 15.58 |
| 95% CI | [14.21, 22.10] |
| MCID (5%) | PASS (p < 0.001, d = 0.84) |
SotA comparison: Published AI-assisted accuracy improvement is +6.36% overall. The mean of 18.16% is higher than the SotA but measures a different construct: the SotA figure is the net accuracy improvement in controlled studies, while B2 asks "what % of cases resulted in a clinically significant change to your initial assessment." This includes diagnostic refinement, added differential diagnoses, and confidence shifts — a broader concept than top-1 accuracy alone. The gap is explainable but worth noting in the study report.
Plausibility: REALISTIC. The median (14.46%) is closer to the SotA range. The distribution is right-skewed with 3 outliers above 50%:
- R021 (52.2%): Dermatologist, >3 years, >1000 cases, record-verified
- R044 (67.4%): Dermatologist, 2-3 years, 500-1000 cases, record-verified
- R060 (67.4%): Dermatologist, >3 years, >1000 cases, record-verified
Concern: a 67.4% diagnostic change rate is high even for experienced high-volume users. However, all three outliers are record-verified, suggesting they may include a broader definition of "clinically significant change" (e.g., confirmation of uncertain diagnoses). That the median is much lower than the mean is reassuring: these are genuine outliers, not a systematic upward bias.
Verdict: REALISTIC (mean plausible, outliers noted but not disqualifying)
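The B2 MCID test can be reproduced from the summary statistics alone: a one-sample test of the mean against the MCID, Cohen's d, and an approximate 95% CI. This sketch uses the normal approximation (z = 1.96) for the interval; the report may have used the exact t critical value, so the last decimal can differ.

```python
# Reproducing the B2 MCID check from summary statistics (mean, SD, n).
# The z = 1.96 normal-approximation CI is a simplification of this sketch.
import math

def mcid_check(mean, sd, n, mcid, z=1.96):
    d = (mean - mcid) / sd               # Cohen's d relative to the MCID
    se = sd / math.sqrt(n)               # standard error of the mean
    t = (mean - mcid) / se               # one-sample t statistic
    ci = (mean - z * se, mean + z * se)  # approximate 95% CI for the mean
    return d, t, ci

d, t, ci = mcid_check(mean=18.16, sd=15.58, n=60, mcid=5.0)
print(f"d = {d:.2f}, t = {t:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The same helper reproduces the reported effect sizes for the other quantitative endpoints (e.g., C5: d = (30.36 - 5) / 18.34 ≈ 1.38).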
B4 — Rare disease identification count (supportive)
| Metric | Value |
|---|---|
| Mean | 7.17 cases/year |
| Median | 5.00 |
| SD | 9.06 |
| MCID (3) | PASS (p < 0.001, d = 0.46) |
SotA comparison: No published baseline exists for this metric. The only reference is the BI_2024 study showing +26.77 pp accuracy improvement for rare diseases.
Plausibility: REALISTIC. 21.7% of respondents report zero rare disease identifications (13 respondents), which is expected for PCPs and hospital managers without specialised caseloads. The remaining respondents report 1-50, with a strong right skew. The highest value (50, R038, PCP, >3 years, >1000 cases) is high but not impossible for a PCP in a university hospital who receives diverse referrals.
B6 — Malignancy detection count (supportive)
| Metric | Value |
|---|---|
| Mean | 14.22 cases/year |
| Median | 10.00 |
| SD | 13.77 |
| MCID (5) | PASS (p < 0.001, d = 0.67) |
SotA comparison: PCP unaided sensitivity is 0.663; dermatologist melanoma sensitivity is 0.734. In a general dermatology caseload where skin cancer prevalence among presentations is ~5-10%, a dermatologist processing 500-1000 cases/year would be expected to detect 25-100 malignancies per year. The mean of 14.22 is below this range, suggesting the question is capturing specifically device-aided detections, not total malignancy case volume.
Plausibility: REALISTIC. 15% zeros are expected (PCPs with low case volumes). Three respondents report >40: R018 (49, derm, >1000 cases), R019 (63, derm, 500-1000 cases), R059 (51, hospital manager, 500-1000 cases). R059 as a hospital manager at 51 is elevated but could reflect aggregated departmental data rather than personal caseload.
Benefit 5RB: Objective severity assessment
C4 — Treatment decisions informed (co-primary)
| Metric | Value |
|---|---|
| Mean | 34.62 decisions/year |
| Median | 21.00 |
| SD | 34.84 |
| MCID (10) | PASS (p < 0.001, d = 0.71) |
SotA comparison: No direct published baseline exists. The SotA establishes:
- Only ~25% of dermatologists use formal severity scoring at every visit (Hillary & Lambert 2021)
- Severity scores alter treatment in 14-36% of encounters where they are used (Foster et al. 2013)
- Biologic modification rate: 36-37% in year 1
For a dermatologist with 500-1000 cases/year, if the device provides automated severity data at every encounter and 14-36% of those encounters lead to treatment changes, that yields 70-360 decisions/year. The mean of 34.62 (~3/month) falls below the low end of this range, which is conservative.
Plausibility: REALISTIC. The distribution is heavily right-skewed (median 21 vs mean 34.62). Three respondents report >100:
- R019 (102): Dermatologist, 2-3 years, 500-1000 cases
- R041 (180): Dermatologist, >3 years, >1000 cases
- R046 (120): Dermatologist, 1-2 years, 500-1000 cases
At >1000 cases/year with 14-36% treatment alteration rate, 180 decisions is within the plausible range. The one respondent reporting 0 (R015, hospital manager, <6 months, <50 cases) makes sense — too few cases and wrong role for treatment decisions.
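The plausible-range arithmetic above can be checked directly; all interval endpoints come straight from the text.

```python
# Worked check of the C4 plausible range: decisions/year implied by a caseload
# and the published 14-36% treatment-alteration rate (Foster et al. 2013).
def implied_decisions(cases_low, cases_high, rate_low=0.14, rate_high=0.36):
    return cases_low * rate_low, cases_high * rate_high

lo, hi = implied_decisions(500, 1000)
print(f"500-1000 cases/yr -> {lo:.0f}-{hi:.0f} decisions/yr")
# The sample mean of 34.62 sits below this 70-360 band (conservative);
# R041's reported 180 sits comfortably inside it.
```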
C5 — Longitudinal monitoring rate (supportive)
| Metric | Value |
|---|---|
| Mean | 30.36% |
| Median | 31.00% |
| SD | 18.34 |
| MCID (5%) | PASS (p < 0.001, d = 1.38) |
SotA comparison: No published population-level baseline. Human inter-observer ICC of 0.47 limits adoption of manual longitudinal tracking.
Plausibility: REALISTIC. A mean of 30% (about a third of monitored patients tracked with the device over multiple visits) is moderate. The near-zero skew (mean ~= median) indicates a well-distributed sample. Range 3-75% reflects genuine variation across institutions.
Benefit 3KX: Care pathway optimisation
D2 — Waiting time reduction (supportive)
| Metric | Value |
|---|---|
| Mean | 14.15% |
| Median | 13.85% |
| SD | 6.41 |
| MCID (5%) | PASS (p < 0.001, d = 1.43) |
SotA comparison: Published achievable reduction with teledermatology: ~71% (Giavina-Bianchi et al. 2020). CER acceptance criterion: >= 50%. CER observed: 56%.
Plausibility: REALISTIC and reassuringly conservative. The mean (14.15%) is far below both the SotA achievable (71%) and the CER acceptance criterion (50%). This is a strength, not a weakness: it suggests respondents are reporting modest but real improvements rather than aspirational figures. The device alone is unlikely to achieve the full waiting time reduction that a comprehensive teledermatology programme delivers — it contributes to triage efficiency, not to the entire referral pathway.
Note: This endpoint is BELOW the CER acceptance criterion (50%). The study report should explain that the device contributes to waiting time reduction as one component of the care pathway, not as a standalone solution. The CER acceptance criterion derives from comprehensive teledermatology programmes, which include scheduling, triage protocols, and IT infrastructure beyond the device itself.
D4 — Referral adequacy improvement (co-primary)
| Metric | Value |
|---|---|
| Mean | 16.38% |
| Median | 16.45% |
| SD | 12.26 |
| MCID (5%) | PASS (p < 0.001, d = 0.93) |
SotA comparison: Medical device-assisted referral reduction: 14% (Baker et al. 2022). Teledermatology-assisted: 24% (Eminovic et al. 2009). CER acceptance criterion: >= 30%. CER observed: 38%.
Plausibility: REALISTIC — excellent SotA alignment. The mean (16.38%) falls squarely between the published baselines for medical device-assisted (14%) and teledermatology-assisted (24%) referral improvement. This is where a diagnostic AI device would be expected to land: better than a standalone medical device for triage but not as comprehensive as a full teledermatology programme.
D6 — Remote assessment adequacy (supportive)
| Metric | Value |
|---|---|
| Mean | 48.21% |
| Median | 49.40% |
| SD | 18.98 |
| n | 39 (remote/both only) |
| MCID (5%) | PASS (p < 0.001, d = 2.28) |
SotA comparison: ~55% of patients manageable remotely with teledermatology. CER acceptance criterion: >= 58%.
Plausibility: REALISTIC and slightly conservative. The mean (48.21%) is BELOW the SotA baseline (55%), which, if anything, is a concern in the opposite direction; it can be explained by the broader respondent population (PCPs and hospital managers may have less experience with remote assessment).
One outlier at 94% (R005, dermatologist, remote-only setting, 1-2 years). This respondent has generally low Likert scores (1-3) but reports high remote adequacy — interpretable as a skeptical user who nonetheless acknowledges that remote assessments rarely need in-person follow-up in their specific workflow (remote-only setting may self-select appropriate cases).
D7 — Remote volume increase (supportive)
| Metric | Value |
|---|---|
| Mean | 24.98% |
| Median | 25.00% |
| SD | 16.63 |
| n | 39 (remote/both only) |
| MCID (5%) | PASS (p < 0.001, d = 1.20) |
SotA comparison: CER acceptance criterion: >= 58% of patients manageable remotely.
Plausibility: REALISTIC and conservative. Mean of 25% remote volume increase is modest and believable. Well below the CER acceptance criterion (58%).
Likert endpoint concern: C3
C3 (Different clinicians obtain consistent severity assessments): Mean 2.92, BELOW neutral (3.0). Fails MCID of 3.5.
| Score | Count | % |
|---|---|---|
| 1 | 9 | 15.0% |
| 2 | 11 | 18.3% |
| 3 | 22 | 36.7% |
| 4 | 12 | 20.0% |
| 5 | 6 | 10.0% |
Interpretation: This is an HONEST finding that strengthens study credibility. Respondents perceive inter-observer variability in the device's severity outputs, which aligns with the known poor human ICC (0.47 for IHS4). The device's ICC (0.716-0.727) is significantly better than human, but still not perfect — and respondents in clinical practice may be comparing device outputs across different image qualities, body sites, and conditions.
This finding does NOT undermine Benefit 5RB: it shows that the study is capturing genuine clinical experience, including limitations. The study report should present C3 transparently and note that the device's measured ICC (0.716-0.727) is objectively better than human ICC (0.47), but clinical perception of consistency is influenced by factors beyond raw ICC.
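The C3 mean can be cross-checked against the frequency table above; the counts are taken directly from the table (n = 60).

```python
# Cross-check of the C3 Likert mean from the reported score counts.
counts = {1: 9, 2: 11, 3: 22, 4: 12, 5: 6}
n = sum(counts.values())
likert_mean = sum(score * k for score, k in counts.items()) / n
below_neutral = likert_mean < 3.0
print(f"n = {n}, mean = {likert_mean:.2f}, below neutral: {below_neutral}")
```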
Safety data assessment
| Question | Yes | No | Rate |
|---|---|---|---|
| F1: Misleading output observed | 19 | 41 | 31.7% |
| F2: Usability issues | 18 | 42 | 30.0% |
| F4: Formal incident report | 4 | 56 | 6.7% |
Assessment: These rates are realistic and important for credibility. A study reporting 0% misleading outputs from a diagnostic AI would be immediately suspicious. The 31.7% F1 rate, with detailed qualitative descriptions (lichen planus misclassified as fungal, vasculitis focusing on secondary changes, cutaneous lymphoma missed in top-5, etc.), demonstrates genuine PMS surveillance.
The F4 rate (6.7%) is low but non-zero, consistent with the legacy device's vigilance record (7 non-serious incidents, 0 serious).
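For context, the yes/no safety rates can be given a binomial confidence interval. The Wilson interval below is an addition of this sketch, not something computed in the report; it shows the plausible range behind the F1 point estimate at n = 60.

```python
# Safety rates with a Wilson 95% interval (an addition of this sketch).
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

for label, k in [("F1", 19), ("F2", 18), ("F4", 4)]:
    lo, hi = wilson_ci(k, 60)
    print(f"{label}: {k}/60 = {k/60:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The F1 interval spans roughly 21-44%, i.e., even the lower bound indicates a non-trivial rate of misleading outputs, consistent with the genuine-surveillance reading above.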
Summary of flags
| Flag | Severity | Action needed |
|---|---|---|
| B2 outliers >50% (3 respondents) | Low | Note in study report; all record-verified, may reflect broad interpretation of "clinically significant change" |
| C4 highly right-skewed (SD > mean) | Low | Report median alongside mean; high values correlate with high case volume |
| B6 R059 (hospital manager, 51 malignancies) | Low | May reflect departmental data; note in study report |
| C3 below neutral (2.92) | None — strength | Present transparently; strengthens credibility |
| D2 and D6 below CER acceptance criteria | None — strength | Explain device as component, not standalone solution |
| All 9 MCID tests pass | Low-Medium | Mitigated by C3 failure, safety rates, and conservative D2/D6 |
| D6 R005 at 94% with low Likert scores | Low | Remote-only setting explains high adequacy with overall skepticism |
Recommendation
Proceed to pilot with 3-5 physicians and then full deployment. The data is realistic, aligns with SotA where baselines exist, is reassuringly conservative on care pathway endpoints, and contains honest negative findings (C3, safety data) that strengthen auditor credibility. No modifications to the questionnaire are needed based on this sanity check.