Response: Item 2B
Introduction
In response to the observations regarding the clarity, traceability, and justification of clinical benefit, performance, and safety outcomes, we have performed a comprehensive update of our technical documentation. This update specifically addresses the derivation of acceptance criteria from the State of the Art (SotA), justifies specific thresholds, clarifies metric definitions, explains data pooling and indication labels, and provides the safety benchmarking against the SotA.
1. Traceability of Acceptance Criteria to State of the Art (SotA)
We acknowledge the observation that the analytical link between the reported SotA baselines and the acceptance criteria was not sufficiently explicit, and that the rationale for selecting specific articles and similar devices was not fully detailed. To address this, we have updated R-TF-015-003 (Clinical Evaluation Report) and significantly expanded R-TF-015-011 (State of the Art).
- Rationale for Selection (Why they were chosen): In the SotA document (R-TF-015-011), we have added a new subsection, "Rationale for the Selection of Articles and Similar Devices". This explicitly details that articles were not chosen arbitrarily, but prioritized based on their clinical relevance (e.g., evaluating human-in-the-loop performance, which matches our intended use) and methodological quality (prioritizing meta-analyses and MRMC pivotal trials). It also clarifies that similar devices (such as DermaSensor, SkinVision, and ModelDerm) were included because their FDA/CE-marked status establishes the current technological and competitive benchmark for acceptable benefit-risk profiles.
- Systematic Derivation & Deep Analysis: We moved beyond summarizing articles by adding a detailed "Data Pooling and Statistical Analysis" methodology and a "Clinical Domains and Traceability to SotA" section in both the CER and SotA documents. These sections provide a direct chain of evidence, explicitly linking each Clinical Claim to the specific subset of chosen SotA Articles, explaining the statistical synthesis performed (meta-analysis or weighted average), the Derived SotA Baseline, and the final Acceptance Criterion. This approach ensures that the benchmarks are statistically grounded and directly traceable to the highest-quality clinical evidence, rather than being simple summaries.
- Pooling Methodology: We have documented the pooling methodology in the CER (R-TF-015-003, section "Data Pooling Methodology"). Aggregate performance metrics (globalValueOfDevice) are calculated using a weighted average formula: Sigma(achievedValue x sampleSize) / Sigma(sampleSize). The pooled studies were explicitly evaluated for clinical comparability and homogeneity. They present populations representative of real-world clinical practice in both primary care and dermatology consultations, ensuring results are applicable to the intended population and supporting the generalizability of the findings.
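The sample-size-weighted pooling formula above can be sketched as follows. This is a minimal illustration only: the `achievedValue`/`sampleSize` field names come from the formula in the CER, and the study entries are hypothetical placeholders, not data from the clinical investigations.

```python
def pool_metric(studies):
    """Pool a performance metric across studies, weighting each study
    by its sample size:
    globalValueOfDevice = Sigma(achievedValue x sampleSize) / Sigma(sampleSize)
    """
    total_weighted = sum(s["achievedValue"] * s["sampleSize"] for s in studies)
    total_samples = sum(s["sampleSize"] for s in studies)
    if total_samples == 0:
        raise ValueError("cannot pool studies with zero total sample size")
    return total_weighted / total_samples

# Illustrative example: three hypothetical studies reporting Top-1 accuracy.
studies = [
    {"achievedValue": 0.60, "sampleSize": 100},
    {"achievedValue": 0.70, "sampleSize": 300},
    {"achievedValue": 0.50, "sampleSize": 100},
]
print(pool_metric(studies))  # 0.64
```

Larger studies pull the pooled value toward their result, which is the intended behaviour when the pooled studies have been assessed as clinically comparable and homogeneous.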
2. Justification of Acceptance Criteria
The following justifies the acceptance criteria values that were noted as appearing low. These explanations have been incorporated into the CER (R-TF-015-003):
- Alopecia Severity Assessment (Cohen's Kappa >= 0.6): There is a recognized lack of specific literature addressing inter-observer agreement for pathological severity assessment of Female Androgenetic Alopecia. In the absence of disease-specific benchmarks, we adopted the Landis & Koch framework, which is the gold standard for interpreting the Cohen's Kappa metric. In this framework, 0.41-0.60 represents "Moderate" agreement. Given the inherent subjectivity of visual severity scales in alopecia, a threshold of kappa >= 0.6 (Substantial agreement) is a rigorous and clinically acceptable benchmark for a medical device intended to standardize assessments.
- Diagnostic Accuracy in Rare Diseases (Accuracy >= 54%): Skin rare diseases present a significant challenge due to low incidence and high misdiagnosis rates. An acceptance criterion of 54% represents a significant documented clinical benefit over unaided HCPs. On average, for both dermatologists and PCPs, the use of the device resulted in a 26.77% increase in Top-1 diagnostic accuracy, a 25.56% increase in sensitivity, and a 23.50% increase in specificity for rare diseases (based on pivotal studies BI 2024 and PH 2024). This represents a meaningful improvement in diagnostic precision for complex cases where biopsy remains the current alternative.
- Teledermatology Referral Outcomes (Sensitivity Improvement >= 30%): The 30% figure does not represent the total sensitivity, but rather the improvement in the detection of cases requiring referral when using the device, compared to an unaided baseline (which in our studies was 0% for remote detection of specific referral criteria). This represents a clinically meaningful documented enhancement of primary care physician performance during teledermatology consultations.
- Expert Panel Alignment (Majority Vote >= 75%): Methodological literature for expert consensus does not set a single universal threshold; however, an agreement of >= 75% is frequently considered a substantial or optimal majority consensus in clinical validation. This threshold ensures the device aligns with the consolidated judgment of a qualified expert panel, providing a robust reference standard for performance evaluation.
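As an illustration of how the unweighted Cohen's Kappa cited in the alopecia criterion is computed and mapped to the Landis & Koch interpretation bands, the following is a minimal self-contained sketch; the rating sequences are hypothetical examples, not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters over the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Expected chance agreement, from each rater's marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Map kappa to the Landis & Koch (1977) interpretation bands."""
    bands = [(0.00, "poor"), (0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical severity ratings (e.g., 1-3 scale) from two raters.
k = cohens_kappa([1, 2, 2, 3, 3, 3, 1, 2], [1, 2, 2, 3, 2, 3, 1, 1])
print(round(k, 3), landis_koch(k))  # 0.628 substantial
```

Note that under Landis & Koch, values strictly above 0.60 fall in the "Substantial" band, which is what the kappa >= 0.6 acceptance threshold targets.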
3. Metric Definitions and Terminology
To ensure clarity for clinical reviewers and users, and to explicitly address the rationale behind the chosen metrics, we have expanded the metric definitions and indication labels:
- Metric Rationale and Relevance: We have added explicit definitions for Top-1, Top-3/Top-5 accuracy, AUC, Sensitivity, Specificity, PPV, NPV, ICC, Unweighted Kappa, and Experts' Consensus to the Glossary of the CER (R-TF-015-003), explicitly detailing why they were chosen and how they are relevant to the intended purpose:
- Top-1 Accuracy: Represents the "exact match" performance. Chosen to benchmark the algorithm's absolute precision against the single primary diagnosis made by clinicians.
- Top-3 / Top-5 Accuracy: Reflect the real-world clinical workflow of formulating a differential diagnosis. High Top-3/5 accuracy ensures the correct diagnosis is presented among the suggestions, prompting the HCP to consider it.
- AUC, Sensitivity, and Specificity: AUC demonstrates core discriminative power independent of thresholds; Sensitivity and Specificity demonstrate safety and utility at the clinical operating point.
- PPV and NPV: Chosen to evaluate the reliability of positive and negative findings respectively, quantifying the probability that the device's output correctly reflects the patient's true state.
- Intraclass Correlation Coefficient (ICC) and Unweighted Kappa: Chosen to evaluate the consistency and inter-rater agreement between the device's quantitative/categorical severity assessments and expert clinical judgment.
- Experts' Consensus (Majority Vote): Chosen to establish a robust reference standard for complex cases where individual expert opinions may vary.
- Efficiency Metrics: Explicit definitions for Reduction in Cumulative Waiting Time and Reduction in Unnecessary Referrals were added to quantify the systemic impact of the device on healthcare workflows.
- Multiple Conditions Clarification: We have added a clarification in the CER (R-TF-015-003, section "Clarification on Multiple conditions"). The indication label "Multiple conditions" does not refer to an unspecified group of diseases. It reflects a broad, representative inclusion aligned with the diverse ICD-11 categories evaluated in the respective clinical studies, mirroring the device's intended diagnostic scope.
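The metric definitions above follow their standard formulations; as a minimal illustrative sketch (with hypothetical diagnosis lists and labels, not study data), Top-k accuracy and the confusion-matrix metrics can be computed as:

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of cases where the true diagnosis appears among the
    top-k ranked suggestions (Top-1 = exact match; Top-3/Top-5 reflect
    a differential-diagnosis workflow)."""
    hits = sum(truth in preds[:k]
               for preds, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV from binary labels (1 = positive).
    Illustration only: assumes each denominator is non-zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # reliability of positive findings
        "npv": tn / (tn + fn),          # reliability of negative findings
    }

# Hypothetical example: two cases with ranked differential diagnoses.
ranked = [["melanoma", "nevus", "bcc"], ["psoriasis", "eczema", "acne"]]
truths = ["nevus", "acne"]
print(top_k_accuracy(ranked, truths, 1))  # 0.0
print(top_k_accuracy(ranked, truths, 3))  # 1.0
```

The contrast between Top-1 and Top-3 in the example mirrors the rationale stated above: a correct diagnosis that is not the first suggestion can still be surfaced within the differential for the HCP to consider.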
4. Clinical Safety Outcomes and Benchmarking
To clarify how safety rates are established based on SotA and similar devices, and to provide the requested traceability, we have expanded both the CER (R-TF-015-003) and the SotA document (R-TF-015-011):
- Quantitative SotA Analysis: We have added a new section to the SotA document, "Hazards and Safety Rates of AI-Guided Medical Devices", which performs a deep, quantitative analysis of safety outcomes (adverse events, false negatives, and technical failures) reported in vigilance databases and literature for similar devices. This provides the direct source and justification for the benchmark safety rates used.
- Safety Benchmarking: The CER now includes a section, "Safety Benchmarking against State of the Art", which presents a direct comparison between our observed safety outcomes (0 incidents in investigations with >800 patients) and the benchmarks derived in the SotA analysis.
- Justification of Relevance: We have added specific "Relevance to Acceptance Criteria" descriptions for each safety hazard in the SotA document. This explains how market error rates (e.g., the 5-10% false negative rates of similar AI tools) were analyzed to justify our device's stringent safety objectives, accounting for the human-in-the-loop clinical workflow. This ensures that our safety acceptance criteria are both clinically relevant and appropriate for a Class IIb device.
5. Use Environment vs. Remote Care
Regarding the observation that clinical benefit 0ZC (remote diagnosis/referral) seems to contradict the use environment stated in §14 of the CEP, we clarify that there is no contradiction. The apparent conflict arises from conflating the IT deployment environment with the clinical workflow modality:
- Use Environment (IT Deployment Context): The text stating that the device is "intended to be used in the setting of healthcare organisations" describes where the software runs—specifically, as an API integrated into a healthcare organisation's IT infrastructure.
- Remote Care (Clinical Workflow Modality): A clinician reviewing images remotely (e.g., via teleconsultation) while accessing the device through their organisation's systems is using the device "in the setting of healthcare organisations." Teledermatology is a standard clinical workflow that operates entirely within the stated IT use environment. No changes to the intended purpose or use environment text are required.
6. Navigability and Evidence Synthesis
To address the concern regarding the large number of individual performance claims (~148) and provide a coherent view of the evidence base, we have added a "Summary of Clinical Benefits Achievement" table in the CER (R-TF-015-003). This aggregate view demonstrates that the device has successfully achieved all defined goals across the seven clinical domains. The observed aggregate magnitudes are summarized as follows:
- Improved Diagnostic Accuracy (7GH): Achieved a +18.5% aggregate increase in Top-1 accuracy across all HCP tiers (Acceptance Criterion: >= 15%). Supported by 70 aggregated claims (e.g., MRT, 9D7, ZKC...) derived from studies: BI_2024, IDEI_2023, MC_EVCDAO_2019, PH_2024, SAN_2024.
- Reduced Waiting Times (3KX): Achieved a 56% reduction in cumulative waiting time (Acceptance Criterion: >= 50%). Supported by 14 aggregated claims (ZGP, RND, 3BD, NVT, VCT, KPQ, 1M1, UGS, IP4, WOI, V2J, WL4, LYP, 8MV) derived from studies: COVIDX_EVCDAO_2022, DAO_Derivación_PH_2022, DAO_Derivación_O_2022, PH_2024, SAN_2024.
- Optimized Referral Prioritization (8PL): Achieved a 38% reduction in unnecessary referrals (Acceptance Criterion: >= 30%). Supported by 8 aggregated claims (DCH, DZC, CST, 6H0, H4U, 04D, D62, 8H5) derived from studies: DAO_Derivación_O_2022.
- Accuracy in Malignancy Detection (1QF): Achieved an aggregate AUC of 0.97 (Acceptance Criterion: >= 0.90). Supported by 20 aggregated claims (EAC, DX7, LU4, R9P, 0L2, 7ZI, FIQ, GS5, 6EP, PZD, V2U, 6U1, JFM, ZM8, 4JY, 9OD, BRI, VFY, 9G4, Z96) derived from studies: DAO_Derivación_PH_2022, DAO_Derivación_O_2022, IDEI_2023, MC_EVCDAO_2019.
- Accuracy in Rare Diseases (9VW): Achieved 54.8% aggregate Top-1 accuracy (Acceptance Criterion: >= 54%). Supported by 24 aggregated claims (DII, KOQ, NK7, DR7, 0I1, WAM, JBB, ERK, 8PG, DIK, 99Y, 8QZ, 4KO, S03, TG6, OR5, Q2D, MM8, I7Y, Z90, 6YW, REV, 5W2, CH0) derived from studies: BI_2024, PH_2024.
- Severity Assessment Support (5RB): Achieved an ICC of 0.727 for severity assessment (Acceptance Criterion: >= 0.72). Supported by 9 aggregated claims (LL5, SDP, 3OA, EZ1, JWQ, A1Q, 284, 3OB, 7TS) derived from studies: AIHS4_2025, COVIDX_EVCDAO_2022, IDEI_2023.
- Remote Care Capacity (0ZC): Achieved a +30% improvement in referral sensitivity compared to unaided remote baselines (Acceptance Criterion: >= 30%). Supported by 5 aggregated claims (P30, LHF, 4BO, WOI, WL4) derived from studies: COVIDX_EVCDAO_2022, DAO_Derivación_O_2022, PH_2024, SAN_2024.
By presenting these aggregate outcomes, we frame the ~148 detailed performance claims as robust supporting evidence for a clear and unified clinical benefit case.
Summary of Changes
- R-TF-015-011 (State of the Art): Added the Methodology for Establishing Acceptance Criteria.
- R-TF-015-003 (Clinical Evaluation Report):
- Added the "Acceptance Criteria Derivation from State of the Art" section with detailed literature derivation mappings.
- Added the "Summary of Clinical Benefits Achievement" table to provide a coherent aggregate view of the evidence.
- Added the "Data Pooling Methodology" and "Clarification on Multiple conditions" sections.
- Added the "Safety Benchmarking against State of the Art" section comparing safety outcomes to similar devices from vigilance databases.
- Updated the Glossary with definitions for Top-1, Top-3, and Top-5 accuracy.
- IFU: Updated the Glossary with definitions for Top-1, Top-3, and Top-5 accuracy.