
Response

The following addresses each of the four observations raised in relation to lines 1 (AI-RISK-001), 16 (AI-RISK-016), and 21 (AI-RISK-021) from R-TF-028-011 (AI Risk Assessment).

1. Severity justification: why severity 4 (not 5)

The maximum severity assigned to AI-RISK-001 and AI-RISK-016 is 4 ("Critical — Delayed serious entity identification") on the AI risk scale. We have updated R-TF-028-011 to document the explicit rationale for this rating. The justification rests on three regulatory and clinical grounds:

  • ISO 14971:2019 Annex C guidance on clinical decision support: The device is a clinical decision support system (CDSS). The healthcare professional always makes the final diagnostic decision, and the device output is one input among many (patient history, physical examination, dermoscopy, clinical experience). For a device error to result in patient death (severity 5), all of the following independent barriers would need to fail simultaneously: the device misclassifies the condition; the six binary safety indicators (malignant, pre-malignant, associated with malignancy, pigmented lesion, urgent referral, high-priority referral) fail to flag the presentation; the clinician does not exercise independent judgment; and the applicable standard of care (biopsy for suspected malignancy) is not followed.

  • MDCG 2020-1 Valid Clinical Association: The device's outputs are scientifically associated with actual dermatological conditions, established through systematic literature review (R-TF-015-011 State of the Art). This means device errors are errors in probabilistic ranking, not random failures. The clinical barriers to harm are corroborated by the device's own performance data: Top-3 sensitivity of 0.9032 for melanoma and AUC of 0.97 for overall malignancy detection (R-TF-015-003, Summary of Clinical Benefits Achievement). A device with this discrimination performance does not routinely produce rankings where the binary safety indicators simultaneously fail.

  • MEDDEV 2.7.1 Rev 4 Annex A7.3 performance data: The device presents the physician with a prioritised short-list (typically Top-5 categories). Even if the correct diagnosis is not ranked first, it is typically within the short-list. The probability distribution architecture ensures that high-risk presentations are flagged through both the ICD ranking and the independent binary indicators.

This severity assessment accounts for foreseeable misuse, including over-reliance on the device by less experienced clinicians. Even under such conditions, severity 4 holds: the binary safety indicators and the probability distribution output architecture provide automated safety barriers that function independently of the clinician's judgment quality. Furthermore, we have documented in R-TF-028-011 the risk impact of a hypothetical change to severity 5: AI-RISK-001 residual RPN would change from 4 to 5 (still Acceptable), and AI-RISK-016 residual RPN would change from 8 to 10 (Tolerable, no change in acceptability category). This sensitivity analysis confirms that the severity 4 rating does not mask an unacceptable risk — the conclusion holds regardless of the severity assignment.
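The sensitivity analysis above can be reproduced with a short sketch. This assumes the conventional RPN formula (severity × likelihood) and uses hypothetical acceptability thresholds chosen only so that the categories match the values stated above; the actual thresholds are those defined in R-TF-013-003.

```python
# Sensitivity check of the severity-4 vs severity-5 rating (illustrative sketch).
# Assumptions NOT taken from R-TF-013-003: RPN = severity x likelihood, and the
# two acceptability thresholds below are hypothetical placeholders.

ACCEPTABLE_MAX = 6   # hypothetical upper bound of the "Acceptable" band
TOLERABLE_MAX = 12   # hypothetical upper bound of the "Tolerable" band

def rpn(severity: int, likelihood: int) -> int:
    return severity * likelihood

def acceptability(rpn_value: int) -> str:
    if rpn_value <= ACCEPTABLE_MAX:
        return "Acceptable"
    if rpn_value <= TOLERABLE_MAX:
        return "Tolerable"
    return "Unacceptable"

# Residual likelihoods documented in R-TF-028-011.
residual_likelihood = {"AI-RISK-001": 1, "AI-RISK-016": 2}

for risk_id, likelihood in residual_likelihood.items():
    for severity in (4, 5):
        value = rpn(severity, likelihood)
        print(risk_id, severity, value, acceptability(value))
```

Under these assumptions, AI-RISK-001 yields RPN 4 or 5 (both Acceptable) and AI-RISK-016 yields RPN 8 or 10 (both Tolerable), confirming that the acceptability category is insensitive to the severity 4 vs 5 choice.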

2. Design mitigations and verification of effectiveness

We have updated R-TF-028-011 to explicitly categorise the mitigation measures for each risk into three types: design mitigations (runtime features in the deployed device), process mitigations (development and validation procedures), and information mitigations (IFU content and user training).

For AI-RISK-001 (dataset not representative of intended use population):

  • Design mitigations: The six binary safety indicators operate independently of the ICD classification and are derived from a dermatologist-defined mapping matrix (R-TF-028-004). These indicators flag high-risk presentations (malignant, pre-malignant, urgent referral) regardless of whether the underlying probability distribution is affected by dataset representativity limitations. The probability distribution output format is classified as an information mitigation (see below) because it communicates uncertainty to the clinician rather than actively blocking or filtering the harm pathway.
  • Process mitigations: Multi-source data collection strategy, stratified sampling across Fitzpatrick phototypes, bias analysis across demographic subgroups (R-TF-028-005 AI Development Report).
  • Information mitigations: The normalised probability distribution output format (presenting a ranked list of possibilities rather than a single assertion) communicates inherent uncertainty and requires the clinician to consider multiple differential possibilities before acting. The IFU (Important Safety Information, § Population and performance variability) states that performance may vary across skin phototypes and demographic subgroups, and instructs the clinician to exercise particular judgment for underrepresented populations.
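To illustrate the independence of the binary safety indicators from the ICD ranking, the following sketch shows how indicators could be derived from a probability distribution via a class-to-indicator mapping matrix. The category names, mapping entries, and 0.5 threshold are hypothetical placeholders; the actual matrix is the dermatologist-defined one in R-TF-028-004.

```python
# Illustrative sketch: deriving binary safety indicators from an ICD probability
# distribution through a class-to-indicator mapping matrix. All mapping entries
# and the 0.5 threshold are hypothetical, not taken from R-TF-028-004.

INDICATORS = ("malignant", "pre-malignant", "associated with malignancy",
              "pigmented lesion", "urgent referral", "high-priority referral")

# Hypothetical mapping: ICD category -> indicators it contributes mass to.
MAPPING = {
    "melanoma":          {"malignant", "pigmented lesion", "urgent referral"},
    "actinic keratosis": {"pre-malignant"},
    "benign nevus":      {"pigmented lesion"},
    "psoriasis":         set(),
}

def derive_indicators(distribution: dict, threshold: float = 0.5) -> dict:
    """Sum the probability mass mapped to each indicator and flag the indicator
    when the aggregate mass reaches the threshold."""
    mass = {ind: 0.0 for ind in INDICATORS}
    for category, prob in distribution.items():
        for ind in MAPPING.get(category, ()):
            mass[ind] += prob
    return {ind: mass[ind] >= threshold for ind in INDICATORS}

# A distribution dominated by a malignant class raises the safety flags even if
# the clinician only glances at the top-ranked ICD category.
flags = derive_indicators({"melanoma": 0.6, "benign nevus": 0.3, "psoriasis": 0.1})
```

Because the flags aggregate mass across all mapped classes, a high-risk presentation can be flagged even when no single malignant category ranks first, which is the barrier property relied on in the severity justification above.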

For AI-RISK-016 (model robustness failures from image acquisition variability):

  • Design mitigation: The Dermatology Image Quality Assessment (DIQA) model provides a runtime quality gate that evaluates each input image and returns a quality score with dimension sub-scores (focus, lighting, framing, resolution). This mitigation is implemented as SRS requirement SRS-Y5W (derived from PRS-7XK) and verified by test cases C50, C62, C68, C73, C77, C106, C329, C370, C371, C454, and C455 in R-TF-012-034 Software Test Description. All test cases passed.
  • Process mitigations: Data augmentation during training, external validation on independent datasets.
  • Information mitigations: The IFU (How to take pictures) provides detailed image acquisition guidance covering lighting, distance, angle, and focus. The IFU (Precautions, risk #9 and #30) instructs users to review image quality information returned by the device.
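The runtime quality-gate behaviour described for AI-RISK-016 can be sketched as follows. The sub-score names mirror the dimensions listed above (focus, lighting, framing, resolution), but the aggregation rule (minimum over dimensions) and the 0.6 pass threshold are assumptions for illustration, not the verified SRS-Y5W implementation.

```python
# Illustrative sketch of a runtime image-quality gate in the spirit of DIQA.
# The min() aggregation and 0.6 threshold are hypothetical assumptions.

from dataclasses import dataclass

@dataclass
class QualityResult:
    score: float               # overall quality in [0, 1]
    sub_scores: dict           # per-dimension scores
    passed: bool               # whether the image may proceed to the classifier

def quality_gate(sub_scores: dict, threshold: float = 0.6) -> QualityResult:
    """Aggregate dimension sub-scores (here: the minimum, so a single bad
    dimension fails the image) and gate the image before inference."""
    overall = min(sub_scores.values())
    return QualityResult(score=overall, sub_scores=sub_scores,
                         passed=overall >= threshold)

# A blurry image fails the gate even when lighting and framing are acceptable.
blurry = quality_gate({"focus": 0.2, "lighting": 0.9,
                       "framing": 0.8, "resolution": 0.95})
```

Gating on the weakest dimension is a deliberately conservative choice for a sketch like this: it guarantees that an image cannot reach the classifier with any single acquisition dimension below the threshold.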

For AI-RISK-021 (usability — outputs not interpretable):

  • Design mitigations: Explainability media (bounding boxes, segmentation masks) allow the clinician to verify the basis of quantitative measurements (SRS-0AB, SRS-K7M; verified by test cases C256 and C265 in R-TF-012-034). The probability distribution output format inherently communicates that results are probabilistic, not definitive.
  • Information mitigations: The IFU (Important Safety Information) contains a prominent non-diagnostic disclaimer: "The device output is not a clinical diagnosis." The IFU (Endpoint specification) explains entropy as a confidence measure and describes how binary indicators are derived.
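The entropy-as-confidence concept referenced in the IFU Endpoint specification can be made concrete with a short sketch. Normalising Shannon entropy by log(n) is an assumption made here so the value lands in [0, 1] regardless of the number of categories; the device's actual entropy presentation is as documented in the IFU.

```python
# Illustrative sketch: Shannon entropy of the probability distribution output as
# a confidence measure. Lower entropy = more concentrated distribution = higher
# confidence. Normalisation by log(n) is an assumption for this sketch.

import math

def normalised_entropy(probs: list) -> float:
    n = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)

confident = normalised_entropy([0.9, 0.05, 0.03, 0.02])   # mass on one class
uncertain = normalised_entropy([0.25, 0.25, 0.25, 0.25])  # mass spread evenly
```

A confident ranking yields a value well below 1, while a uniform distribution yields exactly 1, which is the visual low-vs-high confidence contrast the IFU illustrates.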

The traceability from risk to mitigation to SRS requirement to verification test case follows the approach established in the corrective action for Technical Review N3 (risk mitigation traceability), which was applied systematically to all 62 risks in R-TF-013-002.

3. Occurrence rate estimation basis

We have updated R-TF-028-011 to document the basis for each likelihood estimate using a three-step chain: data source, appraisal method, and derived value.

For AI-RISK-001 (residual likelihood: Very low, 1):

  • Data source: Post-market surveillance data from the equivalent legacy device (R-TF-007-003 PSUR): over 4,500 reports generated across 21 contracts, with 7 non-serious incidents reported during the surveillance period. None of the 7 incidents involved diagnostic error or harm attributable to dataset representativity. Clinical validation studies (BI_2024, PH_2024, SAN_2024, IDEI_2023) included patients across Fitzpatrick phototypes I through IV with no systematic performance degradation observed across subgroups.
  • Appraisal method: PMS data appraised using IMDRF MDCE WG/N56 Appendix F quality criteria as endorsed by MDCG 2020-6 Appendix I.
  • Derived value: The occurrence estimate integrates both data sources: the PMS incident rate of 0.16% over 4+ years with zero diagnostic-error-related incidents provides the real-world baseline, while the clinical validation studies provide independent confirmation that no systematic performance degradation occurs across subgroups. The combined evidence supports a residual likelihood of "Very low" (1) on the AI risk scale.
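The 0.16% incident rate cited above follows directly from the PMS figures. The report count is stated as "over 4,500", so 4,500 is used here as the conservative denominator.

```python
# Arithmetic behind the 0.16% PMS incident rate: 7 non-serious incidents over
# roughly 4,500 reports (4,500 used as the conservative denominator, since the
# source states "over 4,500").

incidents = 7
reports = 4_500

rate = incidents / reports           # ~0.001556
rate_pct = round(rate * 100, 2)      # 0.16 (%)
```

Using a larger, more accurate denominator would only lower the rate, so 0.16% is an upper bound on the real-world incident rate over the surveillance period.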

For AI-RISK-016 (residual likelihood: Low, 2):

  • Data source: PMS data (as above, zero incidents attributable to image quality failure). DIQA quality gate verification (SRS-Y5W, 11 test cases — all passed). Clinical validation studies used images captured under real-world conditions with varying acquisition quality.
  • Appraisal method: Same as AI-RISK-001.
  • Derived value: The DIQA quality gate provides a runtime barrier, but borderline images may pass and yield degraded performance. The residual likelihood of "Low" (2) reflects the residual possibility of borderline-quality images affecting output, mitigated but not eliminated by the quality gate.

For AI-RISK-021 (residual likelihood: Low, 2):

  • Data source: Summative usability evaluation (R-TF-025-007, October 2025, n=36). HCP Scenario 3 Q4 achieved 72.2% success on understanding that the device output is not a diagnosis; one use error and three close calls were observed in Scenario 3, while the simulated use scenarios achieved 100% task success.
  • Appraisal method: Assessed per IEC 62366-1:2015 §5.9 residual risk assessment methodology, documented in R-TF-025-007 §14.7.
  • Derived value: The 72.2% Q4 success rate and 1 observed use error support a residual likelihood of "Low" (2), reflecting the possibility that some users may not fully appreciate the non-diagnostic nature of the output. This is mitigated by the prominent IFU disclaimer, the probabilistic output format, and the requirement for clinical judgment in all use cases.

4. Residual risk communication in the IFU

We have updated R-TF-028-011 to map each residual risk to the specific IFU section where it is communicated to users.

For AI-RISK-001 (dataset representativity, residual severity 4, RPN 4):

  • IFU, Important Safety Information, § Population and performance variability: States that performance may vary across Fitzpatrick skin phototypes, age groups, and geographic populations. Instructs clinicians to exercise particular judgment for underrepresented populations (phototypes V-VI, paediatric, geriatric).
  • IFU, Important Safety Information, § Understanding the device output: States the device produces probabilistic outputs that represent a range of possibilities, not a single conclusion.

For AI-RISK-016 (image acquisition variability, residual severity 4, RPN 8):

  • IFU, How to take pictures: Provides detailed guidance on lighting (even illumination, avoid harsh shadows, use flash if needed), distance (10-30 cm), angle (perpendicular to skin surface), and focus (tap-to-focus). Includes a common issues table (motion blur, glare, shadows, poor background) with solutions.
  • IFU, Precautions, risk #9: States that image artefacts and resolution affect device performance and instructs users to review image quality information returned by the device.
  • IFU, Precautions, risk #30: Addresses inadequate lighting specifically.

For AI-RISK-021 (usability — outputs not interpretable, residual severity 3, RPN 6):

  • IFU, Important Safety Information, § The device does not provide a clinical diagnosis: Prominent warning that the device output is clinical decision support information, not a diagnosis.
  • IFU, Important Safety Information, § Understanding the device output: Explains that the device produces a probability distribution, cannot confirm the presence of a condition, and outputs a range of possibilities.
  • IFU, Endpoint specification, § Binary Indicators: Explains how the six binary safety indicators are derived from the probability distribution.
  • IFU, Endpoint specification, § Entropy: Explains entropy as a confidence measure, with visual examples of low vs. high confidence distributions.
  • IFU, Troubleshooting (Clinical): Provides interpretation guidance for severity measurements, clinical sign scores, and the Top-5 accuracy approach.

Cross-reference to R-TF-013-002

The three risks in R-TF-028-011 reviewed by BSI transfer to corresponding risks in the main safety risk register (R-TF-013-002), where they are assessed on the main safety severity scale (Critical = 5 = Death; Serious = 4 = Permanent impairment). The relationship is documented in R-TF-028-011. The two risk documents use different severity scales appropriate to their respective risk domains: the AI risk scale is calibrated to AI-specific failure modes (delayed identification, degraded performance), while the main safety scale is calibrated to direct patient harm outcomes. This is consistent with the risk management framework defined in R-TF-013-003 and ISO 14971:2019, which permits domain-specific risk assessment scales within an integrated risk management system.

Red-lined versions of R-TF-028-011 and R-TF-013-002 are provided as supplementary evidence.

All the information contained in this QMS is confidential. The recipient agrees not to transmit or reproduce the information, neither by himself nor by third parties, through whichever means, without obtaining the prior written permission of Legit.Health (AI Labs Group S.L.)