Research and planning
This page is for internal planning only. It will not be included in the final response to BSI.
What BSI is asking
BSI reviewed risk R-DAG ("The medical device outputs a wrong result") in the Risk Management Record (R-TF-013-002) and found four implemented mitigations listed:
- Information about device outputs are detailed in the IFU.
- The medical device returns metadata about the output that helps supervising it, such as explainability media and other metrics.
- The device returns an interpretative distribution representation of possible ICD categories, not just one single condition.
- AI models undergo retraining using expanded dataset of images.
BSI then cross-referenced these mitigations with the "Mitigation or Control Requirement(s)" and "Verification of implementation of risk control measures" columns, checking against the software requirements (R-TF-012-034) and test descriptions. They could not find corresponding requirements or test evidence that clearly address explainability, interpretive distributions, retraining, or IFU information about device outputs.
BSI also flags: "It is unclear if other risks are similarly impacted" — implying they suspect a systemic traceability gap.
Underlying regulatory concern: EN ISO 14971:2019 requires a complete, verifiable traceability chain for risk controls. The specific sub-clauses BSI is testing:
| ISO 14971 sub-clause | Requirement | How it applies here |
|---|---|---|
| 7.2 | Risk control measures shall be implemented and their implementation verified | The core issue — traceability from mitigation → requirement → test must be demonstrable |
| 7.4 | Benefit-risk analysis for residual risks | Corrected traceability must not change the benefit-risk conclusion |
| 7.6 | Completeness of risk control | The "other risks" audit addresses whether risk control is complete across the register |
BSI's cited GSPRs map as follows:
| GSPR | Requirement | Relevance to N3 |
|---|---|---|
| GSPR 1 | Devices shall achieve intended performance and be suitable for their intended purpose | The mitigations (explainability, distributions, IFU) ensure the device output supports HCP decision-making as intended |
| GSPR 4 | Manufacturers shall establish and maintain a risk management system per Annex I §3 | The traceability chain (risk → control → requirement → verification) is a core element of this system |
| GSPR 17.2 | Diagnostic devices shall provide sufficient accuracy, precision, and stability | The ICD probability distribution and explainability media are the mechanisms by which accuracy/precision are communicated to the HCP |
BSI also cites Annex II documentation requirements:
| Annex II section | What it requires | How it applies |
|---|---|---|
| 5(b) | Description and justification of residual risks | R-TF-013-002 must demonstrate that residual risks are acceptable after controls are verified |
| 6.1(a)/(b) | Evidence of GSPR compliance (tests, clinical data, etc.) | The verification test cases are the evidence — they must clearly map to the mitigations |
| 6.2(f) | Risk analysis including risk control measures | The complete traceability chain in R-TF-013-002 fulfils this requirement |
What BSI is NOT saying: They are not saying the mitigations are unimplemented. They are saying they could not find the traceability evidence linking mitigations → requirements → tests. This is a documentation/traceability gap, not necessarily an implementation gap.
Root cause diagnosis
The central issue is that R-DAG's mitigationRequirements field contains the same SRS codes as its causeRequirements field — these are infrastructure/API codes, not the codes that implement the actual mitigations:
| Field | SRS codes | What they cover |
|---|---|---|
| causeRequirements | SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS | API port listening, HTTP status codes, JSON format, authentication, clinical params endpoint, URL versioning |
| mitigationRequirements (SRS part) | SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS | Identical to cause codes |
| mitigationRequirements (LR part) | LR-4XK, LR-9WR, LR-4RZ, LR-8YN | IFU read instruction, output interpretation guidance, warnings/precautions, HCP supervision |
The test cases in verificationOfImplementation (C106, C454, C455, C50, C62, C68, C73, C77) all map to those infrastructure SRS codes — they verify HTTP status codes, JSON format, authentication, and API versioning. None of them verify explainability, probability distributions, or AI outputs. This is why BSI found them irrelevant.
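To make the pattern concrete, here is a simplified sketch of what the R-DAG record currently contains. The field names and codes are taken from R-TF-013-002 as quoted above; the surrounding JSON layout is an assumption for illustration, not a copy of the actual record:

```json
{
  "id": "R-DAG",
  "causeRequirements": ["SRS-7PJ", "SRS-AQM", "SRS-BYJ", "SRS-DW0", "SRS-D3N", "SRS-LBS"],
  "mitigationRequirements": [
    "SRS-7PJ", "SRS-AQM", "SRS-BYJ", "SRS-DW0", "SRS-D3N", "SRS-LBS",
    "LR-4XK", "LR-9WR", "LR-4RZ", "LR-8YN"
  ],
  "verificationOfImplementation": "C106, C454, C455, C50, C62, C68, C73, C77"
}
```

The SRS portion of mitigationRequirements is a verbatim copy of the cause list; only the LR codes add mitigation-specific traceability.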
The actual SRS codes and test cases that implement and verify the mitigations do exist but were never linked to R-DAG. The analysis below is mitigation by mitigation.
Mitigation-by-mitigation analysis
Mitigation 1: "Information about device outputs are detailed in the IFU"
Status: Implemented. Traceability incomplete.
What exists in the IFU:
The IFU contains comprehensive documentation of all device output fields:
| IFU section | Path | What it covers |
|---|---|---|
| User Interface (device outputs) | apps/eu-ifu-mdr/versioned_docs/version-1.1.0.0/installation-manual/user-interface.mdx | Full JSON output structure: probability distributions (conclusions array), entropy scores (0-100 with thresholds), explainability media (explainabilityMedia field), clinical indicators, severity scores, image quality |
| Clinical troubleshooting | apps/eu-ifu-mdr/versioned_docs/version-1.1.0.0/troubleshooting/clinical.mdx | How to interpret interpretive distributions, entropy as uncertainty measure, top-5 accuracy approach, explainability media for understanding AI reasoning |
| JSON output example | apps/eu-ifu-mdr/src/components/AnonymousDiagnosticReport/_anonymous_diagnostic_report_json.mdx | Complete JSON output specimen with all explainability fields populated |
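For orientation, a simplified sketch of the output fields the IFU documents. The field names (conclusions, entropy, explainabilityMedia) come from the IFU description above; the values and exact nesting here are illustrative only — the authoritative specimen is the JSON example referenced in the table:

```json
{
  "conclusions": [
    { "name": "ICD category A", "probability": 0.62 },
    { "name": "ICD category B", "probability": 0.21 }
  ],
  "entropy": 34,
  "explainabilityMedia": { "contentType": "image/png", "data": "<Base64 heat map>" }
}
```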
LR requirements correctly listed in R-DAG:
- LR-9WR (Device outputs interpretation guidance): Explains probability distribution format, entropy scores, heat maps, clinical indicator meanings
- LR-4RZ (Warnings and precautions): Warns that outputs support (not replace) clinical judgment; requires review of explainability media
- LR-8YN (Device supervision requirement): Mandates HCP supervision; final diagnostic decisions remain with HCP
- LR-4XK (Read the IFU before use): Directs users to the complete IFU
Gap: BSI notes that "none of the tests appear to verify information about device outputs in the IFU." This overlaps with M2 Q2, which also flags that labeling requirements verification evidence could not be found. The LR codes in R-DAG are the correct mitigation references, but the verification chain for labeling requirements is incomplete. Our M2 response will establish the LR verification chain; N3 can cross-reference it.
Corrective action: No change needed to R-DAG's mitigation requirements for this item (LR codes are correct). The labeling verification gap is addressed systemically in M2 Q2.
Mitigation 2: "The medical device returns metadata about the output that helps supervising it, such as explainability media and other metrics"
Status: Implemented and verified. Traceability broken — wrong SRS codes and test cases referenced in R-DAG.
SRS requirements that implement this mitigation (exist but NOT listed in R-DAG):
| SRS code | Title | What it requires |
|---|---|---|
| SRS-0AB | Generate per-image ICD analysis with explainability heat map | For each image, generate: ICD category probabilities + explainability object with Base64-encoded heat map (heatMap), its contentType, and title |
| SRS-K7M | Orchestrate diagnosis support workflow | Generate pixel-level attention indicators (heat maps or saliency masks) that highlight image regions most influential to each predicted category |
Note: SRS-Q9M (Clinical Signs Analysis Endpoint) was considered but excluded. SRS-Q9M covers the POST /clinical-signs-analysis severity assessment endpoint, which is a different analysis pathway from the ICD diagnosis workflow. R-DAG's risk is specifically about the ICD interpretive distribution, so only SRS codes directly implementing the ICD pathway should be referenced to keep traceability tight and defensible.
Test cases that verify this mitigation (exist but NOT listed in R-DAG):
| Test ID | Case ID | Title | What it verifies | SRS |
|---|---|---|---|---|
| T123 | C256 | Verify response includes per-image ICD probabilities and heat maps for top five categories | explanation.attentionMap objects, colour model data, Base64-encoded image data | SRS-0AB |
| T132 | C265 | Verify diagnosis workflow returns ranked ICD-11 codes, binary indicators, and explainability maps | Entropy of result, pixel-level attention indicators (heat maps/saliency masks) for top-5 conclusions | SRS-K7M |
What is currently in R-DAG instead: SRS-7PJ (API port listening), SRS-AQM (HTTP status codes), etc., verified by C50 (accepts HTTP requests), C62 (returns 200), etc. — entirely unrelated to explainability.
Corrective action: Add SRS-0AB, SRS-K7M to mitigationRequirements. Add C256 (T123), C265 (T132) to verificationOfImplementation.
Mitigation 3: "The device returns an interpretative distribution representation of possible ICD categories, not just one single condition"
Status: Implemented and verified. Traceability broken — wrong SRS codes and test cases referenced in R-DAG.
SRS requirements that implement this mitigation (exist but NOT listed in R-DAG):
| SRS code | Title | What it requires |
|---|---|---|
| SRS-Q3Q | Generate an aggregated ICD probability distribution from a set of images | Return a normalized probability distribution across all ICD categories (not a single diagnosis). Each element contains: calculated probability, official ICD code, display name, system identifier, and version |
| SRS-K7M | Orchestrate diagnosis support workflow | Compute normalized probability vector across all supported ICD-11 categories (sum = 100%). Generate top-5 ranked output with ICD-11 codes and confidence scores |
Test cases that verify this mitigation (exist but NOT listed in R-DAG):
| Test ID | Case ID | Title | What it verifies | SRS |
|---|---|---|---|---|
| T122 | C255 | Verify API returns aggregated ICD probability distribution with structured code details | hypotheses array with numeric probability fields, valid ICD-11 code structures, distribution across all categories | SRS-Q3Q |
| T132 | C265 | Verify diagnosis workflow returns ranked ICD-11 codes, binary indicators, and explainability maps | Top-5 ranked ICD-11 categories, probability sum = 100% across full distribution, entropy, five binary indicators | SRS-K7M |
Additionally, the AI Models Integration Tests (T307-T379, C466-C539) verify that each individual AI model produces correct probability_distribution outputs and icd_distribution data with entropy scores and top-5 predictions — providing model-level evidence that the interpretive distribution is generated correctly at every layer of the system.
Corrective action: Add SRS-Q3Q, SRS-K7M to mitigationRequirements. Add C255 (T122), C265 (T132) to verificationOfImplementation. Consider referencing the AI Models Integration Tests (T307-T379) as additional model-level verification evidence.
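The distribution properties these test cases describe can be sketched in a few lines. This is a minimal sketch of the checks attributed to C255/C265 above (normalized distribution, top-5 ranking, entropy-based uncertainty); the helper itself and the hypotheses field shape are assumptions for illustration, not code from the test suite:

```python
import math

def check_icd_distribution(hypotheses, tol=1e-6):
    """Check the properties C255/C265 describe: a normalized probability
    vector across ICD categories, a top-5 ranking, and an entropy score."""
    probs = [h["probability"] for h in hypotheses]
    assert all(0.0 <= p <= 1.0 for p in probs), "probabilities must lie in [0, 1]"
    assert abs(sum(probs) - 1.0) < tol, "full distribution must sum to 100%"
    # Top-5 ranked categories by descending probability
    top5 = sorted(hypotheses, key=lambda h: h["probability"], reverse=True)[:5]
    # Shannon entropy scaled to 0-100 (100 = all categories equally likely)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    entropy_score = 100.0 * h / math.log(len(probs))
    return top5, entropy_score
```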
Mitigation 4: "AI models undergo retraining using expanded dataset of images"
Status: This is a prospective lifecycle/process control, not a software feature. It has no software-level traceability because it should not have any.
This mitigation is fundamentally different from mitigations 1-3. It is not something the device software does at runtime — it is something the organisation does as part of its AI lifecycle management. It is:
- Defined in GP-028 AI Development, § AI Updates → Retraining: "Retraining is performed when an algorithm's core logic or data foundation is modified. This includes training on new or updated data, implementing a new model architecture, or changing key parameters/hyperparameters."
- Documented via R-TF-028-007 AI Retraining Report (mandatory output of any retraining)
- Governed by GP-024 PCCP (Predetermined Change Control Plan), which classifies retraining as a minor or major AI model version change
- Verified through R-TF-028-010 AI V&V Checks (mandatory verification before any retrained model is released)
- Monitored via GP-028 post-market surveillance provisions, which feed back into retraining decisions
Relevant documents:
| Document | Path |
|---|---|
| GP-028 AI Development | apps/qms/docs/procedures/GP-028/index.mdx |
| GP-024 PCCP | apps/qms/docs/procedures/GP-024/index.mdx |
| T-028-007 AI Retraining Report template | apps/qms/docs/procedures/GP-028/Templates/T-028-007.mdx |
| R-TF-028-010 AI V&V Checks (v1.1.0.0) | apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/artificial-intelligence/r-tf-028-010-aiml-vv-checks.mdx |
Important distinction — prospective vs. completed: No retraining has been performed for v1.1.0.0 (no completed R-TF-028-007 record exists). Retraining is a prospective control: it will be triggered when PCCP criteria are met (e.g., post-market data indicating performance drift, new training data available). The mitigation statement in R-DAG should therefore be reworded to reflect this accurately:
- Current wording (misleading): "AI models undergo retraining using expanded dataset of images."
- Proposed wording: "AI models are subject to retraining under expanded datasets as governed by GP-028 (§ AI Updates → Retraining) and GP-024 (PCCP), with verification through R-TF-028-010 (AI V&V Checks) before any retrained model is released."
This wording honestly describes the control without implying retraining has already occurred for this version.
Gap: The risk management record currently references only software test cases in verificationOfImplementation. There is no mechanism to reference process-level controls. The retraining mitigation has no explicit traceability at all in R-TF-013-002.
Corrective action:
- Reword the mitigation statement in implementedMitigations to use the proposed wording above.
- Add a reference to GP-028 (§ AI Updates → Retraining), GP-024 (PCCP), and R-TF-028-010 (AI V&V Checks) in verificationOfImplementation. This requires extending the verification text to include process-level references alongside test case references.
- In the response to BSI, explicitly explain that retraining is a lifecycle control verified through QMS process adherence, not through runtime software tests, and that it is a prospective control governed by PCCP.
"It is unclear if other risks are similarly impacted" — Systematic audit results
BSI explicitly asks whether other risks have the same traceability gap. A systematic audit of all 62 risks in R-TF-013-002 was performed, checking three criteria:
- Whether mitigationRequirements SRS codes are just copies of causeRequirements (rather than codes implementing the actual mitigations)
- Whether verificationOfImplementation test cases verify the mitigation requirements (not just the cause requirements)
- Whether process-level controls (e.g. retraining) have any traceability at all
Audit findings summary
29 out of 62 risks have some form of the traceability gap BSI identified in R-DAG. They fall into three categories:
Category A: Identical cause/mitigation codes with infrastructure-only verification (21 risks) — CRITICAL
These risks have mitigationRequirements SRS codes identical to causeRequirements — no additional mitigation codes were added. Their verification test cases only cover infrastructure (API port, HTTP status codes, JSON format, authentication, versioning). This is the exact pattern BSI flagged in R-DAG.
Infrastructure/API group (cause = SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS):
| Risk ID | Risk name | Mitigation type | Gap |
|---|---|---|---|
| R-T8Q | Data transmission failure from HCP system | Error handling + availability | No SRS codes for error handling or availability mitigations |
| R-3N5 | Data input failure | Error handling + availability | Same as R-T8Q |
| R-YF4 | Data accessibility failure | Error handling + availability | Same as R-T8Q |
| R-LRP | Data transmission failure | Error messages + FHIR | No LR codes for FHIR IFU documentation |
| R-MWD | Interruption of service | Elastic scaling, backups, REST | No SRS/LR codes for scaling or backup mitigations |
| R-OM1 | Data overwrite | REST protocol immutability | Architectural argument, no distinct mitigation code |
| R-B63 | Inconsistent or unreliable output | Algorithm V&V with representative datasets | Process-level (GP-012), no requirement code |
| R-VL1 | Device failure or performance degradation | Elastic scaling + error messages | No SRS for auto-scaling; no LR for error messaging |
| R-72D | SOUP anomaly/incompatibility | Careful SOUP analysis | Process-level mitigation, no requirement trace |
| R-MQ1 | SOUP not maintained nor patched | SOUP monitoring and patching | Process-level mitigation, no requirement trace |
Regulatory/GSPR group:
| Risk ID | Risk name | Mitigation type | Gap |
|---|---|---|---|
| R-QLF | Non-compliance with GSPR | Develop per harmonised standards | Process-level, no SRS/LR trace |
| R-ES8 | Absence of risk management process | ISO 14971 implementation | Process-level, no SRS/LR trace |
| R-C6Q | Absence of PMS & PMCF process | PMS/PMCF plans | Process-level, no SRS/LR trace |
| R-27M | Inadequate maintenance | Maintenance plan | Process-level, no SRS/LR trace |
| R-HH0 | Electronic data tampered | OAuth/JWT, encryption, SSL/TLS | Security SRS codes exist (SRS-1KW, SRS-WER, SRS-SDZ, SRS-WGF) but are NOT referenced |
| R-9SS | SOUP cybersecurity vulnerabilities | SOUP analysis + design review | Process-level, no requirement code |
| R-33B | Electronic IFU tampered | GPG signed commits, RBAC, branch approvals | Toolchain controls, no product-level SRS/LR codes |
AI/ML group:
| Risk ID | Risk name | Mitigation type | Gap |
|---|---|---|---|
| R-GY6 | Inaccurate training data | Careful image selection, hired HCPs | Process-level, no requirement trace |
| R-7US | Biased or incomplete training data | Same as R-GY6 | Same gap |
| R-75L | Stagnation of model performance | Plan for retraining, data augmentation | Process-level, no requirement trace |
| R-PWK | Degradation of model performance | Manual retraining, data augmentation | Process-level, no requirement trace |
Category B: Retraining mitigation with no traceability (5 risks) — HIGH
These risks include "AI models undergo retraining" as an implemented mitigation but have no corresponding requirement code or process-level verification reference:
| Risk ID | Risk name | Mitigation wording | Additional issue |
|---|---|---|---|
| R-DAG | Wrong result (ICD distribution) | "AI models undergo retraining using expanded dataset of images" | The original BSI finding |
| R-75H | Incorrect clinical information | "AI models undergo retraining using expanded dataset of images" | Same infrastructure-only verification as R-DAG |
| R-SKK | Incorrect results shown to patient | "AI models undergo retarining [sic] using expanded dataset of images" | Typo: "retarining" → "retraining" |
| R-75L | Stagnation of model performance | "We plan for re-training during the design and development process" | Also in Category A |
| R-PWK | Degradation of model performance | "we plan for exclusively manual retraining" | Also in Category A |
Category C: Risks with better traceability (not impacted)
R-BDR (Misinterpretation of data returned by the device) was initially suspected but appears better traced than R-DAG. It adds LR codes (LR-4XK, LR-9WR, LR-8HV, LR-5TG) beyond the cause codes, and its verification test set (C368, C369, C373, C374, etc.) includes FHIR-specific tests, not just the generic infrastructure set. However, R-BDR should still be reviewed to confirm its LR verification chain is complete.
The remaining 33 risks either have no mitigations (risks accepted without control), have correctly differentiated mitigation codes, or have mitigations whose traceability is appropriate.
How to report this to BSI
The response should:
- Acknowledge that the audit found additional risks with the same traceability pattern
- Categorise the findings: (a) risks where mitigation codes need correction, (b) risks where process-level controls need traceability references
- State that all affected risks have been corrected in the updated R-TF-013-002 (red-lined version provided)
- Note the R-SKK typo correction as part of the update
- Confirm that risks not in these categories were verified as correctly traced
Relationship with other NCs
| NC | Overlap with N3 | How to handle in N3 response |
|---|---|---|
| M2 Q2 | Labeling requirements (LR-XXX) verification gap. The LR codes in R-DAG are correct, but the verification evidence for labeling requirements is also questioned in M2. Our M2 response establishes the LR verification chain. | N3 should state: "The LR codes (LR-4XK, LR-9WR, LR-4RZ, LR-8YN) are the correct mitigation references for this item. These labeling requirements are verified against the IFU content as documented in R-TF-012-037; the complete verification evidence for labeling requirements is provided in our response to M2 Q2." This makes N3 self-contained while avoiding duplication. |
| M1 Q4 | BSI found that response.json for test T377 was missing icd_distribution and top_5_predictions keys. This relates directly to mitigations 2 and 3 of R-DAG (probability distribution, ICD categories). | N3 should note that the AI Models Integration Tests (T307-T379) provide model-level verification evidence for ICD distributions, and reference M1 Q4 for the detailed explanation of the test evidence format. |
Response strategy
Approach: Acknowledge the traceability gap, demonstrate the implementations exist, provide corrected documentation, and report the results of a systematic audit of all risks.
The response to BSI should:
- Acknowledge that BSI correctly identified a traceability gap in R-TF-013-002 for R-DAG, per ISO 14971:2019 clause 7.2 (verification of implementation of risk control measures)
- Provide a mitigation-by-mitigation mapping for R-DAG showing: mitigation statement → SRS/LR requirement(s) → test case(s) → result, demonstrating compliance with ISO 14971:2019 clause 7.2 and Annex II 6.1(b)
- Explain that "retraining" is a prospective lifecycle control governed by GP-028 and GP-024 (PCCP), which will be verified through R-TF-028-010 (AI V&V Checks) before any retrained model is released — not through runtime software tests. Cite ISO 14971:2019 clause 7.2 note on risk control measures that may include "inherent safety by design, protective measures, or information for safety"
- Confirm that the retraining mitigation statement has been reworded to accurately reflect its prospective nature
- State that R-TF-013-002 has been updated with correct traceability for R-DAG (red-lined version provided), satisfying Annex II 6.2(f)
- Report audit results: A systematic audit of all 62 risks identified 29 risks with analogous traceability gaps (21 with identical cause/mitigation codes, 5 with untraced retraining mitigations, plus overlap). All have been corrected in the updated R-TF-013-002. This addresses ISO 14971:2019 clause 7.6 (completeness of risk control)
- Confirm that the benefit-risk analysis conclusions in R-TF-013-002 are unchanged by the traceability corrections, per ISO 14971:2019 clause 7.4
- Cross-reference M2 Q2 for the labeling requirements verification chain, while keeping N3 self-contained
- Reference GSPR 1 (intended performance), GSPR 4 (risk management system), and GSPR 17.2 (diagnostic accuracy) to tie corrective actions back to the cited requirements
Decision: infrastructure SRS codes in R-DAG
Decision: Keep existing infrastructure codes AND add the correct mitigation codes (Option C).
Rationale: The infrastructure codes (SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS) provide the foundational transport layer through which the clinical outputs are delivered. While they do not directly implement the mitigations BSI flagged, removing them could be seen as overcorrection and BSI has not asked for their removal. The correct approach is to add the missing mitigation-specific codes (SRS-Q3Q, SRS-0AB, SRS-K7M) alongside the existing infrastructure codes, making the traceability chain complete.
Corrective actions summary
| # | Action | What to change | File | Status |
|---|---|---|---|---|
| 1 | Add correct SRS codes to R-DAG mitigationRequirements | Add SRS-Q3Q, SRS-0AB, SRS-K7M | R-TF-013-002.json | To do |
| 2 | Add correct test cases to R-DAG verificationOfImplementation | Add C255 (T122), C256 (T123), C265 (T132); reference AI Models Integration Tests (T307-T379) as additional model-level evidence | R-TF-013-002.json | To do |
| 3 | Add process-level traceability for retraining | Reference GP-028 (§ AI Updates → Retraining), GP-024 (PCCP), R-TF-028-010 in verificationOfImplementation | R-TF-013-002.json | To do |
| 4 | Reword retraining mitigation statement | Change from present-tense "undergo" to prospective "are subject to" wording | R-TF-013-002.json | To do |
| 5 | Fix R-SKK typo | "retarining" → "retraining" | R-TF-013-002.json | To do |
| 6 | Correct all 29 audited risks | For each: add correct mitigation codes, add process-level references where applicable, verify test case mapping | R-TF-013-002.json | To do |
| 7 | Add security SRS codes to R-HH0 | Add SRS-1KW, SRS-WER, SRS-SDZ, SRS-WGF (exist but not referenced) | R-TF-013-002.json | To do |
| 8 | Keep infrastructure SRS codes in R-DAG | Do NOT remove SRS-7PJ, SRS-AQM, etc. — add alongside, not replace | R-TF-013-002.json | Decision made |
| 9 | Generate red-lined R-TF-013-002 PDF | For BSI submission | Export from QMS | To do |