Research and planning
This page is for internal planning only. It will not be included in the final response to BSI.
What BSI is asking
BSI reviewed risk R-DAG ("The medical device outputs a wrong result") in the Risk Management Record (R-TF-013-002) and found four implemented mitigations listed:
- Information about device outputs are detailed in the IFU.
- The medical device returns metadata about the output that helps supervising it, such as explainability media and other metrics.
- The device returns an interpretative distribution representation of possible ICD categories, not just one single condition.
- AI models undergo retraining using expanded dataset of images.
BSI then cross-referenced these mitigations with the "Mitigation or Control Requirement(s)" and "Verification of implementation of risk control measures" columns, checking against the software requirements (R-TF-012-034) and test descriptions. They could not find corresponding requirements or test evidence that clearly address explainability, interpretive distributions, retraining, or IFU information about device outputs.
BSI also flags: "It is unclear if other risks are similarly impacted" — implying they suspect a systemic traceability gap.
Underlying regulatory concern: EN ISO 14971:2019 requires a complete, verifiable traceability chain for risk controls. The specific sub-clauses BSI is testing:
| ISO 14971 sub-clause | Requirement | How it applies here |
|---|---|---|
| 7.2 | Risk control measures shall be implemented and their implementation verified | The core issue — traceability from mitigation → requirement → test must be demonstrable |
| 7.4 | Benefit-risk analysis for residual risks | Corrected traceability must not change the benefit-risk conclusion |
| 7.6 | Completeness of risk control | The "other risks" audit addresses whether risk control is complete across the register |
BSI's cited GSPRs map as follows:
| GSPR | Requirement | Relevance to N3 |
|---|---|---|
| GSPR 1 | Devices shall achieve intended performance and be suitable for their intended purpose | The mitigations (explainability, distributions, IFU) ensure the device output supports HCP decision-making as intended |
| GSPR 4 | Manufacturers shall establish and maintain a risk management system per Annex I §3 | The traceability chain (risk → control → requirement → verification) is a core element of this system |
| GSPR 17.2 | Diagnostic devices shall provide sufficient accuracy, precision, and stability | The ICD probability distribution and explainability media are the mechanisms by which accuracy/precision are communicated to the HCP |
BSI also cites Annex II documentation requirements:
| Annex II section | What it requires | How it applies |
|---|---|---|
| 5(b) | Description and justification of residual risks | R-TF-013-002 must demonstrate that residual risks are acceptable after controls are verified |
| 6.1(a)/(b) | Evidence of GSPR compliance (tests, clinical data, etc.) | The verification test cases are the evidence — they must clearly map to the mitigations |
| 6.2(f) | Risk analysis including risk control measures | The complete traceability chain in R-TF-013-002 fulfils this requirement |
What BSI is NOT saying: They are not saying the mitigations are unimplemented. They are saying they could not find the traceability evidence linking mitigations → requirements → tests. This is a documentation/traceability gap, not necessarily an implementation gap.
Root cause diagnosis
The central issue is that R-DAG's mitigationRequirements field contains the same SRS codes as its causeRequirements field — these are infrastructure/API codes, not the codes that implement the actual mitigations:
| Field | SRS codes | What they cover |
|---|---|---|
| causeRequirements | SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS | API port listening, HTTP status codes, JSON format, authentication, clinical params endpoint, URL versioning |
| mitigationRequirements (SRS part) | SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS | Identical to cause codes |
| mitigationRequirements (LR part) | LR-4XK, LR-9WR, LR-4RZ, LR-8YN | IFU read instruction, output interpretation guidance, warnings/precautions, HCP supervision |
The test cases in verificationOfImplementation (C106, C454, C455, C50, C62, C68, C73, C77) all map to those infrastructure SRS codes — they verify HTTP status codes, JSON format, authentication, and API versioning. None of them verify explainability, probability distributions, or AI outputs. This is why BSI found them irrelevant.
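To make the pattern concrete, here is a simplified sketch of what the R-DAG record currently contains. The field names and codes are taken from R-TF-013-002 as quoted above; the surrounding JSON layout is an assumption for illustration, not a copy of the actual record:

```json
{
  "id": "R-DAG",
  "causeRequirements": ["SRS-7PJ", "SRS-AQM", "SRS-BYJ", "SRS-DW0", "SRS-D3N", "SRS-LBS"],
  "mitigationRequirements": [
    "SRS-7PJ", "SRS-AQM", "SRS-BYJ", "SRS-DW0", "SRS-D3N", "SRS-LBS",
    "LR-4XK", "LR-9WR", "LR-4RZ", "LR-8YN"
  ],
  "verificationOfImplementation": "C106, C454, C455, C50, C62, C68, C73, C77"
}
```

The SRS portion of mitigationRequirements is a verbatim copy of the cause list; only the LR codes add mitigation-specific traceability.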
The actual SRS codes and test cases that implement and verify the mitigations do exist but were never linked to R-DAG. The analysis below is mitigation by mitigation.
Mitigation-by-mitigation analysis
Mitigation 1: "Information about device outputs are detailed in the IFU"
Status: Implemented. Traceability incomplete.
What exists in the IFU:
The IFU contains comprehensive documentation of all device output fields:
| IFU section | Path | What it covers |
|---|---|---|
| User Interface (device outputs) | apps/eu-ifu-mdr/versioned_docs/version-1.1.0.0/installation-manual/user-interface.mdx | Full JSON output structure: probability distributions (conclusions array), entropy scores (0-100 with thresholds), explainability media (explainabilityMedia field), clinical indicators, severity scores, image quality |
| Clinical troubleshooting | apps/eu-ifu-mdr/versioned_docs/version-1.1.0.0/troubleshooting/clinical.mdx | How to interpret interpretive distributions, entropy as uncertainty measure, top-5 accuracy approach, explainability media for understanding AI reasoning |
| JSON output example | apps/eu-ifu-mdr/src/components/AnonymousDiagnosticReport/_anonymous_diagnostic_report_json.mdx | Complete JSON output specimen with all explainability fields populated |
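For orientation, a simplified sketch of the output fields the IFU documents. The field names (conclusions, entropy, explainabilityMedia) come from the IFU description above; the values and exact nesting here are illustrative only — the authoritative specimen is the JSON example referenced in the table:

```json
{
  "conclusions": [
    { "name": "ICD category A", "probability": 0.62 },
    { "name": "ICD category B", "probability": 0.21 }
  ],
  "entropy": 34,
  "explainabilityMedia": { "contentType": "image/png", "data": "<Base64 heat map>" }
}
```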
LR requirements correctly listed in R-DAG:
- LR-9WR (Device outputs interpretation guidance): Explains probability distribution format, entropy scores, heat maps, clinical indicator meanings
- LR-4RZ (Warnings and precautions): Warns that outputs support (not replace) clinical judgment; requires review of explainability media
- LR-8YN (Device supervision requirement): Mandates HCP supervision; final diagnostic decisions remain with HCP
- LR-4XK (Read the IFU before use): Directs users to the complete IFU
Gap: BSI notes that "none of the tests appear to verify information about device outputs in the IFU." This overlaps with M2 Q2, which also flags that labeling requirements verification evidence could not be found. The LR codes in R-DAG are the correct mitigation references, but the verification chain for labeling requirements is incomplete. Our M2 response will establish the LR verification chain; N3 can cross-reference it.
Corrective action: No change needed to R-DAG's mitigation requirements for this item (LR codes are correct). The labeling verification gap is addressed systemically in M2 Q2.
Mitigation 2: "The medical device returns metadata about the output that helps supervising it, such as explainability media and other metrics"
Status: Implemented and verified. Traceability broken — wrong SRS codes and test cases referenced in R-DAG.
SRS requirements that implement this mitigation (exist but NOT listed in R-DAG):
| SRS code | Title | What it requires |
|---|---|---|
| SRS-0AB | Generate per-image ICD analysis with explainability heat map | For each image, generate: ICD category probabilities + explainability object with Base64-encoded heat map (heatMap), its contentType, and title |
| SRS-K7M | Orchestrate diagnosis support workflow | Generate pixel-level attention indicators (heat maps or saliency masks) that highlight image regions most influential to each predicted category |
Note: SRS-Q9M (Clinical Signs Analysis Endpoint) was considered but excluded. SRS-Q9M covers the POST /clinical-signs-analysis severity assessment endpoint, which is a different analysis pathway from the ICD diagnosis workflow. R-DAG's risk is specifically about the ICD interpretive distribution, so only SRS codes directly implementing the ICD pathway should be referenced to keep traceability tight and defensible.
Test cases that verify this mitigation (exist but NOT listed in R-DAG):
| Test ID | Case ID | Title | What it verifies | SRS |
|---|---|---|---|---|
| T123 | C256 | Verify response includes per-image ICD probabilities and heat maps for top five categories | explanation.attentionMap objects, colour model data, Base64-encoded image data | SRS-0AB |
| T132 | C265 | Verify diagnosis workflow returns ranked ICD-11 codes, binary indicators, and explainability maps | Entropy of result, pixel-level attention indicators (heat maps/saliency masks) for top-5 conclusions | SRS-K7M |
What is currently in R-DAG instead: SRS-7PJ (API port listening), SRS-AQM (HTTP status codes), etc., verified by C50 (accepts HTTP requests), C62 (returns 200), etc. — entirely unrelated to explainability.
Corrective action: Add SRS-0AB, SRS-K7M to mitigationRequirements. Add C256 (T123), C265 (T132) to verificationOfImplementation.
Mitigation 3: "The device returns an interpretative distribution representation of possible ICD categories, not just one single condition"
Status: Implemented and verified. Traceability broken — wrong SRS codes and test cases referenced in R-DAG.
SRS requirements that implement this mitigation (exist but NOT listed in R-DAG):
| SRS code | Title | What it requires |
|---|---|---|
| SRS-Q3Q | Generate an aggregated ICD probability distribution from a set of images | Return a normalized probability distribution across all ICD categories (not a single diagnosis). Each element contains: calculated probability, official ICD code, display name, system identifier, and version |
| SRS-K7M | Orchestrate diagnosis support workflow | Compute normalized probability vector across all supported ICD-11 categories (sum = 100%). Generate top-5 ranked output with ICD-11 codes and confidence scores |
Test cases that verify this mitigation (exist but NOT listed in R-DAG):
| Test ID | Case ID | Title | What it verifies | SRS |
|---|---|---|---|---|
| T122 | C255 | Verify API returns aggregated ICD probability distribution with structured code details | hypotheses array with numeric probability fields, valid ICD-11 code structures, distribution across all categories | SRS-Q3Q |
| T132 | C265 | Verify diagnosis workflow returns ranked ICD-11 codes, binary indicators, and explainability maps | Top-5 ranked ICD-11 categories, probability sum = 100% across full distribution, entropy, five binary indicators | SRS-K7M |
Additionally, the AI Models Integration Tests (T307-T379, C466-C539) verify that each individual AI model produces correct probability_distribution outputs and icd_distribution data with entropy scores and top-5 predictions — providing model-level evidence that the interpretive distribution is generated correctly at every layer of the system.
Corrective action: Add SRS-Q3Q, SRS-K7M to mitigationRequirements. Add C255 (T122), C265 (T132) to verificationOfImplementation. Consider referencing the AI Models Integration Tests (T307-T379) as additional model-level verification evidence.
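The distribution properties these test cases describe can be sketched in a few lines. This is a minimal sketch of the checks attributed to C255/C265 above (normalized distribution, top-5 ranking, entropy-based uncertainty); the helper itself and the hypotheses field shape are assumptions for illustration, not code from the test suite:

```python
import math

def check_icd_distribution(hypotheses, tol=1e-6):
    """Check the properties C255/C265 describe: a normalized probability
    vector across ICD categories, a top-5 ranking, and an entropy score."""
    probs = [h["probability"] for h in hypotheses]
    assert all(0.0 <= p <= 1.0 for p in probs), "probabilities must lie in [0, 1]"
    assert abs(sum(probs) - 1.0) < tol, "full distribution must sum to 100%"
    # Top-5 ranked categories by descending probability
    top5 = sorted(hypotheses, key=lambda h: h["probability"], reverse=True)[:5]
    # Shannon entropy scaled to 0-100 (100 = all categories equally likely)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    entropy_score = 100.0 * h / math.log(len(probs))
    return top5, entropy_score
```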
Mitigation 4: "AI models undergo retraining using expanded dataset of images"
Status: This is a prospective lifecycle/process control, not a software feature. It has no software-level traceability because it should not have any.
This mitigation is fundamentally different from mitigations 1-3. It is not something the device software does at runtime — it is something the organisation does as part of its AI lifecycle management. It is:
- Defined in GP-028 AI Development, § AI Updates → Retraining: "Retraining is performed when an algorithm's core logic or data foundation is modified. This includes training on new or updated data, implementing a new model architecture, or changing key parameters/hyperparameters."
- Documented via R-TF-028-007 AI Retraining Report (mandatory output of any retraining)
- Governed by GP-024 PCCP (Predetermined Change Control Plan), which classifies retraining as a minor or major AI model version change
- Verified through R-TF-028-010 AI V&V Checks (mandatory verification before any retrained model is released)
- Monitored via GP-028 post-market surveillance provisions, which feed back into retraining decisions
Relevant documents:
| Document | Path |
|---|---|
| GP-028 AI Development | apps/qms/docs/procedures/GP-028/index.mdx |
| GP-024 PCCP | apps/qms/docs/procedures/GP-024/index.mdx |
| T-028-007 AI Retraining Report template | apps/qms/docs/procedures/GP-028/Templates/T-028-007.mdx |
| R-TF-028-010 AI V&V Checks (v1.1.0.0) | apps/qms/docs/legit-health-plus-version-1-1-0-0/product-verification-and-validation/artificial-intelligence/r-tf-028-010-aiml-vv-checks.mdx |
Important distinction — prospective vs. completed: No retraining has been performed for v1.1.0.0 (no completed R-TF-028-007 record exists). Retraining is a prospective control: it will be triggered when PCCP criteria are met (e.g., post-market data indicating performance drift, new training data available). The mitigation statement in R-DAG should therefore be reworded to reflect this accurately:
- Current wording (misleading): "AI models undergo retraining using expanded dataset of images."
- Proposed wording: "AI models are subject to retraining under expanded datasets as governed by GP-028 (§ AI Updates → Retraining) and GP-024 (PCCP), with verification through R-TF-028-010 (AI V&V Checks) before any retrained model is released."
This wording honestly describes the control without implying retraining has already occurred for this version.
Gap: The risk management record currently references only software test cases in verificationOfImplementation. There is no mechanism to reference process-level controls. The retraining mitigation has no explicit traceability at all in R-TF-013-002.
Corrective action:
- Reword the mitigation statement in implementedMitigations to use the proposed wording above.
- Add a reference to GP-028 (§ AI Updates → Retraining), GP-024 (PCCP), and R-TF-028-010 (AI V&V Checks) in verificationOfImplementation. This requires extending the verification text to include process-level references alongside test case references.
- In the response to BSI, explicitly explain that retraining is a lifecycle control verified through QMS process adherence, not through runtime software tests, and that it is a prospective control governed by PCCP.
"It is unclear if other risks are similarly impacted" — Systematic audit results
BSI explicitly asks whether other risks have the same traceability gap. A systematic audit of all 62 risks in R-TF-013-002 was performed, checking three criteria:
- Whether mitigationRequirements SRS codes are just copies of causeRequirements (rather than codes implementing the actual mitigations)
- Whether verificationOfImplementation test cases verify the mitigation requirements (not just the cause requirements)
- Whether process-level controls (e.g. retraining) have any traceability at all
Audit findings summary
29 out of 62 risks have some form of the traceability gap BSI identified in R-DAG. They fall into three categories:
Category A: Identical cause/mitigation codes with infrastructure-only verification (21 risks) — CRITICAL
These risks have mitigationRequirements SRS codes identical to causeRequirements — no additional mitigation codes were added. Their verification test cases only cover infrastructure (API port, HTTP status codes, JSON format, authentication, versioning). This is the exact pattern BSI flagged in R-DAG.
Infrastructure/API group (cause = SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS):
| Risk ID | Risk name | Mitigation type | Gap |
|---|---|---|---|
| R-T8Q | Data transmission failure from HCP system | Error handling + availability | No SRS codes for error handling or availability mitigations |
| R-3N5 | Data input failure | Error handling + availability | Same as R-T8Q |
| R-YF4 | Data accessibility failure | Error handling + availability | Same as R-T8Q |
| R-LRP | Data transmission failure | Error messages + FHIR | No LR codes for FHIR IFU documentation |
| R-MWD | Interruption of service | Elastic scaling, backups, REST | No SRS/LR codes for scaling or backup mitigations |
| R-OM1 | Data overwrite | REST protocol immutability | Architectural argument, no distinct mitigation code |
| R-B63 | Inconsistent or unreliable output | Algorithm V&V with representative datasets | Process-level (GP-012), no requirement code |
| R-VL1 | Device failure or performance degradation | Elastic scaling + error messages | No SRS for auto-scaling; no LR for error messaging |
| R-72D | SOUP anomaly/incompatibility | Careful SOUP analysis | Process-level mitigation, no requirement trace |
| R-MQ1 | SOUP not maintained nor patched | SOUP monitoring and patching | Process-level mitigation, no requirement trace |
Regulatory/GSPR group:
| Risk ID | Risk name | Mitigation type | Gap |
|---|---|---|---|
| R-QLF | Non-compliance with GSPR | Develop per harmonised standards | Process-level, no SRS/LR trace |
| R-ES8 | Absence of risk management process | ISO 14971 implementation | Process-level, no SRS/LR trace |
| R-C6Q | Absence of PMS & PMCF process | PMS/PMCF plans | Process-level, no SRS/LR trace |
| R-27M | Inadequate maintenance | Maintenance plan | Process-level, no SRS/LR trace |
| R-HH0 | Electronic data tampered | OAuth/JWT, encryption, SSL/TLS | Security SRS codes exist (SRS-1KW, SRS-WER, SRS-SDZ, SRS-WGF) but are NOT referenced |
| R-9SS | SOUP cybersecurity vulnerabilities | SOUP analysis + design review | Process-level, no requirement code |
| R-33B | Electronic IFU tampered | GPG signed commits, RBAC, branch approvals | Toolchain controls, no product-level SRS/LR codes |
AI/ML group:
| Risk ID | Risk name | Mitigation type | Gap |
|---|---|---|---|
| R-GY6 | Inaccurate training data | Careful image selection, hired HCPs | Process-level, no requirement trace |
| R-7US | Biased or incomplete training data | Same as R-GY6 | Same gap |
| R-75L | Stagnation of model performance | Plan for retraining, data augmentation | Process-level, no requirement trace |
| R-PWK | Degradation of model performance | Manual retraining, data augmentation | Process-level, no requirement trace |
Category B: Retraining mitigation with no traceability (5 risks) — HIGH
These risks include "AI models undergo retraining" as an implemented mitigation but have no corresponding requirement code or process-level verification reference:
| Risk ID | Risk name | Mitigation wording | Additional issue |
|---|---|---|---|
| R-DAG | Wrong result (ICD distribution) | "AI models undergo retraining using expanded dataset of images" | The original BSI finding |
| R-75H | Incorrect clinical information | "AI models undergo retraining using expanded dataset of images" | Same infrastructure-only verification as R-DAG |
| R-SKK | Incorrect results shown to patient | "AI models undergo retarining [sic] using expanded dataset of images" | Typo: "retarining" → "retraining" |
| R-75L | Stagnation of model performance | "We plan for re-training during the design and development process" | Also in Category A |
| R-PWK | Degradation of model performance | "we plan for exclusively manual retraining" | Also in Category A |
Category C: Risks with better traceability (not impacted)
R-BDR (Misinterpretation of data returned by the device) was initially suspected but appears better traced than R-DAG. It adds LR codes (LR-4XK, LR-9WR, LR-8HV, LR-5TG) beyond the cause codes, and its verification test set (C368, C369, C373, C374, etc.) includes FHIR-specific tests, not just the generic infrastructure set. However, R-BDR should still be reviewed to confirm its LR verification chain is complete.
The remaining 33 risks either have no mitigations (risks accepted without control), have correctly differentiated mitigation codes, or have mitigations whose traceability is appropriate.
How to report this to BSI
The response should:
- Acknowledge that the audit found additional risks with the same traceability pattern
- Categorise the findings: (a) risks where mitigation codes need correction, (b) risks where process-level controls need traceability references
- State that all affected risks have been corrected in the updated R-TF-013-002 (red-lined version provided)
- Note the R-SKK typo correction as part of the update
- Confirm that risks not in these categories were verified as correctly traced
Relationship with other NCs
| NC | Overlap with N3 | How to handle in N3 response |
|---|---|---|
| M2 Q2 | Labeling requirements (LR-XXX) verification gap. The LR codes in R-DAG are correct, but the verification evidence for labeling requirements is also questioned in M2. Our M2 response establishes the LR verification chain. | N3 should state: "The LR codes (LR-4XK, LR-9WR, LR-4RZ, LR-8YN) are the correct mitigation references for this item. These labeling requirements are verified against the IFU content as documented in R-TF-012-037; the complete verification evidence for labeling requirements is provided in our response to M2 Q2." This makes N3 self-contained while avoiding duplication. |
| M1 Q4 | BSI found that response.json for test T377 was missing icd_distribution and top_5_predictions keys. This relates directly to mitigations 2 and 3 of R-DAG (probability distribution, ICD categories). | N3 should note that the AI Models Integration Tests (T307-T379) provide model-level verification evidence for ICD distributions, and reference M1 Q4 for the detailed explanation of the test evidence format. |
Response strategy
Approach: Acknowledge the traceability gap, demonstrate the implementations exist, provide corrected documentation, and report the results of a systematic audit of all risks.
The response to BSI should:
- Acknowledge that BSI correctly identified a traceability gap in R-TF-013-002 for R-DAG, per ISO 14971:2019 clause 7.2 (verification of implementation of risk control measures)
- Provide a mitigation-by-mitigation mapping for R-DAG showing: mitigation statement → SRS/LR requirement(s) → test case(s) → result, demonstrating compliance with ISO 14971:2019 clause 7.2 and Annex II 6.1(b)
- Explain that "retraining" is a prospective lifecycle control governed by GP-028 and GP-024 (PCCP), which will be verified through R-TF-028-010 (AI V&V Checks) before any retrained model is released — not through runtime software tests. Cite ISO 14971:2019 clause 7.2 note on risk control measures that may include "inherent safety by design, protective measures, or information for safety"
- Confirm that the retraining mitigation statement has been reworded to accurately reflect its prospective nature
- State that R-TF-013-002 has been updated with correct traceability for R-DAG (red-lined version provided), satisfying Annex II 6.2(f)
- Report audit results: A systematic audit of all 62 risks identified 29 risks with analogous traceability gaps (21 with identical cause/mitigation codes, 5 with untraced retraining mitigations, plus overlap). All have been corrected in the updated R-TF-013-002. This addresses ISO 14971:2019 clause 7.6 (completeness of risk control)
- Confirm that the benefit-risk analysis conclusions in R-TF-013-002 are unchanged by the traceability corrections, per ISO 14971:2019 clause 7.4
- Cross-reference M2 Q2 for the labeling requirements verification chain, while keeping N3 self-contained
- Reference GSPR 1 (intended performance), GSPR 4 (risk management system), and GSPR 17.2 (diagnostic accuracy) to tie corrective actions back to the cited requirements
Decision: infrastructure SRS codes in R-DAG
Decision: Keep existing infrastructure codes AND add the correct mitigation codes (Option C).
Rationale: The infrastructure codes (SRS-7PJ, SRS-AQM, SRS-BYJ, SRS-DW0, SRS-D3N, SRS-LBS) provide the foundational transport layer through which the clinical outputs are delivered. While they do not directly implement the mitigations BSI flagged, removing them could be seen as overcorrection and BSI has not asked for their removal. The correct approach is to add the missing mitigation-specific codes (SRS-Q3Q, SRS-0AB, SRS-K7M) alongside the existing infrastructure codes, making the traceability chain complete.
Corrective actions summary
| # | Action | What to change | File | Status |
|---|---|---|---|---|
| 1 | Add correct SRS codes to R-DAG mitigationRequirements | Add SRS-Q3Q, SRS-0AB, SRS-K7M | R-TF-013-002.json | To do |
| 2 | Add correct test cases to R-DAG verificationOfImplementation | Add C255 (T122), C256 (T123), C265 (T132); reference AI Models Integration Tests (T307-T379) as additional model-level evidence | R-TF-013-002.json | To do |
| 3 | Add process-level traceability for retraining | Reference GP-028 (§ AI Updates → Retraining), GP-024 (PCCP), R-TF-028-010 in verificationOfImplementation | R-TF-013-002.json | To do |
| 4 | Reword retraining mitigation statement | Change from present-tense "undergo" to prospective "are subject to" wording | R-TF-013-002.json | To do |
| 5 | Fix R-SKK typo | "retarining" → "retraining" | R-TF-013-002.json | To do |
| 6 | Correct all 29 audited risks | For each: add correct mitigation codes, add process-level references where applicable, verify test case mapping | R-TF-013-002.json | To do |
| 7 | Add security SRS codes to R-HH0 | Add SRS-1KW, SRS-WER, SRS-SDZ, SRS-WGF (exist but not referenced) | R-TF-013-002.json | To do |
| 8 | Keep infrastructure SRS codes in R-DAG | Do NOT remove SRS-7PJ, SRS-AQM, etc. — add alongside, not replace | R-TF-013-002.json | Decision made |
| 9 | Generate red-lined R-TF-013-002 PDF | For BSI submission | Export from QMS | To do |