Response
Description of the Non-Conformity
During the audit, it was noted that the test results for T377 in response.json do not structurally align with the expected results in master.csv (e.g., missing icd_distribution and top_5_predictions keys). Furthermore, the expected values for entropy and probability were similar to, but not mathematically identical to, those in response.json.
Root Cause Analysis
We have reviewed the discrepancy with our Medical Data Science (MDS) and Engineering teams. The observed variances stem from two distinct, expected conditions in our transition from a research environment to a production clinical environment:
- Schema Mismatch (Structural): The baseline expected results provided by the MDS team use internal research-oriented data structures, such as partitioned lists for top_5_predictions and full_distribution. Conversely, the medical device exposes a production-ready clinical data model via its API, using flattened arrays mapped to formal clinical ontologies (e.g., studyAggregate.findings.hypotheses). The discrepancy is a known artifact of structural translation, not a functional failure.
- Probability & Entropy Variance (Numerical Non-Determinism): The minor variance in probability and entropy values is caused by numerical non-determinism across hardware architectures. The MDS development environment used NVIDIA RTX 6000 GPUs, whereas the production inference environment uses NVIDIA L4 Tensor Core GPUs. Different low-level CUDA optimizations produce minor floating-point rounding differences, which are amplified by non-linear calculations such as softmax and entropy. These represent minor precision shifts rather than a divergence in clinical model behavior.
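The amplification effect described above can be sketched in a few lines. This is an illustrative simulation, not the production inference code: the logit values and the per-element drift (standing in for GPU-specific rounding) are hypothetical, chosen only to show that such perturbations propagate through softmax and entropy yet remain orders of magnitude below a clinically meaningful difference.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; skip zero probabilities
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative logits; the per-element drift stands in for
# hardware-specific floating-point rounding between GPU architectures.
logits_rtx = [2.0, 1.0, 0.5, -0.3]
drift = [1e-4, -2e-4, 5e-5, -1e-4]
logits_l4 = [x + d for x, d in zip(logits_rtx, drift)]

p_rtx, p_l4 = softmax(logits_rtx), softmax(logits_l4)
max_prob_diff = max(abs(a - b) for a, b in zip(p_rtx, p_l4))
entropy_diff = abs(entropy(p_rtx) - entropy(p_l4))
print(max_prob_diff, entropy_diff)  # both well below 0.01
```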
Corrective Actions and Updated Documentation
To resolve this non-conformity and prevent future occurrences, we have updated our Quality Management System and test execution records to formalize the evaluation of AI/ML outputs:
Action 1: Systemic Update to R-TF-012-033 Software Test Plan
We have introduced a new section, AI/ML Model Output Verification and Schema Mapping, into our approved Software Test Plan. This establishes the following regulatory controls:
- Schema Translation: Direct key-for-key structural comparisons are invalid without an approved mapping defined in the test case's pass/fail criteria. Test cases defined in TestRail must explicitly document the structural mapping between the MDS expected format and the production API format.
- Evaluation Rule for Probabilities: Test automation must evaluate numerical outputs using an MDS-approved absolute tolerance threshold of 0.01. This verifies correct clinical behavior while tolerating hardware-specific floating-point variation.
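As one way the tolerance rule above could be automated (the function and variable names below are illustrative, not taken from the approved test plan):

```python
ABS_TOL = 0.01  # MDS-approved absolute tolerance threshold

def numeric_fields_match(expected: dict, actual: dict, tol: float = ABS_TOL) -> bool:
    # Compare only the keys present in the expected record; each numeric
    # value must agree within the absolute tolerance (no direct ==).
    return all(abs(expected[k] - actual[k]) <= tol for k in expected)

# Hypothetical expected vs. production values for T377
expected = {"probability": 0.8731, "entropy": 0.4120}
actual = {"probability": 0.8736, "entropy": 0.4118}
print(numeric_fields_match(expected, actual))  # True: within 0.01
```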
Action 2: Update to Objective Evidence
The expected_results field for the affected test cases (T377, T378, T379) in TestRail has been formally revised and approved to include:
- An explicit mapping table detailing how keys such as icd_distribution and name translate to the API's studyAggregate.findings.hypotheses and concepts.
- A strict acceptance criterion for numerical outputs, stating that direct equality (==) is not applicable and defining the acceptable variance as an absolute difference of < 0.01.
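A mapping table of this kind might be encoded in test automation roughly as follows. Only the key names quoted in the record (icd_distribution, name, studyAggregate.findings.hypotheses, concepts) come from the documentation; the payload shape and values are hypothetical placeholders.

```python
# Hypothetical encoding of the documented research-to-production key mapping
KEY_MAP = {
    "icd_distribution": "studyAggregate.findings.hypotheses",
    "name": "concepts",
}

def resolve(payload, dotted_path):
    """Walk a dotted path ('a.b.c') through nested dicts."""
    node = payload
    for part in dotted_path.split("."):
        node = node[part]
    return node

# Illustrative production API payload
api_response = {
    "studyAggregate": {
        "findings": {
            "hypotheses": [{"concepts": "I21.9", "probability": 0.87}]
        }
    }
}

hypotheses = resolve(api_response, KEY_MAP["icd_distribution"])
print(hypotheses[0][KEY_MAP["name"]])  # -> "I21.9"
```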
Conclusion
The structural and numerical differences identified are expected behaviors of the production system. By formalizing the schema translation and establishing a quantitative tolerance threshold in our Software Test Plan and TestRail records, we have established objective, traceable pass/fail criteria. We kindly request that this non-conformity be reviewed and closed based on the provided systemic and case-specific documentation updates.