R-TF-028-001 AI/ML Development Plan
Abbreviations
Term | Definition |
---|---|
AI/ML | Artificial Intelligence / Machine Learning |
AUC | Area Under the Receiver Operating Characteristic Curve |
GDPR | General Data Protection Regulation |
GMLP | Good Machine Learning Practice |
ICD | International Classification of Diseases |
ONNX | Open Neural Network Exchange |
QMS | Quality Management System |
RPN | Risk Priority Number |
ViT | Vision Transformer |
XAI | Explainable Artificial Intelligence |
Introduction
Context
Legit.Health Plus provides advanced Clinical Decision Support (CDS) through AI/ML algorithms designed to assist qualified healthcare professionals in the assessment of dermatological conditions. The algorithms analyze clinical and dermoscopic images of skin lesions to generate objective, data-driven insights. It is critical to note that the device is intended to augment, not replace, the clinical judgment of a healthcare professional.
The core AI/ML functionality is delivered through two algorithm types:
- An ICD Category Distribution Algorithm: A multiclass classification model that processes a lesion image and outputs a ranked probability distribution across relevant ICD-11 categories, presenting the top five differential diagnoses.
- Binary Indicator Algorithms: Derived from the primary model's output, these algorithms provide three discrete indicators for case prioritization: Malignancy, Dermatological Condition, and Critical Complexity.
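To make the relationship between the two algorithm types concrete, the sketch below shows how a top-five differential and a binary indicator could be derived from a single class-probability vector. The category labels, indicator grouping, and threshold are illustrative placeholders; the device's actual label space, mappings, and cut-offs are specified in R-TF-028-001.

```python
import numpy as np

# Placeholder ICD-11 category labels and indicator grouping; the device's
# actual label space and mapping are specified in R-TF-028-001.
CATEGORIES = ["melanoma", "basal cell carcinoma", "melanocytic naevus",
              "psoriasis", "atopic dermatitis", "urticaria"]
MALIGNANT_CATEGORIES = {"melanoma", "basal cell carcinoma"}


def top_differential(probs: np.ndarray, k: int = 5) -> list:
    """Rank the class-probability vector and return the top-k categories."""
    order = np.argsort(probs)[::-1][:k]
    return [(CATEGORIES[i], float(probs[i])) for i in order]


def malignancy_indicator(probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Derive a binary Malignancy indicator by aggregating the probability
    mass assigned to malignant categories (illustrative rule only)."""
    mass = sum(float(probs[i]) for i, c in enumerate(CATEGORIES)
               if c in MALIGNANT_CATEGORIES)
    return mass >= threshold


# Example with a dummy probability vector (sums to 1).
probs = np.array([0.40, 0.15, 0.20, 0.10, 0.10, 0.05])
print(top_differential(probs))       # ranked top-5 differential
print(malignancy_indicator(probs))   # True (0.40 + 0.15 >= 0.5)
```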
Objectives
The primary objectives of this development plan are to:
- Develop a robust ICD Category Distribution algorithm to assist clinicians in formulating a differential diagnosis, thereby enhancing diagnostic accuracy and efficiency, while meeting the performance endpoints specified in R-TF-028-001.
- Develop three highly performant Binary Indicator algorithms to provide clear, actionable signals for clinical workflow prioritization, meeting the AUC thresholds defined in R-TF-028-001.
- Ensure the entire development lifecycle adheres to the company's QMS, GMLP principles, and applicable regulations (MDR 2017/745, ISO 13485) to deliver safe and effective algorithms.
Team
Role | Description And Responsibilities | Person(s) |
---|---|---|
Technical Manager | Overall management of team planning and resources. Ensuring alignment with QMS procedures. Application of this procedure. | Alfonso Medela |
Design & Development Manager | Manages the design and development lifecycle, including verification and validation activities in accordance with GP-012. | Taig Mac Carthy |
AI Team | Develops, validates, and maintains the AI/ML algorithms. Responsible for data management, training, evaluation, and release processes. | |
Project Management
Meetings
- Sprint Meetings: The project follows an Agile framework with 2-week sprints. Bi-weekly meetings are held for sprint review, retrospective analysis, and planning.
- Daily Stand-ups: The AI team conducts daily stand-up meetings to synchronize progress, address impediments, and align on daily priorities.
- Technical Reviews: Bi-weekly or monthly meetings are held to present key R&D findings, review model architectures, and discuss experimental results with cross-functional stakeholders.
Management Tools
Tool | Description |
---|---|
Jira | To manage the product backlog, plan sprints, and track all tasks, bugs, and user stories with full traceability. |
GitHub | Central repository for technical documentation, design specifications, meeting minutes, and sprint reports. |
Project Planning
The Technical Manager is responsible for the overall project planning and monitoring, ensuring that development milestones align with the product roadmap and regulatory timelines.
Environment
Development Tools
Tool | Description |
---|---|
Bitbucket / Git | For rigorous version control of all source code, models, and critical configuration files. Enforces peer review via pull requests. |
Docker | To create containerized, reproducible environments, ensuring consistency between development, testing, and deployment. |
MLflow / Weights & Biases | For systematic tracking of experiments, including parameters, metrics, code versions, and model artifacts, ensuring full reproducibility. |
Development Software
Software | Description |
---|---|
Python >=3.9 | Primary programming language. |
TensorFlow >=2.10 / PyTorch >=1.12 | State-of-the-art deep learning frameworks. |
CUDA / cuDNN | NVIDIA libraries for GPU acceleration. |
NumPy, Pandas, Scikit-learn, OpenCV | Core libraries for data manipulation, image processing, and performance evaluation. |
Flake8 / Black / MyPy / Pytest | A suite of tools to enforce code quality, style, type safety, and correctness through automated testing. |
Development Environment
AI/ML development is conducted on a secure, high-performance computing infrastructure.
Environment | Description |
---|---|
Research Server (Ubuntu 22.04 LTS) | Primary environment for model training, evaluation, and experiment management. |
Database | PostgreSQL instance for structured storage of annotations and metadata. |
Data Storage | Secure, access-controlled cloud storage (e.g., AWS S3, Google Cloud Storage) for medical images. |
Research Server Minimum Requirements:
- OS: Ubuntu 22.04 LTS or higher
- GPU: NVIDIA A100 or H100 (or equivalent) with >= 40 GB VRAM
- CPU: `>= 32` cores @ `>= 2.5` GHz
- RAM: `>= 128` GB
- Storage: `>= 5` TB of high-speed NVMe SSD storage
AI/ML Development Plan
Development Cycle
The AI/ML development adheres to the three-phase cycle mandated by procedure GP-028 AI Development, ensuring a structured progression from design to release.
Development Specifications
All development is strictly governed by the specifications in R-TF-028-001 AI/ML Description. This document serves as the primary input for design and defines the acceptance criteria for V&V.
Development Steps
- Data Management: Sourcing, curating, annotating, and partitioning data according to GMLP.
- Training & Evaluation: Building, training, tuning, and rigorously evaluating models.
- Release (V&V): Finalizing, documenting, and packaging the model for software integration.
Data Management Plan
Good Practices
Data Collection & Curation
- Representativeness: In line with GMLP principles, data is collected to be highly representative of the intended patient population. Active measures are taken to ensure diversity across age, sex, and all six Fitzpatrick skin phototypes to promote equitable performance.
- Protocols: Data acquisition follows the detailed clinical and technical requirements in R-TF-028-003, ensuring consistency in image quality.
- Compliance: All data processing is fully compliant with GDPR. Data is de-identified at the source, and robust data protection impact assessments are conducted.
Data Quality & Integrity
- Annotation: Data is labeled by qualified dermatologists following R-TF-028-004. Critical labels are subject to a multi-annotator review process to ensure high quality and consistency.
- Traceability: Data is managed using version-controlled snapshots. Each snapshot is an immutable, timestamped collection of data and labels, ensuring a complete audit trail from data to the final model.
Ground Truth Determination
- Methodology: The ground truth for diagnoses is established by a panel of at least three board-certified dermatologists. Discrepancies are resolved by a senior reviewer or through histopathological correlation where available and clinically appropriate. This robust process minimizes label noise and ensures a high-fidelity reference standard.
Sequestration of Test Data
- Partitioning: The dataset is partitioned at the patient level into training, validation, and test sets. This strict separation is critical to prevent data leakage and ensure that the final performance evaluation is unbiased.
- Shielding: The test set is a sequestered, held-out dataset used only once for the final, unbiased evaluation of the selected model. It is never used for training, tuning, or model selection.
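As an illustration of patient-level partitioning, the following sketch uses scikit-learn's GroupShuffleSplit to keep all images from a given patient in a single partition. The column names, split ratios, and seed are assumptions for the example, not values taken from this plan.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def patient_level_split(df: pd.DataFrame, seed: int = 42):
    """Split a data snapshot into train/validation/test sets so that no
    patient appears in more than one partition (illustrative 70/15/15)."""
    # Carve out the sequestered test set first (~15% of the data, by patient).
    outer = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=seed)
    dev_idx, test_idx = next(outer.split(df, groups=df["patient_id"]))
    dev, test = df.iloc[dev_idx], df.iloc[test_idx]

    # Split the remainder into training and validation sets (~15% of the
    # full snapshot goes to validation, i.e. 0.15 / 0.85 of the remainder).
    inner = GroupShuffleSplit(n_splits=1, test_size=0.15 / 0.85, random_state=seed)
    train_idx, val_idx = next(inner.split(dev, groups=dev["patient_id"]))
    return dev.iloc[train_idx], dev.iloc[val_idx], test


# Leakage check: no patient identifier may be shared across partitions.
# train, val, test = patient_level_split(snapshot_df)
# assert not set(train["patient_id"]) & set(test["patient_id"])
```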
Working Plan
- Data is collected, de-identified, and securely stored.
- Data is annotated according to the defined multi-stage review process.
- A versioned data snapshot is created and frozen.
- The snapshot is split by patient ID into training, validation, and test sets. The test set is immediately sequestered.
- The snapshot version and split definitions are logged for full reproducibility.
Training & Evaluation Plan
Good Practices
Reproducibility and Traceability
- Versioning: Every component is versioned: Git for code, DVC for data, and MLflow for experiments. Each trained model is linked to the exact code, data, and hyperparameters used to create it.
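As an illustration of this traceability chain, a training run could log the Git commit, data snapshot identifier, hyperparameters, metrics, and model artifact in a single MLflow run. The tag and configuration key names below are illustrative, not an established project convention.

```python
import subprocess

import mlflow


def log_training_run(config: dict, metrics: dict, model_path: str) -> None:
    """Record a training run so the resulting model can be traced back to
    the exact code revision, data snapshot, and hyperparameters used."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    with mlflow.start_run():
        mlflow.set_tag("git_commit", commit)
        mlflow.set_tag("data_snapshot", config["data_snapshot"])  # e.g. a DVC tag
        mlflow.log_params(config["hyperparameters"])
        mlflow.log_metrics(metrics)
        mlflow.log_artifact(model_path)                           # trained weights
```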
Model Design & Selection
- Architecture: Model selection is informed by a systematic review of state-of-the-art architectures (e.g., ViT, ConvNeXt, EfficientNetV2).
- Hyperparameter Optimization: A structured approach (e.g., Bayesian optimization or grid search) is used to find the optimal set of hyperparameters.
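As a sketch of such a structured search, the example below uses Optuna (one possible tool; this plan does not mandate a specific library) to tune a few hyperparameters against validation AUC. The search space and the `train_and_evaluate` helper are hypothetical placeholders.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    """Sample a candidate hyperparameter set, train a model with it, and
    return the validation AUC. `train_and_evaluate` is a placeholder for
    the project's actual training entry point."""
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
    }
    return train_and_evaluate(**params)  # placeholder; returns validation AUC


study = optuna.create_study(direction="maximize")  # maximize validation AUC
study.optimize(objective, n_trials=50)
print("Best hyperparameters:", study.best_params)
```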
Model Training & Tuning
- Augmentation: A rich set of data augmentation techniques is used to improve generalization, including geometric transformations (rotation, scaling, flipping) and photometric distortions (brightness, contrast, color jitter) that reflect real-world variability.
- Overfitting Mitigation: In addition to augmentation, techniques like dropout, weight decay, and early stopping are employed to ensure models generalize well to unseen data.
- Model Calibration: Post-training calibration techniques (e.g., temperature scaling) are applied to ensure that the model's output probabilities are reliable and well-calibrated, meaning a predicted 80% confidence accurately reflects an 80% likelihood of correctness.
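A minimal sketch of temperature scaling follows, assuming held-out validation logits and labels are available as PyTorch tensors; it mirrors the standard post-hoc calibration recipe rather than describing the device's exact implementation.

```python
import torch
import torch.nn.functional as F


def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Learn a single temperature T that minimizes the negative log-likelihood
    of held-out validation predictions (standard temperature scaling)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so that T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())


# At inference time, calibrated probabilities are softmax(logits / T):
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```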
Model Evaluation & Validation
- Robustness Analysis: Performance is evaluated not just on aggregate metrics but also across key patient subgroups (e.g., by skin phototype, age, sex) to proactively identify and mitigate potential biases.
- Explainability (XAI): During development, XAI techniques (e.g., Grad-CAM, SHAP) are used to visualize and understand the model's decision-making process. This helps verify that the model is learning clinically relevant features and not relying on confounding artifacts.
- Statistical Rigor: All key performance metrics are reported with 95% confidence intervals to accurately represent statistical uncertainty.
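As an example of such statistical reporting, a 95% confidence interval for AUC can be estimated with a percentile bootstrap. The sketch below resamples cases with replacement and is illustrative of the approach, not the exact statistical protocol used for the final performance report.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true: np.ndarray, y_score: np.ndarray,
                     n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Point estimate of AUC with a percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue                           # skip resamples missing a class
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lower, upper)


# Example: auc, (lo, hi) = bootstrap_auc_ci(test_labels, test_scores)
```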
Working Plan
- A model configuration file specifies all parameters for a training run.
- The model is trained, with all metrics and artifacts logged in real-time to MLflow.
- A uniquely identified model package is generated, containing the model, its configuration, and training history.
- A final, comprehensive evaluation is performed on the held-out test set, with results and explainability analyses compiled into the final performance report.
Release Plan
Good Practices
- Equivalence Testing: Models are converted to a high-performance format (e.g., ONNX). Rigorous tests are run to verify near-identical numerical output between the original and converted models (see the sketch after this list).
- Comprehensive Reporting: The AI/ML Development Report (R-TF-028-005) provides a complete account of the development and V&V process, serving as objective evidence that the model is safe and effective.
- Clear Instructions: The AI/ML Release (R-TF-028-006) document provides the software team with precise integration specifications.
- Semantic Versioning: The algorithm release package is assigned a unique semantic version (e.g., `v1.0.0`), with full traceability to the versions of its constituent models.
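A minimal sketch of the equivalence check referenced in the first item above, assuming a PyTorch source model exported to ONNX; the input shape, sample count, and tolerance are illustrative choices, not the acceptance criteria from R-TF-028-001.

```python
import numpy as np
import onnxruntime as ort
import torch


def check_onnx_equivalence(model: torch.nn.Module, onnx_path: str,
                           n_samples: int = 100, atol: float = 1e-4) -> bool:
    """Compare the original PyTorch model and its ONNX export on random
    inputs and flag any output difference above the chosen tolerance."""
    model.eval()
    session = ort.InferenceSession(onnx_path)
    input_name = session.get_inputs()[0].name
    for _ in range(n_samples):
        x = torch.randn(1, 3, 224, 224)        # illustrative input shape
        with torch.no_grad():
            expected = model(x).numpy()
        actual = session.run(None, {input_name: x.numpy()})[0]
        if not np.allclose(expected, actual, atol=atol):
            return False
    return True
```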
Working Plan
- Verification is performed to confirm the model was developed according to this plan.
- Validation is performed to confirm the model meets the acceptance criteria in R-TF-028-001.
- The V&V results are documented in the AI/ML Development Report (R-TF-028-005).
- The final algorithm package and AI/ML Release (R-TF-028-006) are delivered to the software team.
Deliverables
Documentation
- All R-TF-028-xxx documents generated, including Description, Development Plan, Reports, and completed V&V checklists.
Algorithm Package
- 1 ICD Category Distribution algorithm (as `.onnx` file).
- 1 Binary Indicators configuration (as `.json` mapping file).
AI/ML Risk Management Plan
This plan focuses on risks inherent to the AI/ML development lifecycle, as recorded in R-TF-028-011 AI/ML Risk Matrix. This process is a key input into the overall device risk management activities governed by ISO 14971.
AI/ML Risk Management Process
- Risk Assessment: Systematically identifying, analyzing, and evaluating risks related to data, model training, and performance.
- Risk Control: Implementing and verifying mitigation measures for all unacceptable risks.
- Monitoring & Review: Continuously reviewing risks throughout the lifecycle.
AI/ML Risk Ranking System
Severity
Severity is based on the potential impact on model performance and its clinical utility.
Ranking | Definition | Severity |
---|---|---|
5 | Degrades model performance to a point of being fundamentally flawed or unsafe (e.g., systematically misclassifies critical conditions). | Catastrophic |
4 | Significantly degrades model performance, making it frequently unreliable or erroneous for its intended task. | Critical |
3 | Moderately degrades model performance, making it often erroneous under specific, plausible conditions. | Moderate |
2 | Slightly degrades model performance, making it sometimes erroneous or showing minor performance loss. | Minor |
1 | Negligibly degrades model performance with no discernible impact on clinical utility. | Negligible |
Likelihood
Likelihood of the risk occurring during development.
Ranking | Definition | Likelihood |
---|---|---|
5 | Almost certain to occur if not controlled. | Very high |
4 | Likely to occur. | High |
3 | May occur. | Moderate |
2 | Unlikely to occur. | Low |
1 | Extremely unlikely to occur. | Very low |
AI/ML Risk Priority Number and Acceptability
Severity →<br>Likelihood ↓ | Negligible (1) | Minor (2) | Moderate (3) | Critical (4) | Catastrophic (5) |
---|---|---|---|---|---|
Very high (5) | Tolerable (5) | Tolerable (10) | Unacceptable (15) | Unacceptable (20) | Unacceptable (25) |
High (4) | Acceptable (4) | Tolerable (8) | Tolerable (12) | Unacceptable (16) | Unacceptable (20) |
Moderate (3) | Acceptable (3) | Tolerable (6) | Tolerable (9) | Tolerable (12) | Unacceptable (15) |
Low (2) | Acceptable (2) | Acceptable (4) | Tolerable (6) | Tolerable (8) | Tolerable (10) |
Very low (1) | Acceptable (1) | Acceptable (2) | Acceptable (3) | Acceptable (4) | Tolerable (5) |
- Acceptable: RPN ≤ 4
- Tolerable: 5 ≤ RPN ≤ 12 (requires risk-benefit analysis)
- Unacceptable: RPN ≥ 15 (requires mitigation)
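Because every cell in the matrix above is the product of the severity and likelihood rankings, the acceptability class can be derived programmatically. The helper below is a small illustrative sketch of that rule, not part of the QMS tooling.

```python
def rpn_acceptability(severity: int, likelihood: int) -> tuple:
    """Compute the RPN (severity x likelihood) and its acceptability class
    according to the thresholds defined above."""
    if not (1 <= severity <= 5 and 1 <= likelihood <= 5):
        raise ValueError("severity and likelihood rankings must be between 1 and 5")
    rpn = severity * likelihood
    if rpn <= 4:
        return rpn, "Acceptable"
    if rpn <= 12:
        return rpn, "Tolerable (requires risk-benefit analysis)"
    return rpn, "Unacceptable (requires mitigation)"


# Example: a moderate-severity risk (3) that is likely to occur (4).
print(rpn_acceptability(severity=3, likelihood=4))  # (12, 'Tolerable (requires risk-benefit analysis)')
```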
Safety Risks Related to AI/ML
The AI team is responsible for identifying how AI/ML development risks can contribute to hazardous situations. These "safety risks related to AI/ML" are escalated to the product team for inclusion in the overall Safety Risk Matrix and are mitigated through a combination of technical controls and user-facing measures, in line with ISO 14971.
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001