R-TF-028-004 AI/ML Development Report
Introduction
Context
This report documents the development, verification, and validation of the AI/ML algorithm package for the Legit.Health Plus medical device. The development process was conducted in accordance with the procedures outlined in GP-028 AI Development and followed the methodologies specified in the R-TF-028-002 AI/ML Development Plan.
The algorithms are designed as offline (static) models. They were trained on a fixed dataset prior to release and do not adapt or learn from new data after deployment. This ensures predictable and consistent performance in the clinical environment.
Algorithms Description
The algorithm package consists of two core components that work in sequence to fulfill User Requirement REQ_004:
- ICD Category Distribution Algorithm: A deep learning model, based on a Vision Transformer (ViT) architecture, that analyzes a given dermatological image (clinical or dermoscopic). Its output is a normalized probability distribution across [NUMBER OF CATEGORIES] relevant ICD-11 categories. For the user, this is presented as the top five most likely diagnoses.
- Binary Indicator Algorithms: These are not separate trained models but a set of three indicators derived directly from the output of the ICD Category Distribution algorithm. A predefined, expert-curated mapping matrix (defined in R-TF-028-004) assigns each of the [NUMBER OF CATEGORIES] ICD-11 categories to one or more indicators. The final value for each indicator is calculated by summing the probabilities of all associated ICD-11 categories. The three indicators are:
- Malignancy
- Critical Complexity
- Dermatological Condition
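The derivation of the indicators from the category distribution can be illustrated with a minimal sketch. The category names, the mapping values, and the function name `binary_indicators` are hypothetical and chosen only for illustration; the actual mapping matrix is the expert-curated one referenced above.

```python
# Hypothetical categories and mapping, for illustration only.
CATEGORIES = ["category_a", "category_b", "category_c", "category_d"]

# Each list follows the order of CATEGORIES; 1 means the category
# contributes to that indicator. Every category maps to at least one indicator.
MAPPING = {
    "malignancy":               [1, 0, 0, 0],
    "critical_complexity":      [1, 1, 0, 1],
    "dermatological_condition": [1, 1, 1, 0],
}

def binary_indicators(probabilities, mapping=MAPPING):
    """Sum the probabilities of the categories assigned to each indicator."""
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("expected a normalized probability distribution")
    return {
        name: sum(p for p, flag in zip(probabilities, flags) if flag)
        for name, flags in mapping.items()
    }

# Example distribution over the four hypothetical categories.
probs = [0.10, 0.60, 0.25, 0.05]
indicators = binary_indicators(probs)
```

Because each indicator is the sum of a subset of a normalized distribution, its value always lies between 0 and 1.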
AI/ML Standalone Evaluation Objectives
The standalone validation aimed to confirm that the final algorithms meet the predefined performance criteria outlined in R-TF-028-001.
The primary objectives and endpoints were:
Algorithm | Objective | Endpoints | Success Criteria |
---|---|---|---|
ICD Category Distribution | Provide an accurate differential diagnosis suggestion. | Top-1, Top-3, and Top-5 Accuracy on a held-out test set. | Top-1 Accuracy ≥ 55% <br>Top-3 Accuracy ≥ 70% <br>Top-5 Accuracy ≥ 80% |
Binary Indicators | Provide reliable signals for case prioritization and assessment. | Area Under the ROC Curve (AUC) on a held-out test set. | AUC ≥ 0.80 for each of the three indicators. |
Data Management
Collection
The dataset was compiled from two distinct sources, as detailed in the respective data collection instructions:
- Public Datasets (R-TF-028-003): Images sourced from reputable online dermatological atlases (e.g., DermNet NZ, ISIC, PAD-UFES-20).
- Prospective Clinical Study (R-TF-028-004): Images collected under a formal protocol at the Hospital Universitario de Torrejón.
This combined approach resulted in a total dataset of [NUMBER OF IMAGES] RGB images, covering nearly 1,000 different initial categories and a diverse representation of age, sex, and skin phototypes.
Annotation, Truthing, and Consensus
- ICD-11 Labels: The primary diagnostic labels were sourced directly from the datasets, having been provided by medical experts. A thorough curation process was undertaken to standardize all taxonomies to the ICD-11 classification system.
- Binary Indicator Mapping: The ground truth for the binary indicators was established by creating a mapping matrix, as detailed in R-TF-028-004. This process involved a board-certified dermatologist assigning each of the final [NUMBER OF CATEGORIES] ICD-11 categories to the three indicators, followed by an independent review and consensus process.
Preparation and Partitioning
The final dataset was partitioned into three distinct sets: training, validation, and testing. To prevent data leakage and ensure an unbiased final evaluation, the split was performed at the subject level where subject IDs were available. For sources without subject IDs, a class-wise split was performed.
Crucially, some data sources, including the entire prospective clinical study dataset from H.U. Torrejón, were sequestered and reserved exclusively for the final test set.
Set | Purpose | Image Count |
---|---|---|
Training | Model fitting and parameter updates. | [Insert Number] |
Validation | Hyperparameter tuning and model selection. | [Insert Number] |
Test | Final, unbiased performance evaluation. | [Insert Number] |
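A subject-level split of the kind described above can be sketched as follows. The split fractions, the seed, and the function name `split_by_subject` are assumptions made for illustration; the report does not state the proportions actually used.

```python
import random
from collections import defaultdict

def split_by_subject(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole subjects (not single images) to train/validation/test.

    records: iterable of (image_id, subject_id) pairs.
    fractions: assumed train/validation/test proportions (illustrative).
    """
    by_subject = defaultdict(list)
    for image_id, subject_id in records:
        by_subject[subject_id].append(image_id)

    # Shuffle subjects deterministically so the split is reproducible.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_train = round(len(subjects) * fractions[0])
    n_val = round(len(subjects) * fractions[1])
    groups = {
        "train": subjects[:n_train],
        "validation": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }
    # Every image of a subject lands in the same set, which prevents leakage.
    return {
        name: [img for subject in members for img in by_subject[subject]]
        for name, members in groups.items()
    }

# Synthetic example: 20 subjects with 3 images each.
records = [(f"img_{s}_{i}", f"subject_{s}") for s in range(20) for i in range(3)]
splits = split_by_subject(records)
```

Splitting on subjects rather than images means that near-duplicate photographs of the same lesion cannot appear on both sides of the train/test boundary.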
Algorithm Training (ICD Category Distribution)
Pre-processing
Input images were resized to the model's required input dimensions. During training, a rich data augmentation pipeline was applied, including random cropping (guided by annotated bounding boxes where available), rotations, and various pixel transformations (color jittering, histogram equalization, etc.) to increase the diversity of the training data and improve model generalization. No augmentations were applied to the test data.
Design, Training, and Tuning
- Architecture: The selected model is a Vision Transformer (ViT), a state-of-the-art architecture for image recognition.
- Training: The model was trained using transfer learning, initializing with weights pre-trained on a large-scale natural image dataset. The training process utilized the Adam optimizer, a cross-entropy loss function, and a one-cycle learning rate policy to enable super-convergence. Progress was monitored on the validation set to prevent overfitting, using early stopping if performance plateaued.
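The shape of a one-cycle learning rate policy can be sketched in a few lines. This is a simplified illustration and not the training configuration used for the device: the peak learning rate, the warm-up fraction, the two divisors, and the choice of cosine interpolation are all assumed values.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-3, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given step of a one-cycle schedule.

    All default values are illustrative assumptions.
    """
    initial_lr = max_lr / div_factor          # starting learning rate
    min_lr = initial_lr / final_div_factor    # learning rate at the end
    warmup_steps = round(pct_start * total_steps)

    if step < warmup_steps:
        # Phase 1: ramp up from initial_lr to max_lr.
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        # Phase 2: anneal down from max_lr to min_lr.
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        start, end = max_lr, min_lr

    # Cosine interpolation between start (progress=0) and end (progress=1).
    return end + (start - end) * (1 + math.cos(math.pi * progress)) / 2

schedule = [one_cycle_lr(step, total_steps=100) for step in range(101)]
```

In this sketch the rate climbs to its peak after 30% of the steps and then decays to a value far below the starting rate, which is the single rise-and-fall cycle the policy is named after.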
Post-processing
Two key post-processing steps were implemented to enhance performance and reliability:
- Model Calibration: Temperature scaling was applied to the model's raw outputs. This technique adjusts the softmax function to produce better-calibrated probability distributions, ensuring that the model's confidence scores are more reliable.
- Test-Time Augmentation (TTA): During inference, multiple augmented versions of the input image are created and passed through the model. The resulting probability distributions are then averaged to produce a single, more robust final prediction.
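The two post-processing steps can be combined as in the following minimal sketch. The toy model, the choice of augmentations (identity and horizontal flip), and the temperature value are hypothetical stand-ins; the report does not specify the fitted temperature or the augmentations used at inference.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_tta(model, image, augmentations, temperature):
    """Average the calibrated distributions over augmented copies of the image."""
    distributions = [
        softmax(model(augment(image)), temperature) for augment in augmentations
    ]
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]

def toy_model(image):
    """Hypothetical stand-in for the network: two logits from edge columns."""
    left = sum(row[0] for row in image)
    right = sum(row[-1] for row in image)
    return [float(left), float(right)]

def identity(image):
    return image

def horizontal_flip(image):
    return [row[::-1] for row in image]

image = [[2, 0], [1, 0]]  # tiny 2x2 "image" for illustration
probs = predict_with_tta(toy_model, image, [identity, horizontal_flip],
                         temperature=1.5)
```

The toy model is deliberately sensitive to left/right orientation, so averaging over the flipped copy cancels that sensitivity and yields an even split between the two classes.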
Algorithm Performance Evaluation/Testing
The final, selected algorithm package was evaluated on the sequestered, held-out test set, which was not used at any point during training or model selection.
ICD Category Distribution Performance
The model's ability to correctly identify the ground truth diagnosis was assessed using Top-k accuracy. The results below show that the algorithm exceeded all predefined success criteria.
Metric | Result | Success Criterion | Outcome |
---|---|---|---|
Top-1 Accuracy | 74% | ≥ 55% | PASS |
Top-3 Accuracy | 86% | ≥ 70% | PASS |
Top-5 Accuracy | 90% | ≥ 80% | PASS |
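For reference, Top-k accuracy counts a prediction as correct when the ground truth label is among the k categories with the highest predicted probability. A minimal sketch follows; the function name and the example data are illustrative and are not taken from the test set.

```python
def top_k_accuracy(prob_rows, true_labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(prob_rows, true_labels):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(true_labels)

# Illustrative predictions for four samples over four classes.
prob_rows = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.50, 0.30, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.15, 0.50, 0.10],
]
true_labels = [0, 2, 3, 2]

top1 = top_k_accuracy(prob_rows, true_labels, k=1)
top2 = top_k_accuracy(prob_rows, true_labels, k=2)
top3 = top_k_accuracy(prob_rows, true_labels, k=3)
```

By construction the metric can only stay equal or increase as k grows, which is why the Top-5 criterion is the highest of the three thresholds.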
Binary Indicator Performance
The performance of the derived binary indicators was evaluated using the Area Under the ROC Curve (AUC). The ground truth for this evaluation was determined by applying the expert-defined mapping matrix to the ground truth ICD-11 labels of the test set. The results show that all three indicators achieved AUC values between 0.94 and 0.99, well above the 0.80 acceptance threshold.
Indicator | Result (AUC) | Success Criterion | Outcome |
---|---|---|---|
Malignancy | 0.96 | ≥ 0.80 | PASS |
Critical Complexity | 0.94 | ≥ 0.80 | PASS |
Dermatological Condition | 0.99 | ≥ 0.80 | PASS |
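The evaluation procedure for a single indicator can be sketched as follows: the binary ground truth is obtained by checking whether each true category maps to the indicator, and the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The category indices, scores, and function name are hypothetical, and the pairwise formulation is chosen for readability rather than speed.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical mapping: only category index 0 counts as malignant.
malignant_categories = {0}

# Ground truth category per test image, and the indicator score
# (sum of malignant-category probabilities) the algorithm produced.
true_categories = [0, 1, 0, 2, 1]
malignancy_scores = [0.9, 0.2, 0.4, 0.5, 0.1]

# Binary ground truth derived from the mapping, as described above.
ground_truth = [int(c in malignant_categories) for c in true_categories]
auc = roc_auc(malignancy_scores, ground_truth)
```

In this toy example five of the six positive/negative pairs are ordered correctly, so the AUC is 5/6.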
Bias Analysis
An analysis was conducted on the external Diverse Dermatology Images (DDI) dataset to assess performance across different Fitzpatrick skin types. Initial results were consistent with published findings for other devices. However, after manually cropping the images to focus on the region of interest, performance improved across all groups, with the overall AUC for malignancy detection rising from 0.6510 to 0.7627. This result highlights how strongly image framing and quality affect performance.
Conclusion
The development and validation activities described in this report provide objective evidence that the AI/ML algorithms for Legit.Health Plus meet their predefined specifications and performance requirements.
The ICD Category Distribution algorithm exceeded all Top-k endpoints, with Top-1, Top-3, and Top-5 accuracies of 74%, 86%, and 90% against thresholds of 55%, 70%, and 80%. The derived Binary Indicators achieved AUC scores between 0.94 and 0.99 against a threshold of 0.80.
The development process adhered to the company's QMS and followed Good Machine Learning Practices. The final algorithm package is considered verified, validated, and suitable for release and integration into the Legit.Health Plus medical device.
AI/ML Risks Assessment Report
AI/ML Risk Assessment
A comprehensive risk assessment was conducted throughout the development lifecycle in accordance with the R-TF-028-002 AI/ML Development Plan. All identified AI/ML-specific risks related to data, model training, and performance were documented and analyzed in the R-TF-028-011 AI/ML Risk Matrix.
AI/ML Risk Treatment
Control measures were implemented to mitigate all identified risks. Key controls included:
- Rigorous data curation and multi-source collection to mitigate bias.
- Systematic model training and validation procedures to prevent overfitting.
- Use of a sequestered test set to ensure unbiased performance evaluation.
- Implementation of model calibration to improve the reliability of outputs.
Residual AI/ML Risk Assessment
After the implementation of control measures, a residual risk analysis was performed. All identified AI/ML risks were successfully reduced to an acceptable level.
AI/ML Risk and Traceability with Safety Risk
Safety risks related to the AI/ML algorithms (e.g., incorrect diagnosis suggestion, misinterpretation of data) were identified and traced back to their root causes in the AI/ML development process. These safety risks have been escalated for management in the overall device Safety Risk Matrix, in line with ISO 14971.
Conclusion
The AI/ML development process has successfully managed and mitigated inherent risks to an acceptable level. The benefits of using the Legit.Health Plus algorithms as a clinical decision support tool are judged to outweigh the residual risks.
Related Documents
- Project Design and Plan
  - R-TF-028-001 AI/ML Description
  - R-TF-028-002 AI/ML Development Plan
  - R-TF-028-011 AI/ML Risk Matrix
- Data Collection and Annotation
  - R-TF-028-003 Data Collection Instructions - Public Datasets and Atlases
  - R-TF-028-004 Data Collection Instructions - Prospective Clinical Study (H.U. Torrejón)
  - R-TF-028-004 Data Annotation Instructions - Binary Indicator Mapping
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001