R-TF-028-005 AI/ML Development Report

Table of contents

Introduction
Data Management
Algorithm Training (ICD Category Distribution)
Algorithm Performance Evaluation/Testing
Conclusion
AI/ML Risks Assessment Report
Related Documents

Introduction

Context

This report documents the development, verification, and validation of the AI/ML algorithm package for the Legit.Health Plus medical device. The development process was conducted in accordance with the procedures outlined in GP-028 AI Development and followed the methodologies specified in the R-TF-028-002 AI/ML Development Plan.

The algorithms are designed as offline (static) models. They were trained on a fixed dataset prior to release and do not adapt or learn from new data after deployment. This ensures predictable and consistent performance in the clinical environment.

Algorithms Description

The algorithm package consists of two core components that work in sequence to fulfill User Requirement REQ_004:

ICD Category Distribution Algorithm: A deep learning model, based on a Vision Transformer (ViT) architecture, that analyzes a given dermatological image (clinical or dermoscopic). Its output is a normalized probability distribution across [NUMBER OF CATEGORIES] relevant ICD-11 categories. For the user, this is presented as the top five most likely diagnoses.
Binary Indicator Algorithms: These are not separate trained models but a set of three indicators derived directly from the output of the ICD Category Distribution algorithm. A predefined, expert-curated mapping matrix (defined in R-TF-028-004) assigns each of the [NUMBER OF CATEGORIES] ICD-11 categories to one or more indicators. The final value for each indicator is calculated by summing the probabilities of all associated ICD-11 categories. The three indicators are:

Malignancy
Critical Complexity
Dermatological Condition
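
The derivation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the device implementation: the category names, probabilities, and mapping entries are made-up placeholders, and the function name `binary_indicators` is hypothetical. The actual mapping matrix is the one defined in R-TF-028-004.

```python
# Indicator names follow the three indicators listed in this report.
INDICATORS = ("malignancy", "critical_complexity", "dermatological_condition")

# Hypothetical mapping: each category lists the indicators it contributes to.
# These entries are placeholders, not the expert-curated matrix.
EXAMPLE_MAPPING = {
    "category_a": {"malignancy", "dermatological_condition"},
    "category_b": {"dermatological_condition"},
    "category_c": set(),  # a category mapped to no indicator
}


def binary_indicators(distribution, mapping):
    """Sum the probabilities of all categories associated with each indicator."""
    scores = {name: 0.0 for name in INDICATORS}
    for category, probability in distribution.items():
        for indicator in mapping.get(category, ()):
            scores[indicator] += probability
    return scores


# Placeholder probability distribution over three categories (sums to 1).
example_distribution = {"category_a": 0.5, "category_b": 0.25, "category_c": 0.25}
example_scores = binary_indicators(example_distribution, EXAMPLE_MAPPING)
```

Because the input distribution is normalized and each category contributes at most once to a given indicator, every indicator value computed this way lies between 0 and 1.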

AI/ML Standalone Evaluation Objectives

The standalone validation aimed to confirm that the final algorithms meet the predefined performance criteria outlined in R-TF-028-001.

The primary objectives and endpoints were:

Algorithm	Objective	Endpoints	Success Criteria
ICD Category Distribution	Provide an accurate differential diagnosis suggestion.	Top-1, Top-3, and Top-5 Accuracy on a held-out test set.	Top-1 Accuracy `≥ 55%`; Top-3 Accuracy `≥ 70%`; Top-5 Accuracy `≥ 80%`
Binary Indicators	Provide reliable signals for case prioritization and assessment.	Area Under the ROC Curve (AUC) on a held-out test set.	AUC `≥ 0.80` for each of the three indicators.

Data Management

Collection

The dataset was compiled from two distinct sources, one retrospective and one prospective, as detailed in the respective data collection instructions:

Public Datasets (R-TF-028-003): Images sourced retrospectively from public datasets and reputable online dermatological atlases (e.g., DermNet NZ, ISIC, PAD-UFES-20).
Prospective Clinical Study (R-TF-028-004): Images collected under a formal protocol at the Hospital Universitario de Torrejón.

This combined approach resulted in a total dataset of [NUMBER OF IMAGES] RGB images, covering nearly 1,000 different initial categories and a diverse representation of age, sex, and skin phototypes.

Annotation, Truthing, and Consensus

ICD-11 Labels: The primary diagnostic labels were sourced directly from the datasets, having been provided by medical experts. A thorough curation process was undertaken to standardize all taxonomies to the ICD-11 classification system.
Binary Indicator Mapping: The ground truth for the binary indicators was established by creating a mapping matrix, as detailed in R-TF-028-004. This process involved a board-certified dermatologist assigning each of the final [NUMBER OF CATEGORIES] ICD-11 categories to the three indicators, followed by an independent review and consensus process.

Preparation and Partitioning

The final dataset was partitioned into three distinct sets: training, validation, and testing. To prevent data leakage and ensure an unbiased final evaluation, the split was performed at the subject level where subject IDs were available. For sources without subject IDs, a class-wise split was performed.

Crucially, some data sources, including the entire prospective clinical study dataset from H.U. Torrejón, were sequestered and reserved exclusively for the final test set.

Set	Purpose	Image Count
Training	Model fitting and parameter updates.	[Insert Number]
Validation	Hyperparameter tuning and model selection.	[Insert Number]
Test	Final, unbiased performance evaluation.	[Insert Number]
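
A subject-level split of the kind described above could look like the following sketch. It is illustrative only: the function name `split_by_subject`, the record layout of `(image_id, subject_id)` pairs, the seed, and the 70/15/15 fractions are all assumptions, since this report does not state the proportions or tooling that were used.

```python
import random
from collections import defaultdict


def split_by_subject(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole subjects to one partition so no subject spans two sets.

    `records` is an iterable of (image_id, subject_id) pairs. The fractions
    are placeholders, not the proportions used for the device.
    """
    images_by_subject = defaultdict(list)
    for image_id, subject_id in records:
        images_by_subject[subject_id].append(image_id)

    # Shuffle subjects (not images) with a fixed seed for reproducibility.
    subjects = sorted(images_by_subject)
    random.Random(seed).shuffle(subjects)

    n_train = int(fractions[0] * len(subjects))
    n_val = int(fractions[1] * len(subjects))
    groups = {
        "training": subjects[:n_train],
        "validation": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }
    return {
        name: [img for subject in members for img in images_by_subject[subject]]
        for name, members in groups.items()
    }


# Hypothetical records: 40 images from 10 subjects.
example_records = [(f"img_{i}", f"subject_{i % 10}") for i in range(40)]
example_split = split_by_subject(example_records)
```

Splitting on subjects rather than images is what prevents leakage: near-duplicate images of the same lesion or patient cannot appear on both sides of the train/test boundary.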

Algorithm Training (ICD Category Distribution)

Pre-processing

Input images were resized to the model's required input dimensions. During training, a data augmentation pipeline was applied, including random cropping (guided by annotated bounding boxes where available), rotations, and various pixel transformations (color jittering, histogram equalization, etc.) to increase the diversity of the training data and improve model generalization. No training augmentations were applied to the test data; the test-time augmentation described under Post-processing is a separate inference-time step.

Design, Training, and Tuning

Architecture: The selected model is a Vision Transformer (ViT), a state-of-the-art architecture for image recognition.
Training: The model was trained using transfer learning, initializing with weights pre-trained on a large-scale natural image dataset. The training process utilized the Adam optimizer, a cross-entropy loss function, and a one-cycle learning rate policy to enable super-convergence. Progress was monitored on the validation set to prevent overfitting, using early stopping if performance plateaued.
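
For readers unfamiliar with the one-cycle policy, the sketch below shows one common form of the schedule: the learning rate rises from a low starting value to a peak and then decays to a much lower final value. The cosine shape, the function names, and every default value (`warmup_fraction`, `start_div`, `final_div`) are assumptions chosen for illustration; this report does not specify the hyperparameters that were used.

```python
import math


def _cosine_interp(start, end, fraction):
    """Move from `start` (fraction=0) to `end` (fraction=1) along a half cosine."""
    return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2


def one_cycle_lr(step, total_steps, max_lr,
                 warmup_fraction=0.3, start_div=25.0, final_div=1e4):
    """Learning rate at `step` for a single warm-up/decay cycle (illustrative)."""
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    start_lr = max_lr / start_div
    final_lr = start_lr / final_div
    warmup_steps = max(1, int(warmup_fraction * total_steps))
    if step < warmup_steps:
        # Warm-up phase: start_lr -> max_lr
        return _cosine_interp(start_lr, max_lr, step / warmup_steps)
    # Decay phase: max_lr -> final_lr
    decay_steps = total_steps - warmup_steps
    fraction = (step - warmup_steps) / max(1, decay_steps - 1)
    return _cosine_interp(max_lr, final_lr, fraction)


# Hypothetical 100-step run with a placeholder peak learning rate.
example_schedule = [one_cycle_lr(s, 100, 1e-3) for s in range(100)]
```

In this sketch the peak is reached at the end of the warm-up phase, and the final learning rate is several orders of magnitude below the peak.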

Post-processing

Two key post-processing steps were implemented to enhance performance and reliability:

Model Calibration: Temperature scaling was applied to the model's raw outputs. This technique divides the logits by a single scalar temperature before the softmax function is applied, which leaves the ranking of the categories unchanged while producing better-calibrated probability distributions, so that the model's confidence scores are more reliable.
Test-Time Augmentation (TTA): During inference, multiple augmented versions of the input image are created and passed through the model. The resulting probability distributions are then averaged to produce a single, more robust final prediction.
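
The two post-processing steps can be combined as in the following sketch. It is a simplified illustration, not the device code: the logits, the temperature value, and the function names are hypothetical, and in practice the temperature is typically fitted on held-out validation data rather than set by hand.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def average_distributions(distributions):
    """Element-wise mean of several probability distributions (TTA step)."""
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]


# Hypothetical logits for three augmented views of the same image.
example_logits_per_view = [[2.0, 1.0, 0.0], [1.5, 1.2, 0.1], [2.2, 0.5, 0.3]]
example_temperature = 1.5  # placeholder value

example_prediction = average_distributions(
    [softmax_with_temperature(logits, example_temperature)
     for logits in example_logits_per_view]
)
```

A temperature above 1 flattens each distribution and a temperature below 1 sharpens it. Averaging normalized distributions yields another normalized distribution, so the result can be passed on unchanged.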

Algorithm Performance Evaluation/Testing

The final, selected algorithm package was evaluated on the sequestered, held-out test set, which was not used at any point during training or model selection.

ICD Category Distribution Performance

The model's ability to correctly identify the ground truth diagnosis was assessed using Top-k accuracy, that is, the proportion of test images whose ground truth category appears among the k most probable categories. The results below show that the algorithm met all predefined success criteria.

Metric	Result	Success Criterion	Outcome
Top-1 Accuracy	74%	`≥ 55%`	PASS
Top-3 Accuracy	86%	`≥ 70%`	PASS
Top-5 Accuracy	90%	`≥ 80%`	PASS
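
As a reference for how the metric is defined, Top-k accuracy can be computed as in the sketch below. The distributions and labels are invented toy values used only to show the calculation; they are unrelated to the results in the table above, and the function name `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(distributions, true_labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for probabilities, label in zip(distributions, true_labels):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(probabilities)),
                        key=lambda index: probabilities[index],
                        reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy example: four samples, three classes.
example_distributions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.4, 0.35, 0.25],
]
example_labels = [0, 2, 0, 1]

example_top1 = top_k_accuracy(example_distributions, example_labels, k=1)
example_top2 = top_k_accuracy(example_distributions, example_labels, k=2)
```

By construction, Top-k accuracy never decreases as k grows, which is why the Top-5 threshold is the highest of the three.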

Binary Indicator Performance

The performance of the derived binary indicators was evaluated using the Area Under the ROC Curve (AUC). The ground truth for this evaluation was determined by applying the expert-defined mapping matrix to the ground truth ICD-11 labels of the test set. The results show that all three indicators exceeded the acceptance threshold of 0.80, with AUC values between 0.94 and 0.99.

Indicator	Result (AUC)	Success Criterion	Outcome
Malignancy	0.96	`≥ 0.80`	PASS
Critical Complexity	0.94	`≥ 0.80`	PASS
Dermatological Condition	0.99	`≥ 0.80`	PASS
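
The evaluation procedure described above can be sketched as follows. This is an illustration under stated assumptions: the category names, the set of categories treated as malignant, and the scores are toy placeholders, and the pairwise AUC formulation is chosen here for readability rather than taken from the evaluation code.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half a win. Quadratic in the number of samples, which is
    acceptable for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for positive in positives:
        for negative in negatives:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical mapping: categories assigned to the malignancy indicator.
HYPOTHETICAL_MALIGNANT = {"category_a"}

# Toy test set: ground truth categories and the indicator value per image.
example_true_categories = ["category_a", "category_b", "category_a",
                           "category_c", "category_b"]
example_malignancy_scores = [0.9, 0.2, 0.4, 0.5, 0.1]

# Binary ground truth obtained by applying the mapping to the true labels.
example_binary_labels = [1 if category in HYPOTHETICAL_MALIGNANT else 0
                         for category in example_true_categories]
example_auc = roc_auc(example_malignancy_scores, example_binary_labels)
```

In the toy data, five of the six positive/negative pairs are ordered correctly, so the resulting AUC is 5/6.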

Bias Analysis

An analysis was conducted on the external Diverse Dermatology Images (DDI) dataset to assess performance across different Fitzpatrick skin types. Initial results were consistent with published findings for other devices. After the images were manually cropped to the region of interest, performance improved across all groups, with the overall AUC for malignancy detection rising from 0.6510 to 0.7627. This highlights the critical impact of image framing and quality on performance.

Conclusion

The development and validation activities described in this report provide objective evidence that the AI/ML algorithms for Legit.Health Plus meet their predefined specifications and performance requirements.

The ICD Category Distribution algorithm exceeded all Top-k endpoints, reaching 74%, 86%, and 90% Top-1, Top-3, and Top-5 accuracy against thresholds of 55%, 70%, and 80%. The derived Binary Indicators achieved AUC values between 0.94 and 0.99, above the 0.80 acceptance threshold.

The development process adhered to the company's QMS and followed Good Machine Learning Practices. The final algorithm package is considered verified, validated, and suitable for release and integration into the Legit.Health Plus medical device.

AI/ML Risks Assessment Report

AI/ML Risk Assessment

A comprehensive risk assessment was conducted throughout the development lifecycle in accordance with the R-TF-028-002 AI/ML Development Plan. All identified AI/ML-specific risks related to data, model training, and performance were documented and analyzed in the R-TF-028-011 AI/ML Risk Matrix.

AI/ML Risk Treatment

Control measures were implemented to mitigate all identified risks. Key controls included:

Rigorous data curation and multi-source collection to mitigate bias.
Systematic model training and validation procedures to prevent overfitting.
Use of a sequestered test set to ensure unbiased performance evaluation.
Implementation of model calibration to improve the reliability of outputs.

Residual AI/ML Risk Assessment

After the implementation of control measures, a residual risk analysis was performed. All identified AI/ML risks were successfully reduced to an acceptable level.

AI/ML Risk and Traceability with Safety Risk

Safety risks related to the AI/ML algorithms (e.g., incorrect diagnosis suggestion, misinterpretation of data) were identified and traced back to their root causes in the AI/ML development process. These safety risks have been escalated for management in the overall device Safety Risk Matrix, in line with ISO 14971.

Conclusion

The AI/ML development process has successfully managed and mitigated inherent risks to an acceptable level. The benefits of using the Legit.Health Plus algorithms as a clinical decision support tool are judged to outweigh the residual risks.

Related Documents

Project Design and Plan
- R-TF-028-001 AI/ML Description
- R-TF-028-002 AI/ML Development Plan
- R-TF-028-011 AI/ML Risk Matrix

Data Collection and Annotation
- R-TF-028-003 Data Collection Instructions - Public Datasets and Atlases
- R-TF-028-004 Data Collection Instructions - Prospective Clinical Study (H.U. Torrejón)
- R-TF-028-004 Data Annotation Instructions - Binary Indicator Mapping

Signature meaning

The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. For reference, the team members expected to take part in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:

Author: Team members involved
Reviewer: JD-003, JD-004
Approver: JD-001
