R-TF-028-005 AI Development Report
Table of contents
- Introduction
- Data Management
- Model Development and Validation
- ICD Category Distribution and Binary Indicators
- Erythema Intensity Quantification
- Desquamation Intensity Quantification
- Induration Intensity Quantification
- Pustule Intensity Quantification
- Crusting Intensity Quantification
- Xerosis Intensity Quantification
- Swelling Intensity Quantification
- Oozing Intensity Quantification
- Excoriation Intensity Quantification
- Lichenification Intensity Quantification
- Wound Characteristic Assessment
- Inflammatory Nodular Lesion Quantification
- Acneiform Lesion Type Quantification
- Hair Follicle Quantification
- Acneiform Inflammatory Lesion Quantification
- Hive Lesion Quantification
- Body Surface Segmentation
- Wound Surface Quantification
- Erythema Surface Quantification
- Hair Loss Surface Quantification
- Nail Lesion Surface Quantification
- Hypopigmentation or Depigmentation Surface Quantification
- Hyperpigmentation Surface Quantification
- Skin Surface Segmentation
- Follicular and Inflammatory Pattern Identification
- Inflammatory Nodular Lesion Pattern Identification
- Dermatology Image Quality Assessment (DIQA)
- Domain Validation
- Head Detection
- Summary and Conclusion
- State of the Art Compliance and Development Lifecycle
- Integration Verification Package
- Purpose
- Package Location and Structure
- Package Contents Per Model
- Acceptance Criteria
- Model-Specific Package Details
- Clinical Models - ICD Classification and Binary Indicators
- Clinical Models - Visual Sign Intensity Quantification
- Clinical Models - Wound Characteristic Assessment
- Clinical Models - Lesion Quantification
- Clinical Models - Surface Area Quantification
- Clinical Models - Pattern Identification
- Non-Clinical Models
- Verification Procedure for Software Integration Team
- Traceability
- AI Risks Assessment Report
- Related Documents
Introduction
Context
This report documents the development, verification, and validation of the AI algorithm package for the Legit.Health Plus medical device. The development process was conducted in accordance with the procedures outlined in GP-028 AI Development and followed the methodologies specified in the R-TF-028-002 AI Development Plan.
The algorithms are designed as offline (static) models. They were trained on a fixed dataset prior to release and do not adapt or learn from new data after deployment. This ensures predictable and consistent performance in the clinical environment.
Algorithms Description
The Legit.Health Plus device incorporates 59 AI models that work together to fulfill the device's intended purpose. A comprehensive description of all models, their clinical objectives, and performance specifications is provided in R-TF-028-001 AI/ML Description.
The AI algorithm package includes:
Clinical Models (directly fulfilling the intended purpose - 54 models):
- ICD Category Distribution and Binary Indicators (1 model): Provides an interpretative distribution of ICD-11 categories.
- Visual Sign Intensity Quantification Models (10 models): Quantify the intensity of clinical signs including erythema, desquamation, induration, pustule, crusting, xerosis, swelling, oozing, excoriation, and lichenification.
- Wound Characteristic Assessment (24 models): Evaluates wound tissue types, characteristics, exudate types, and perilesional conditions.
- Lesion Quantification Models (5 models):
- Inflammatory Nodular Lesion Quantification
- Acneiform Lesion Type Quantification
- Acneiform Inflammatory Lesion Quantification
- Hive Lesion Quantification
- Hair Follicle Quantification
- Surface Area Quantification Models (12 models):
- Erythema Surface Quantification
- Wound Bed Surface Quantification
- Angiogenesis and Granulation Tissue Surface Quantification
- Biofilm and Slough Surface Quantification
- Necrosis Surface Quantification
- Maceration Surface Quantification
- Orthopedic Material Surface Quantification
- Bone, Cartilage, or Tendon Surface Quantification
- Hair Loss Surface Quantification
- Nail Lesion Surface Quantification
- Hypopigmentation or Depigmentation Surface Quantification
- Hyperpigmentation Surface Quantification
- Pattern Identification Models (2 models):
- Follicular and Inflammatory Pattern Identification
- Inflammatory Nodular Lesion Pattern Identification
Non-Clinical Models (supporting proper functioning - 5 models):
- Domain Validation: Verifies that images are within the validated dermatology domain.
- Dermatology Image Quality Assessment (DIQA): Ensures image quality is suitable for analysis.
- Skin Surface Segmentation: Identifies skin regions for analysis.
- Body Surface Segmentation: Segments body surface for BSA calculations.
- Head Detection: Localizes heads for privacy and counting workflows.
Total: 54 Clinical Models + 5 Non-Clinical Models = 59 Models
This report focuses on the development methodology, data management processes, and validation results for all models. Each model shares a common data foundation but may require specific annotation procedures as detailed in the respective data annotation instructions.
AI Standalone Evaluation Objectives
The standalone validation aimed to confirm that all AI models meet their predefined performance criteria as outlined in R-TF-028-001 AI/ML Description.
Performance specifications and success criteria vary by model type and are detailed in the individual model sections of this report. All models were evaluated on independent, held-out test sets that were not used during training or model selection.
Data Management
Overview
The development of all AI models in the Legit.Health Plus device relies on a comprehensive dataset compiled from multiple sources and annotated through a multi-stage process. This section describes the general data management workflow that applies to all models, including collection, foundational annotation (ICD-11 mapping), and partitioning. Model-specific annotation procedures are detailed in the individual model sections.
Data Collection
The dataset was compiled from multiple distinct sources:
- Archive Data: Images sourced from reputable online sources and private institutions, as detailed in R-TF-028-003 Data Collection Instructions - Archive Data.
- Custom Gathered Data: Images collected under formal protocols at clinical sites, as detailed in R-TF-028-003 Data Collection Instructions - Custom Gathered Data.
This combined approach resulted in a comprehensive dataset covering diverse demographic characteristics (age, sex, Fitzpatrick skin types I-VI), anatomical sites, imaging conditions, and pathological conditions.
Dataset summary:
| Item | Value |
|---|---|
| Total ICD-11 categories | 850 |
| Total images | 280342 |
| Images of FST-1 | 89225 (31.83%) |
| Images of FST-2 | 91349 (32.58%) |
| Images of FST-3 | 59610 (21.26%) |
| Images of FST-4 | 23466 (8.37%) |
| Images of FST-5 | 11914 (4.25%) |
| Images of FST-6 | 4778 (1.70%) |
| Images of female | 52857 (18.85%) |
| Images of male | 55334 (19.74%) |
| Images of unspecified sex | 172151 (61.41%) |
| Images of Pediatric | 12829 (4.58%) |
| Images of Adult | 52694 (18.80%) |
| Images of Geriatric | 28350 (10.11%) |
| Images of unspecified age | 186469 (66.51%) |
Dataset inventory (excerpt):
| ID | Dataset Name | Type | Description | ICD-11 Mapping | Crops | Diff. Dx | Sex | Age |
|---|---|---|---|---|---|---|---|---|
| 1 | Torrejon-HCP-diverse-conditions | Multiple | Dataset of skin images by physicians with good photographic skills | ✓ Yes | Varies | ✓ | ✓ | ✓ |
| 2 | Abdominal-skin | Archive | Small dataset of abdominal pictures with segmentation masks for `Non-specific lesion` class | ✗ No | Yes (programmatic) | — | — | — |
| 3 | Basurto-Cruces-Melanoma | Custom gathered | Clinical validation study dataset (`MC EVCDAO 2019`) | ✓ Yes | Yes (in-house crops) | — | ✓ | ✓ |
| 4 | BI-GPP (batch 1) | Archive | Small set of GPP images from Boehringer Ingelheim (first batch) | ✓ Yes | No | — | — | — |
| 5 | BI-GPP (batch 2) | Archive | Large dataset of GPP images from Boehringer Ingelheim (second batch) | ✓ Yes | Yes (programmatic) | — | ✓ | ✓ |
| 6 | Chiesa-dataset | Archive | Sample of head and neck lesions (Medela et al., 2024) | ✓ Yes | Yes (in-house crops) | — | ◐ | ◐ |
| 7 | Figaro 1K | Archive | Hair style classification and segmentation dataset, repurposed for `Non-specific finding` | ✗ No | Yes (in-house crops) | — | — | — |
| 8 | Hand Gesture Recognition (HGR) | Archive | Small dataset of hands repurposed for non-specific images | ✗ No | Yes (programmatic) | — | — | — |
| 9 | IDEI 2024 (pigmented) | Archive | Prospective and retrospective studies at IDEI (DERMATIA project), pigmented lesions only | ✓ Yes | Yes (programmatic) | — | ✓ | ◐ |
| 10 | Manises-HS | Archive | Large collection of hidradenitis suppurativa images | ✗ No | Not yet | — | ✓ | ✓ |
| 11 | Nails segmentation | Archive | Small nail segmentation dataset repurposed for `non-specific lesion` | ✗ No | Yes (programmatic) | — | — | — |
| 12 | Non-specific lesion V2 | Archive | Small representative collection repurposed for `non-specific lesion` | ✗ No | Yes (programmatic) | — | — | — |
| 13 | Osakidetza-derivation | Archive | Clinical validation study dataset (`DAO Derivación O 2022`) | ✓ Yes | Yes (in-house crops) | ◐ | ✓ | ✓ |
| 14 | Ribera ulcers | Archive | Collection of ulcer images from Ribera Salud | ✗ No | Yes (from wound masks, not all) | — | — | — |
| 15 | Transient Biometrics Nails V1 | Archive | Biometric dataset of nail images | ✗ No | Yes (programmatic) | — | — | — |
| 16 | Transient Biometrics Nails V2 | Archive | Biometric dataset of nail images | ✗ No | No (close-ups) | — | — | — |
| 17 | WoundsDB | Archive | Small chronic wounds database | ✓ Yes | No | — | ✓ | ◐ |
| 18 | Clinica Dermatologica Internacional - Acne | Custom gathered | Compilation of images from CDI's acne patients with IGA labels | ✓ Yes | No | — | — | — |
| 19 | Manises-DX | Archive | Large collection of images of different skin conditions | ✓ Yes | Not yet | — | — | — |
Total datasets: 55 | With ICD-11 mapping: 41
Legend: ✓ = Yes | ◐ = Partial/Pending | — = No
Foundational Annotation: ICD-11 Mapping
Before any model-specific training could begin, all clinical labels across all data sources were standardized to the ICD-11 classification system. This foundational annotation step is required for all models and is detailed in R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping.
The ICD-11 mapping process involved:
- Label Extraction: Extracting all unique clinical labels from each data source
- Standardization: Mapping source-specific labels (abbreviations, alternative spellings, legacy coding systems) to standardized ICD-11 categories
- Clinical Validation: Expert dermatologist review and validation of all mappings
- Visible Category Consolidation: Grouping ICD-11 codes that cannot be reliably distinguished based on visual features alone into unified "Visible ICD-11" categories. To handle images with no visible skin conditions (i.e. "clear" skin), a new `Non-specific finding` category was created; it is the only category without an associated ICD-11 code. The standardization is illustrated schematically below.
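For illustration only, a minimal sketch of how such standardized mappings might be represented in code. The entries below are hypothetical examples; the authoritative mappings are the dermatologist-validated ones in R-TF-028-004.

```python
# Hypothetical mapping entries for illustration only; real mappings are
# dermatologist-validated (R-TF-028-004). ICD-11 codes are examples.
LABEL_TO_VISIBLE_ICD11 = {
    "ad": "EA80",               # source abbreviation -> atopic eczema
    "atopic eczema": "EA80",    # alternative spelling, same category
    "psoriasis vulgaris": "EA90",
    "clear skin": "Non-specific finding",  # the only category with no ICD-11 code
}

def standardize(source_label: str) -> str:
    """Map a source-specific label to its Visible ICD-11 category."""
    return LABEL_TO_VISIBLE_ICD11[source_label.strip().lower()]
```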
This standardization ensures:
- Consistent reference standard across all data sources.
- Clinical validity and regulatory compliance (ICD-11 is the WHO standard).
- Proper handling of visually similar conditions that require additional clinical information for differentiation.
- A unified clinical vocabulary for the ICD Category Distribution model and all other clinical models.
Model Development and Validation
This section details the development, training, and validation of all AI models in the Legit.Health Plus device. Each model subsection includes:
- Model-specific data annotation requirements
- Training methodology and architecture
- Performance evaluation results
- Bias analysis and fairness considerations
ICD Category Distribution and Binary Indicators
Model Overview
Reference: R-TF-028-001 AI/ML Description - ICD Category Distribution and Binary Indicators section
The ICD Category Distribution model is a deep learning classifier that outputs a probability distribution across ICD-11 disease categories. The Binary Indicators are derived from this distribution using an expert-curated mapping matrix.
Models included:
- ICD Category Distribution (outputs top-5 conditions with probabilities)
- Binary Indicators (6 derived indicators):
- Malignant
- Pre-malignant
- Associated with malignancy
- Pigmented lesion
- Urgent referral (≤48h)
- High-priority referral (≤2 weeks)
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping (as described in R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping)
Binary Indicator Mapping: A dermatologist-validated mapping matrix was created to link each ICD-11 category to the six binary indicators. This mapping defines which disease categories contribute to each indicator (e.g., melanoma, squamous cell carcinoma, and basal cell carcinoma all contribute to the "Malignant" indicator). A complete explanation of Binary Indicator Mapping can be found in R-TF-028-004 Data Annotation Instructions - Binary Indicator Mapping.
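As a minimal sketch of how this derivation works, assume the expert-curated mapping is stored as a binary category-by-indicator matrix and that each indicator score aggregates the probability mass of its contributing categories (the actual aggregation rule is defined in R-TF-028-004); the derivation then reduces to a matrix-vector product:

```python
import numpy as np

INDICATORS = [
    "malignant", "pre_malignant", "associated_with_malignancy",
    "pigmented_lesion", "urgent_referral", "high_priority_referral",
]

def binary_indicator_scores(icd_probs: np.ndarray, mapping: np.ndarray) -> dict:
    """Derive the six binary indicator scores from the ICD-11 distribution.

    icd_probs: shape (C,), calibrated softmax output over C categories.
    mapping:   shape (C, 6), mapping[c, k] = 1 if category c contributes
               to indicator k (expert-curated, dermatologist-validated).
    """
    scores = icd_probs @ mapping  # aggregated probability mass per indicator
    return dict(zip(INDICATORS, scores.tolist()))

# Toy usage: 3 categories, of which only category 0 is malignant.
probs = np.array([0.7, 0.2, 0.1])
mapping = np.zeros((3, 6)); mapping[0, 0] = 1
print(binary_indicator_scores(probs, mapping)["malignant"])  # 0.7
```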
The result of the foundational annotation and binary indicator mapping is LegitHealth-DX, a dataset with high variability in category frequency (i.e. some categories have many more images than others). For each category, we split the corresponding images into a training, a validation, and a test set.
In addition to the ICD-11 and binary indicator mapping, we performed an extra annotation step to localize the skin condition in each image by drawing one or more bounding boxes enclosing the visible condition. This step was motivated by the use of random cropping during data augmentation: although random cropping is a common technique for increasing training diversity, in this scenario it carries a high risk of producing crops that do not contain the actual skin condition, leading to unreliable model learning. By using these manually annotated boxes, we ensure that random crops are sampled from within one or more of the annotated regions (see the sketch below).
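A minimal sketch of this box-guided cropping, assuming boxes are stored as pixel coordinates and the image is at least as large as the crop window (function and parameter names are illustrative):

```python
import random
from PIL import Image

def bbox_guided_random_crop(img: Image.Image, boxes: list, size: int = 384) -> Image.Image:
    """Random crop constrained to the annotated lesion boxes.

    boxes: list of (x0, y0, x1, y1) pixel boxes enclosing the visible
    condition. A box is picked at random and the crop window is centred
    on a random point inside it, so augmented crops keep the lesion in
    view instead of drifting onto background skin or clothing.
    """
    x0, y0, x1, y1 = random.choice(boxes)
    cx, cy = random.uniform(x0, x1), random.uniform(y0, y1)
    left = int(min(max(cx - size / 2, 0), img.width - size))
    top = int(min(max(cy - size / 2, 0), img.height - size))
    return img.crop((left, top, left + size, top + size))
```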
Finally, to ensure reliable performance of the ICD Category Distribution and Binary Indicators model, we only used the classes from LegitHealth-DX that contain more than 3 images in every split (training/validation/test).
Dataset statistics:
| Item | Value |
|---|---|
| Total ICD-11 categories | 850 |
| Total images | 280342 |
| Clinical images | 194186 (69.27%) |
| Dermoscopic images | 86156 (30.73%) |
| Selected ICD-11 categories | 346 |
| Selected total images | 277415 (98.96%) |
| Images with annotated ROIs | 81451 (29.05%) |
| Training images | 193686 (69.09%) |
| Validation images | 48047 (17.14%) |
| Test images | 35726 (12.74%) |
Training Methodology
Pre-processing:
- Data augmentation during training: bounding-box guided transformations (random erasing, random cropping), random rotations, color jittering, Gaussian noise, random Gaussian and motion blur, and histogram equalization (CLAHE). We also simulated domain-specific artifacts (dermoscopy shadows, ruler marks, and color patches) so that the model learns to be robust to them (an approximate pipeline is sketched after this list).
- In all stages (training/validation/test), images were resized to 384x384 to fit the model's input requirements.
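For orientation, an approximate torchvision pipeline covering part of the list above; the bounding-box guided steps, Gaussian noise, motion blur, CLAHE, and artifact simulation are custom operations not shown here, so this is a partial sketch rather than the actual training pipeline:

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomRotation(degrees=30),                     # random rotations
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.1, hue=0.02),          # color jittering
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # random Gaussian blur
    T.Resize((384, 384)),                             # model input size
    T.ToTensor(),
    T.RandomErasing(p=0.25),                          # random erasing (tensor op)
])
```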
Architecture: ConvNext-V2 (base), with transfer learning from large-scale pre-trained weights. This was chosen as the best performing architecture after comparing a baseline ResNet-50 to different architectures, namely: EfficientNet-V1, EfficientNet-V2, ConvNext-V2, ViT, and DenseNet.
Training:
- Optimizer: AdamW
- Loss function: Cross-entropy
- Learning rate: the optimal learning rate is determined by an automatic range test as proposed in Cyclical Learning Rates for Training Neural Networks (Smith, 2015), followed by a one-cycle policy for faster convergence (see the sketch after this list).
- Training duration: 50 epochs
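A runnable sketch of this learning-rate selection and scheduling, using toy stand-ins for the model and data (the production classifier is a ConvNext-V2 over the selected ICD-11 categories); the ramp factor and the minimum-loss heuristic are illustrative simplifications of the range test:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the sketch runs end to end.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 346))
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 346, (512,)))
train_loader = DataLoader(dataset, batch_size=64)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)

# Range test (Smith, 2015): ramp the learning rate exponentially across
# mini-batches while recording the loss; the usable LR lies in the region
# where the loss drops fastest (a simple minimum-loss heuristic here).
lrs, losses = [], []
for x, y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    losses.append(loss.item())
    for group in optimizer.param_groups:
        group["lr"] *= 1.5  # exponential ramp
max_lr = lrs[losses.index(min(losses))] / 10  # illustrative choice

# One-cycle policy over the full 50-epoch training run.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=max_lr, total_steps=50 * len(train_loader))
```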
Post-processing:
- Temperature scaling for probability calibration, as described in On Calibration of Modern Neural Networks (Guo et al., 2017)
- Test-time augmentation (TTA) for robust predictions: at inference time, the test image is augmented via rotation, horizontal and vertical flipping, and histogram equalization, and the predictions of the original image and its augmented views are aggregated to provide a final output.
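Both post-processing steps can be sketched together: `calibrate_temperature` fits the single temperature parameter on held-out validation logits as in Guo et al. (2017), and `predict_with_tta` averages calibrated probabilities over flip/rotation views (the histogram-equalization view used in production is omitted, and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def calibrate_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a single temperature T > 0 on held-out validation logits by
    minimising the negative log-likelihood (Guo et al., 2017)."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

def predict_with_tta(model, image: torch.Tensor, temperature: float) -> torch.Tensor:
    """Average calibrated class probabilities over augmented views
    (flips and 90-degree rotations; assumes a square input image)."""
    views = [image, torch.flip(image, dims=[-1]), torch.flip(image, dims=[-2])]
    views += [torch.rot90(image, k, dims=[-2, -1]) for k in (1, 2, 3)]
    probs = [F.softmax(model(v.unsqueeze(0)) / temperature, dim=-1) for v in views]
    return torch.stack(probs).mean(dim=0).squeeze(0)
```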
Performance Results
ICD Category Distribution Performance:
| Metric | Result | Success criterion | Outcome |
|---|---|---|---|
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | >= 0.50 | PASS |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | >= 0.60 | PASS |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | >= 0.70 | PASS |
Binary Indicator Performance:
| Indicator | Result | Success criterion | Outcome |
|---|---|---|---|
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | >= 0.80 | PASS |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | >= 0.80 | PASS |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | >= 0.80 | PASS |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | >= 0.80 | PASS |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | >= 0.80 | PASS |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | >= 0.80 | PASS |
Verification and Validation Protocol
Test Design:
- Held-out test set sequestered from training and validation
- Stratified sampling to ensure representation across ICD-11 categories
- Independent evaluation on external datasets, with special focus on skin tone diversity
Complete Test Protocol:
- Input: RGB images from the test set
- Output: ICD-11 probability distribution and binary indicator scores
- Reference standard comparison: Manually labeled ICD-11 categories and binary indicator mappings
- Statistical analysis: Top-k accuracy, AUC-ROC with 95% confidence intervals
Data Analysis Methods:
- Top-k accuracy calculation with bootstrapping (1000 runs) for confidence intervals
- ROC curve analysis and AUC calculation for binary indicators with bootstrap confidence intervals (1000 runs)
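A minimal sketch of the bootstrap procedure for the top-k metric (the AUC case is analogous, resampling prediction-label pairs):

```python
import numpy as np

def topk_accuracy_ci(probs: np.ndarray, labels: np.ndarray, k: int = 1,
                     n_boot: int = 1000, seed: int = 0):
    """Top-k accuracy with a 95% bootstrap percentile CI (1000 runs).

    probs:  (N, C) predicted class probabilities.
    labels: (N,) integer reference-standard classes.
    """
    topk = np.argsort(probs, axis=1)[:, -k:]        # k most probable classes
    hits = (topk == labels[:, None]).any(axis=1)
    rng = np.random.default_rng(seed)
    n = len(hits)
    stats = [hits[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return hits.mean(), (lo, hi)

# Toy usage: 1000 samples over 5 classes.
rng = np.random.default_rng(1)
p = rng.random((1000, 5)); p /= p.sum(axis=1, keepdims=True)
y = rng.integers(0, 5, 1000)
print(topk_accuracy_ci(p, y, k=3))
```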
Test Conclusions:
- The model met all success criteria, demonstrating reliable performance for both skin disease recognition and binary indicator prediction.
Bias Analysis and Fairness Evaluation
Objective: Evaluate model performance across demographic subpopulations to identify and mitigate potential biases that could affect clinical safety and effectiveness.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Performance metrics (Top-k accuracy, AUC) disaggregated by Fitzpatrick types I-II, III-IV, and V-VI
- Datasets: images from the hold-out test set with Fitzpatrick skin type annotations
2. Age Group Analysis:
- Stratification: Pediatric (under 18 years), Adult (18-65 years), Geriatric (over 65 years)
- Metrics: Top-k accuracy and AUC per age group
- Data sources: images from the hold-out test set with age metadata
3. Sex/Gender Analysis:
- Metrics: Top-k accuracy and AUC per sex group
- Data sources: images from the hold-out test set with sex metadata
4. Image type analysis:
- Performance metrics (Top-k accuracy, AUC) disaggregated by image type (clinical and dermoscopy)
- Data sources: images from the hold-out test set (grouped by image type metadata)
Bias Mitigation Strategies:
- Multi-source data collection ensuring visual diversity (demographics, acquisition settings, etc)
- Fitzpatrick skin type identification for bias monitoring
- Data augmentation targeting underrepresented subgroups
- Clinical validation with diverse patient populations
Results Summary:
1. Fitzpatrick Skin Type Analysis:
| Metric | overall | fst: I-II | fst: III-IV | fst: V-VI |
|---|---|---|---|---|
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.6855 (95% CI: [0.6799 - 0.6911]) | 0.6146 (95% CI: [0.6056 - 0.6237]) | 0.5350 (95% CI: [0.5135 - 0.5566]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.8501 (95% CI: [0.8459 - 0.8546]) | 0.7740 (95% CI: [0.7655 - 0.7818]) | 0.6937 (95% CI: [0.6737 - 0.7142]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.8912 (95% CI: [0.8874 - 0.8950]) | 0.8221 (95% CI: [0.8146 - 0.8295]) | 0.7457 (95% CI: [0.7260 - 0.7654]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.9180 (95% CI: [0.9129 - 0.9227]) | 0.9194 (95% CI: [0.9101 - 0.9280]) | 0.8364 (95% CI: [0.7937 - 0.8771]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.8820 (95% CI: [0.8746 - 0.8892]) | 0.8786 (95% CI: [0.8676 - 0.8900]) | 0.8011 (95% CI: [0.7631 - 0.8399]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8622 (95% CI: [0.8537 - 0.8703]) | 0.8646 (95% CI: [0.8498 - 0.8791]) | 0.8579 (95% CI: [0.8261 - 0.8858]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9594 (95% CI: [0.9557 - 0.9629]) | 0.9441 (95% CI: [0.9395 - 0.9488]) | 0.9059 (95% CI: [0.8874 - 0.9239]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.9129 (95% CI: [0.8987 - 0.9256]) | 0.8843 (95% CI: [0.8684 - 0.9000]) | 0.8268 (95% CI: [0.7847 - 0.8648]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.8900 (95% CI: [0.8851 - 0.8947]) | 0.8834 (95% CI: [0.8760 - 0.8907]) | 0.8546 (95% CI: [0.8330 - 0.8768]) |
2. Age Group Analysis:
| Metric | overall | age: 1-Pediatric | age: 2-Adult | age: 3-Geriatric |
|---|---|---|---|---|
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.8764 (95% CI: [0.8635 - 0.8895]) | 0.7104 (95% CI: [0.7017 - 0.7199]) | 0.6244 (95% CI: [0.6103 - 0.6371]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.9156 (95% CI: [0.9041 - 0.9262]) | 0.8583 (95% CI: [0.8517 - 0.8657]) | 0.8200 (95% CI: [0.8099 - 0.8297]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.9272 (95% CI: [0.9167 - 0.9375]) | 0.8980 (95% CI: [0.8922 - 0.9042]) | 0.8776 (95% CI: [0.8683 - 0.8864]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.7327 (95% CI: [0.5924 - 0.8706]) | 0.9104 (95% CI: [0.9022 - 0.9182]) | 0.8621 (95% CI: [0.8520 - 0.8726]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.9729 (95% CI: [0.9358 - 0.9941]) | 0.8935 (95% CI: [0.8766 - 0.9093]) | 0.8023 (95% CI: [0.7813 - 0.8230]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8142 (95% CI: [0.7204 - 0.8992]) | 0.8354 (95% CI: [0.8199 - 0.8499]) | 0.8368 (95% CI: [0.8228 - 0.8496]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9913 (95% CI: [0.9835 - 0.9971]) | 0.9847 (95% CI: [0.9808 - 0.9883]) | 0.9087 (95% CI: [0.8871 - 0.9284]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.9628 (95% CI: [0.9281 - 0.9833]) | 0.9002 (95% CI: [0.8755 - 0.9236]) | 0.8882 (95% CI: [0.8400 - 0.9306]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.9334 (95% CI: [0.9037 - 0.9574]) | 0.8834 (95% CI: [0.8753 - 0.8915]) | 0.8525 (95% CI: [0.8416 - 0.8633]) |
3. Sex/Gender Analysis:
| Metric | overall | sex: 1-male | sex: 2-female |
|---|---|---|---|
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.7195 (95% CI: [0.7111 - 0.7290]) | 0.7143 (95% CI: [0.7049 - 0.7239]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.8625 (95% CI: [0.8560 - 0.8694]) | 0.8591 (95% CI: [0.8518 - 0.8665]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.9024 (95% CI: [0.8966 - 0.9083]) | 0.8988 (95% CI: [0.8924 - 0.9050]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.9214 (95% CI: [0.9147 - 0.9276]) | 0.9152 (95% CI: [0.9077 - 0.9228]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.8603 (95% CI: [0.8422 - 0.8777]) | 0.8973 (95% CI: [0.8828 - 0.9102]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8606 (95% CI: [0.8477 - 0.8727]) | 0.8485 (95% CI: [0.8351 - 0.8611]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9748 (95% CI: [0.9693 - 0.9802]) | 0.9871 (95% CI: [0.9839 - 0.9901]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.9149 (95% CI: [0.8855 - 0.9405]) | 0.8979 (95% CI: [0.8725 - 0.9231]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.9087 (95% CI: [0.9019 - 0.9153]) | 0.8915 (95% CI: [0.8839 - 0.8993]) |
4. Image type Analysis:
| Metric | overall | image-type: clinical | image-type: dermoscopic |
|---|---|---|---|
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.5985 (95% CI: [0.5923 - 0.6048]) | 0.7579 (95% CI: [0.7508 - 0.7648]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.7662 (95% CI: [0.7610 - 0.7717]) | 0.9126 (95% CI: [0.9078 - 0.9173]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.8173 (95% CI: [0.8126 - 0.8222]) | 0.9437 (95% CI: [0.9396 - 0.9473]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.9240 (95% CI: [0.9179 - 0.9301]) | 0.9079 (95% CI: [0.9015 - 0.9139]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.8814 (95% CI: [0.8737 - 0.8889]) | 0.8733 (95% CI: [0.8626 - 0.8840]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8636 (95% CI: [0.8545 - 0.8730]) | 0.8625 (95% CI: [0.8516 - 0.8723]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9420 (95% CI: [0.9389 - 0.9451]) | 0.8170 (95% CI: [0.7745 - 0.8543]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.8798 (95% CI: [0.8690 - 0.8905]) | 0.8214 (95% CI: [0.7242 - 0.9133]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.8878 (95% CI: [0.8827 - 0.8927]) | 0.8842 (95% CI: [0.8777 - 0.8909]) |
Bias Analysis Conclusion:
- In terms of image type, the model meets the expected performance goals, with particularly strong performance on dermoscopy images.
- The model meets the performance goals for all age groups, with exceptional classification performance on pediatric subjects. Binary indicator prediction performance is strong for all age groups, although the Malignant indicator shows a lower point estimate with a wide confidence interval in the pediatric subgroup, likely reflecting the small number of pediatric malignancy cases.
- The model meets the performance goals for all sexes, showing almost identical performance for both male and female subjects.
- In terms of Fitzpatrick skin types, the model meets the performance goals for binary indicator prediction for all skin tones. When it comes to ICD-11 condition classification, all performance thresholds are met, but the model shows a slightly degraded performance for dark skin tones (FST V-VI).
Erythema Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Erythema Intensity Quantification section
This model quantifies erythema (redness) intensity on an ordinal scale (0-9), outputting a probability distribution that is converted to a continuous severity score via weighted expected value calculation.
Clinical Significance: Erythema is a cardinal sign of inflammation in numerous dermatological conditions including psoriasis, atopic dermatitis, and other inflammatory dermatoses.
Data Requirements and Annotation
Model-specific annotation: Erythema intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with erythema intensity scores following standardized clinical scoring protocols (e.g., Clinician's Erythema Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with erythema annotations: 5557
- Training set: 90% of the erythema images plus 10% of healthy skin images
- Validation set: 10% of the erythema images
- Test set: 10% of the erythema images
- Annotations variability:
- Mean RMAE: 0.172
- 95% CI: [0.154, 0.191]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, eczema, contact dermatitis, hidradenitis suppurativa, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, and 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the extra computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, or ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and underperformed, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Decoder architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other decoder architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it better reflects the ordinal nature of the task.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
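A minimal sketch of this post-processing, with a toy logit vector as a usage example:

```python
import numpy as np

def severity_score(logits: np.ndarray) -> float:
    """Convert 10-class logits (intensity classes 0-9) into a continuous
    severity score: softmax, then the expected value sum(i * p_i)."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return float(np.dot(np.arange(10), p))

# A distribution peaked evenly between classes 4 and 5 yields 4.5.
print(severity_score(np.array([0, 0, 0, 1, 5, 5, 1, 0, 0, 0.0])))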
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 14% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.13 (0.119, 0.142) | 543 | ≤ 14% | PASS |
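The report does not spell out the RMAE formula. A standard formulation, consistent with the 0-9 scale and the percentage criteria used here, normalizes the mean absolute error by the scale range:

```latex
\mathrm{RMAE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{y}_i - y_i \rvert}{y_{\max} - y_{\min}},
\qquad y_{\max} - y_{\min} = 9
```

where \(\hat{y}_i\) is the model's continuous severity score and \(y_i\) the expert consensus for image \(i\). Under this reading, an RMAE of 0.13 corresponds to a mean absolute deviation of roughly 1.2 points on the 0-9 scale.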
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert erythema intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous erythema severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
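A minimal sketch of such a robustness check, assuming a callable that maps an image to a severity score and a set of sign-preserving transforms (the tolerance value is an illustrative placeholder, not the protocol's acceptance threshold):

```python
def robustness_check(score_fn, image, transforms, tol=0.5):
    """Verify score stability under sign-preserving transforms.

    score_fn:   callable image -> continuous severity score (0-9).
    transforms: callables image -> image (rotations, brightness/contrast
                adjustments, zoom, quality degradation) that do not alter
                the clinical sign's appearance.
    Returns True if every transformed score stays within `tol` of the
    original score.
    """
    base = score_fn(image)
    return all(abs(score_fn(t(image)) - base) <= tol for t in transforms)
```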
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.13 (95% CI: 0.119-0.142), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure erythema quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for erythema):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.124 (0.111, 0.141) | 293 | ≤ 14% | PASS |
| RMAE Fitzpatrick III-IV | 0.135 (0.12, 0.152) | 207 | ≤ 14% | PASS |
| RMAE Fitzpatrick V-VI | 0.142 (0.098, 0.191) | 43 | ≤ 14% | PASS |
| RMAE Mild Severity (0-3) | 0.149 (0.119, 0.183) | 98 | ≤ 14% | PASS |
| RMAE Moderate Severity (4-6) | 0.138 (0.124, 0.155) | 236 | ≤ 14% | PASS |
| RMAE Severe Severity (7-9) | 0.112 (0.095, 0.13) | 209 | ≤ 14% | PASS |
Bias Analysis Conclusion:
The erythema intensity quantification model demonstrates a high degree of clinical potential, with its performance benchmarked against a stringent RMAE criterion of ≤ 14% derived from inter-annotator variability. Crucially, the model's performance is strongest and most certain in the Severe Severity category, where both the mean (0.112) and the entire 95% CI (0.095, 0.13) lie definitively below the threshold, confirming statistically robust and highly precise quantification in critical cases. Furthermore, the mean RMAE values for the three largest subgroups, Fitzpatrick I-II (0.124), Fitzpatrick III-IV (0.135), and Moderate Severity (0.138), are all below the criterion, establishing a strong foundation of average accuracy across the primary populations. The lower CI bound of every subgroup, including the less represented Fitzpatrick V-VI (0.098) and Mild Severity (0.119) groups, falls below the target. This indicates that the model's performance is consistently comparable or superior to expert variability across all tested strata.
Desquamation Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Desquamation Intensity Quantification section
This model quantifies desquamation (scaling/peeling) intensity on an ordinal scale (0-9), critical for assessment of psoriasis, seborrheic dermatitis, and other scaling conditions.
Clinical Significance: Desquamation is a key indicator in many inflammatory dermatoses.
Data Requirements and Annotation
Model-specific annotation: Desquamation intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with desquamation intensity scores following standardized clinical scoring protocols (e.g., Clinician's Desquamation Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with desquamation annotations: 4879
- Training set: 90% of the desquamation images plus 10% of healthy skin images
- Validation set: 10% of the desquamation images
- Test set: 10% of the desquamation images
- Annotations variability:
- Mean RMAE: 0.202
- 95% CI: [0.178, 0.226]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, and 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the extra computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, or ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and underperformed, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Decoder architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other decoder architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it better reflects the ordinal nature of the task.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.153 (0.139, 0.167) | 475 | ≤ 17% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert desquamation intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous desquamation severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.153 (95% CI: 0.139-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure desquamation quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for desquamation):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
- Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.156 (0.136, 0.176) | 255 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.154 (0.131, 0.176) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.118 (0.077, 0.162) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.140 (0.121, 0.161) | 231 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.161 (0.134, 0.189) | 119 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.167 (0.139, 0.199) | 125 | ≤ 17% | PASS |
Bias Analysis Conclusion:
The desquamation quantification model demonstrates robust and highly reliable performance against the demanding RMAE criterion of ≤ 17% derived from inter-annotator variability. The critical criterion, that the 95% CI lower bound of the model's RMAE falls below 17%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The model also establishes excellent average accuracy, with the mean RMAE of every subgroup, including the largest Fitzpatrick I-II (0.156) and Mild Severity (0.140) cohorts, below the criterion. Notably, the mean RMAE for the Fitzpatrick V-VI group (0.118) is well below the criterion. This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and highly accurate quantification of desquamation across all demographic and severity ranges.
Induration Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Induration Intensity Quantification section
This model quantifies induration (plaque thickness/elevation) on an ordinal scale (0-9).
Clinical Significance: Induration reflects tissue infiltration and is a key component of psoriasis severity assessment.
Data Requirements and Annotation
Model-specific annotation: Induration intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with induration intensity scores following standardized clinical scoring protocols (e.g., Clinician's Induration Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with induration annotations: 4499
- Training set: 90% of the induration images plus 10% of healthy skin images
- Validation set: 10% of the induration images
- Test set: 10% of the induration images
- Annotations variability:
- Mean RMAE: 0.178
- 95% CI: [0.159, 0.199]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, and 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the extra computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, or ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and underperformed, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Decoder architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other decoder architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it better reflects the ordinal nature of the task.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.151 (0.137, 0.167) | 437 | ≤ 17% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert induration intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous induration severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.151 (95% CI: 0.137-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure induration quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for induration):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
- Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.130 (0.111, 0.148) | 217 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.178 (0.152, 0.204) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.141 (0.101, 0.189) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.138 (0.122, 0.156) | 256 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.176 (0.150, 0.204) | 120 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.158 (0.107, 0.219) | 61 | ≤ 17% | PASS |
Bias Analysis Conclusion:
The induration quantification model demonstrates robust and reliable performance against the demanding RMAE criterion of ≤ 17% derived from inter-annotator variability. The critical criterion, that the 95% CI lower bound of the model's RMAE falls below 17%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently comparable or superior to expert variability across the entire spectrum. Mean RMAE is below the criterion for the largest Fitzpatrick I-II (0.130) and Mild Severity (0.138) cohorts, as well as for the Fitzpatrick V-VI group (0.141). For the two subgroups with the highest means, Fitzpatrick III-IV (0.178) and Moderate Severity (0.176), the 95% CI lower bounds (0.152 and 0.150) still fall below the threshold. Overall, these results indicate that the model provides equitable and accurate quantification of induration across all demographic and severity ranges.
Pustule Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Pustule Intensity Quantification section
This model quantifies pustule intensity/density on an ordinal scale (0-9), critical for pustular psoriasis, acne, and other pustular dermatoses.
Clinical Significance: Pustules are a hallmark of pustular dermatoses; their density and intensity are key components of severity assessment in conditions such as pustular psoriasis and acne.
Data Requirements and Annotation
Model-specific annotation: Pustule intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with pustule intensity scores following standardized clinical scoring protocols (e.g., Clinician's Pustule Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with pustule annotations: 380
- Training set: 90% of the pustule images plus 10% of healthy skin images
- Validation set: 10% of the pustule images
- Test set: 10% of the pustule images
- Annotations variability:
- Mean RMAE: 0.300
- 95% CI: [0.191, 0.427]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272x272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution (a minimal sketch of this sampling strategy follows this list). Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs, by which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its initial value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the intensity score.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
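As referenced in the class imbalance item above, balanced sampling can be implemented in PyTorch with a WeightedRandomSampler that draws each image with probability inversely proportional to its class frequency. This is a minimal sketch under that assumption, not the device's actual data pipeline; all names and data are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Illustrative dataset: random images with ordinal intensity labels 0-9
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(torch.randn(1000, 3, 272, 272), labels)

# Inverse-frequency weights make every intensity level roughly equally likely per batch
class_counts = torch.bincount(labels, minlength=10).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```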
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
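A minimal sketch of these two post-processing steps, assuming a (batch, 10) tensor of logits; the function name is illustrative:

```python
import torch

def severity_score(logits: torch.Tensor) -> torch.Tensor:
    """Softmax over the 10 intensity classes, then the probability-weighted
    expected value of the class indices, giving a continuous 0-9 score."""
    probs = torch.softmax(logits, dim=-1)
    classes = torch.arange(10, dtype=probs.dtype)
    return (probs * classes).sum(dim=-1)

print(severity_score(torch.randn(4, 10)))  # four continuous scores in [0, 9]
```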
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 30% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.19 (0.123, 0.269) | 38 | ≤ 30% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert pustule intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous pustule severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.19 (95% CI: 0.123-0.269), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure pustule quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for pustule):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
   - Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.158 (0.09, 0.226) | 26 | ≤ 30% | PASS |
| RMAE Fitzpatrick III-IV | 0.259 (0.111, 0.426) | 12 | ≤ 30% | PASS |
| RMAE Fitzpatrick V-VI | - | 0 | ≤ 30% | N/A |
| RMAE Mild Severity (0-3) | 0.143 (0.016, 0.302) | 14 | ≤ 30% | PASS |
| RMAE Moderate Severity (4-6) | 0.222 (0.130, 0.296) | 6 | ≤ 30% | PASS |
| RMAE Severe Severity (7-9) | 0.216 (0.130, 0.309) | 18 | ≤ 30% | PASS |
Bias Analysis Conclusion:
The pustule quantification model exhibits strong performance against the success criterion of RMAE ≤ 30%, which is derived from the inter-annotator variability. The critical criterion, defined by the lower bound of the 95% confidence interval falling below 0.30, is achieved by all five subgroups with available data, confirming a foundational level of reliability and low initial bias: the model consistently achieves accuracy comparable to or better than expert variability across the tested strata. Notably, the mean RMAE values for the well-represented Fitzpatrick I-II (0.158) and Mild Severity (0.143) subgroups are substantially below the criterion, establishing excellent average accuracy in the primary populations. The current absence of data for the Fitzpatrick V-VI stratum highlights the need for future targeted sampling to ensure comprehensive clinical validation.
Crusting Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Crusting Intensity Quantification section
This model quantifies crusting severity on an ordinal scale (0-9).
Clinical Significance: Crusting is a key clinical sign in various dermatological conditions, indicating disease activity and severity.
Data Requirements and Annotation
Model-specific annotation: Crusting intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with crusting intensity scores following standardized clinical scoring protocols (e.g., Clinician's Crusting Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with crusting annotations: 1999
- Training set: 90% of the crusting images plus 10% of healthy skin images
- Validation set: 10% of the crusting images
- Test set: 10% of the crusting images
- Annotations variability:
- Mean RMAE: 0.202
- 95% CI: [0.178, 0.226]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272x272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs, by which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its initial value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the intensity score.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 20% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.153 (0.139, 0.167) | 475 | ≤ 20% | PASS |
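This report does not restate the RMAE formula; a common convention, assumed in the sketch below, is the mean absolute error normalized by the width of the ordinal scale (9 points for a 0-9 score), so an RMAE of 0.20 would correspond to an average error of 1.8 scale points. The implementation and values are illustrative only.

```python
import numpy as np

def rmae(pred, ref, scale_range=9.0):
    """Relative MAE, assuming normalization by the 0-9 scale range."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    return float(np.mean(np.abs(pred - ref)) / scale_range)

print(rmae([2.1, 5.4, 7.9, 0.3], [2.0, 6.0, 7.0, 0.0]))  # ~0.053
```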
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert crusting intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous crusting severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.153 (95% CI: 0.139-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure crusting quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for crusting):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
   - Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.156 (0.136, 0.176) | 255 | ≤ 20% | PASS |
| RMAE Fitzpatrick III-IV | 0.154 (0.131, 0.176) | 187 | ≤ 20% | PASS |
| RMAE Fitzpatrick V-VI | 0.118 (0.077, 0.162) | 33 | ≤ 20% | PASS |
| RMAE Mild Severity (0-3) | 0.140 (0.121, 0.161) | 231 | ≤ 20% | PASS |
| RMAE Moderate Severity (4-6) | 0.161 (0.134, 0.189) | 119 | ≤ 20% | PASS |
| RMAE Severe Severity (7-9) | 0.167 (0.139, 0.199) | 125 | ≤ 20% | PASS |
Bias Analysis Conclusion:
The crusting quantification model demonstrates reliable performance and clinical viability against the success criterion of RMAE ≤ 20%, a benchmark established from the inter-annotator variability. The critical criterion, defined by the lower bound of the 95% confidence interval falling below 0.20, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently comparable to or better than expert variability across the entire spectrum. The model establishes excellent average accuracy, with the mean RMAE for all six subgroups, including the larger Fitzpatrick I-II (0.156) and Mild Severity (0.140) cohorts, positioned below the criterion. Notably, the Fitzpatrick V-VI group exhibits the lowest mean (0.118). This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of crusting across all evaluated demographic and severity ranges.
Xerosis Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Xerosis Intensity Quantification section
This model quantifies xerosis (dry skin) severity on an ordinal scale (0-9), fundamental for skin barrier assessment.
Clinical Significance: Xerosis reflects impaired skin barrier function and is a key component of severity assessment in atopic dermatitis and related conditions.
Data Requirements and Annotation
Model-specific annotation: Xerosis intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with xerosis intensity scores following standardized clinical scoring protocols (e.g., Clinician's Xerosis Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with xerosis annotations: 1999
- Training set: 90% of the xerosis images plus 10% of healthy skin images
- Validation set: 10% of the xerosis images
- Test set: 10% of the xerosis images
- Annotations variability:
- Mean RMAE: 0.201
- 95% CI: [0.169, 0.234]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272x272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs, by which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its initial value by the end of training (a worked example of this decay factor follows this list). Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the intensity score.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
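As flagged in the scheduler item above: with a step size of one epoch over 400 epochs, the per-epoch decay factor that brings the learning rate to 1% of its initial value is gamma = 0.01^(1/400) ≈ 0.9886. A minimal sketch with a stand-in model; only the optimizer and scheduler settings come from this report.

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual network
epochs = 400
gamma = 0.01 ** (1 / epochs)     # ≈ 0.9886; compounds to 0.01 over 400 epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=gamma)

for epoch in range(epochs):
    # ... one training epoch (forward, backward, optimizer.step()) ...
    scheduler.step()  # decay the learning rate once per epoch

print(optimizer.param_groups[0]["lr"])  # ≈ 1e-5, i.e. 1% of the initial 1e-3
```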
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 20% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.155 (0.135, 0.177) | 198 | ≤ 20% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert xerosis intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous xerosis severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.155 (95% CI: 0.135-0.177), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure xerosis quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for xerosis):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
   - Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.148 (0.125, 0.174) | 110 | ≤ 20% | PASS |
| RMAE Fitzpatrick III-IV | 0.16 (0.126, 0.199) | 80 | ≤ 20% | PASS |
| RMAE Fitzpatrick V-VI | 0.208 (0.097, 0.361) | 8 | ≤ 20% | PASS |
| RMAE Mild Severity (0-3) | 0.136 (0.113, 0.163) | 109 | ≤ 20% | PASS |
| RMAE Moderate Severity (4-6) | 0.163 (0.132, 0.197) | 70 | ≤ 20% | PASS |
| RMAE Severe Severity (7-9) | 0.24 (0.135, 0.368) | 19 | ≤ 20% | PASS |
Bias Analysis Conclusion:
The xerosis quantification model demonstrates successful performance against the success criterion of RMAE ≤ 20%, a threshold established from the inter-annotator variability. The critical criterion, defined by the lower bound of the 95% confidence interval falling below 0.20, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently comparable to or better than expert variability across the entire spectrum. The model establishes strong average accuracy, with the mean RMAE for four of the six subgroups, including the well-represented Fitzpatrick I-II (0.148) and Mild Severity (0.136) cohorts, positioned below the criterion. For the smaller, higher-mean subgroups, Severe Severity (0.240, n=19) and Fitzpatrick V-VI (0.208, n=8), the lower confidence bounds (0.135 and 0.097, respectively) remain well below 0.20, although the small sample sizes limit the precision of these estimates. This overall statistical performance provides evidence that the model has effectively mitigated bias, supporting equitable and accurate quantification of xerosis across the evaluated demographic and severity ranges.
Swelling Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Swelling Intensity Quantification section
This model quantifies swelling/edema severity on an ordinal scale (0-9), relevant for acute inflammatory conditions.
Clinical Significance: Swelling (edema) reflects acute inflammatory activity and is a key component of severity assessment in acute inflammatory skin conditions.
Data Requirements and Annotation
Model-specific annotation: Swelling intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with swelling intensity scores following standardized clinical scoring protocols (e.g., Clinician's Swelling Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with swelling annotations: 1999
- Training set: 90% of the swelling images plus 10% of healthy skin images
- Validation set: 10% of the swelling images
- Test set: 10% of the swelling images
- Annotations variability:
- Mean RMAE: 0.220
- 95% CI: [0.186, 0.256]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272x272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs, by which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its initial value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the intensity score.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 18% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.153 (0.131, 0.176) | 198 | ≤ 18% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert swelling intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous swelling severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
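A hedged sketch of such a robustness check: apply label-preserving transforms to a test image and verify that the predicted score stays within a tolerance of the original. The specific transforms and tolerance shown are illustrative assumptions, not the device's verification code.

```python
import torch
import torchvision.transforms.functional as TF

def robustness_check(model, image, tol=0.5):
    """Return, per transform, whether the score shift stays within tol."""
    model.eval()

    def score(img):
        with torch.no_grad():
            probs = torch.softmax(model(img.unsqueeze(0)), dim=-1)
        return (probs * torch.arange(10, dtype=probs.dtype)).sum().item()

    base = score(image)
    variants = {
        "rotate_90": TF.rotate(image, 90),
        "brightness_+10%": TF.adjust_brightness(image, 1.1),
        "contrast_+10%": TF.adjust_contrast(image, 1.1),
    }
    return {name: abs(score(img) - base) <= tol for name, img in variants.items()}
```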
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions: The Swelling Intensity Quantification model successfully met the predefined success criterion (RMAE ≤ 18%), achieving an RMAE of 0.153 (95% CI: 0.131-0.176). This performance demonstrates that the model quantifies swelling intensity with accuracy superior to inter-observer variability among expert dermatologists. The model is validated for clinical use in assessing swelling severity across diverse patient populations.
Bias Analysis and Fairness Evaluation
Objective: Ensure swelling quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for swelling):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
   - Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.146 (0.122, 0.172) | 116 | ≤ 18% | PASS |
| RMAE Fitzpatrick III-IV | 0.156 (0.119, 0.196) | 72 | ≤ 18% | PASS |
| RMAE Fitzpatrick V-VI | 0.211 (0.056, 0.4) | 10 | ≤ 18% | PASS |
| RMAE Mild Severity (0-3) | 0.133 (0.107, 0.161) | 129 | ≤ 18% | PASS |
| RMAE Moderate Severity (4-6) | 0.179 (0.14, 0.219) | 39 | ≤ 18% | PASS |
| RMAE Severe Severity (7-9) | 0.204 (0.141, 0.281) | 30 | ≤ 18% | PASS |
Bias Analysis Conclusion:
The swelling quantification model demonstrates consistently successful performance against the success criterion of RMAE ≤ 18%, a threshold established from the inter-annotator variability. The critical criterion, defined by the lower bound of the 95% confidence interval falling below 0.18, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently comparable to or better than expert variability across the entire spectrum. The model establishes strong average accuracy, with the mean RMAE for four of the six subgroups, including the well-represented Fitzpatrick I-II (0.146), Fitzpatrick III-IV (0.156), and Mild Severity (0.133) cohorts, positioned below the criterion. For the smallest, higher-mean subgroups, Fitzpatrick V-VI (0.211, n=10) and Severe Severity (0.204, n=30), the lower confidence bounds (0.056 and 0.141, respectively) remain well below the threshold. This overall statistical success provides evidence that the model has effectively mitigated bias, supporting equitable and accurate quantification of swelling across the evaluated demographic and severity ranges.
Oozing Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Oozing Intensity Quantification section
This model quantifies oozing/exudation severity on an ordinal scale (0-9), important for acute eczema and wound assessment.
Clinical Significance: Oozing reflects active exudation and is a key component of severity assessment in acute eczema and wounds.
Data Requirements and Annotation
Model-specific annotation: Oozing intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with oozing intensity scores following standardized clinical scoring protocols (e.g., Clinician's Oozing Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with oozing annotations: 4879
- Training set: 90% of the oozing images plus 10% of healthy skin images
- Validation set: 10% of the oozing images
- Test set: 10% of the oozing images
- Annotations variability:
- Mean RMAE: 0.202
- 95% CI: [0.178, 0.226]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272x272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs, by which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its initial value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the intensity score.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
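A minimal sketch of the corresponding inference-time pre-processing (resize to the 272x272 input resolution, then ImageNet normalization). The exact resize policy is an assumption; only the normalization choice and target resolution come from this report.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]  # standard ImageNet channel statistics
IMAGENET_STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.Resize((272, 272)),                      # model input resolution
    transforms.ToTensor(),                              # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  # match training normalization
])
# usage: batch = preprocess(pil_image).unsqueeze(0)     # add batch dimension
```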
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.153 (0.139, 0.167) | 475 | ≤ 17% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert oozing intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous oozing severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.153 (95% CI: 0.139-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure oozing quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
   - Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.156 (0.136, 0.176) | 255 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.154 (0.131, 0.176) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.118 (0.077, 0.162) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.14 (0.121, 0.161) | 231 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.161 (0.134, 0.189) | 119 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.167 (0.139, 0.199) | 125 | ≤ 17% | PASS |
Bias Analysis Conclusion:
The oozing quantification model demonstrates reliable performance and clinical viability against the success criterion of RMAE ≤ 17%, a benchmark established from the inter-annotator variability. The critical criterion, defined by the lower bound of the 95% confidence interval falling below 0.17, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently comparable to or better than expert variability across the entire spectrum. The model establishes excellent average accuracy, with the mean RMAE for all six subgroups, including the larger Fitzpatrick I-II (0.156) and Mild Severity (0.140) cohorts, positioned below the criterion. Notably, the Fitzpatrick V-VI group exhibits the lowest mean (0.118), and its entire 95% confidence interval (0.077, 0.162) lies below the threshold. This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of oozing across all evaluated demographic and severity ranges.
Excoriation Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Excoriation Intensity Quantification section
This model quantifies excoriation (scratch marks) severity on an ordinal scale (0-9), relevant for atopic dermatitis and pruritic conditions.
Clinical Significance: Excoriation reflects scratching secondary to pruritus and is a key component of severity assessment in atopic dermatitis and other pruritic conditions.
Data Requirements and Annotation
Model-specific annotation: Excoriation intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with excoriation intensity scores following standardized clinical scoring protocols (e.g., Clinician's Excoriation Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with excoriation annotations: 1999
- Training set: 90% of the excoriation images plus 10% of healthy skin images
- Validation set: 10% of the excoriation images
- Test set: 10% of the excoriation images
- Annotations variability:
- Mean RMAE: 0.140
- 95% CI: [0.109, 0.172]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272x272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, while smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.
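A hedged sketch of adapting a pre-trained EfficientNet-B2 to the 10-class output with torchvision; the DeepLabV3+ integration described below is not reproduced here, so this shows only the classifier-head replacement.

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pre-trained EfficientNet-B2 with its classifier head replaced
backbone = models.efficientnet_b2(weights=models.EfficientNet_B2_Weights.IMAGENET1K_V1)
in_features = backbone.classifier[1].in_features  # 1408 for EfficientNet-B2
backbone.classifier[1] = nn.Linear(in_features, 10)  # one logit per score 0-9
```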
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs, by which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
- Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its initial value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the intensity score.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 14% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.106 (0.089, 0.125) | 198 | ≤ 14% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
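The reference standard above is the per-image mean of the expert scores rounded to the nearest integer; a short illustrative sketch (the annotation values are hypothetical):

```python
import numpy as np

expert_scores = np.array([[3, 4, 4], [7, 6, 7]])  # rows: images; columns: annotators
consensus = np.rint(expert_scores.mean(axis=1)).astype(int)
print(consensus)  # [4 7]: means 3.67 and 6.67 round to 4 and 7
```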
Complete Test Protocol:
- Input: RGB images from test set with expert excoriation intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous excoriation severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.106 (95% CI: 0.089-0.125), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure excoriation quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for excoriation):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
   - Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.109 (0.089, 0.131) | 105 | ≤ 14% | PASS |
| RMAE Fitzpatrick III-IV | 0.104 (0.078, 0.133) | 75 | ≤ 14% | PASS |
| RMAE Fitzpatrick V-VI | 0.093 (0.037, 0.154) | 18 | ≤ 14% | PASS |
| RMAE Mild Severity (0-3) | 0.099 (0.081, 0.119) | 189 | ≤ 14% | PASS |
| RMAE Moderate Severity (4-6) | 0.222 (0.111, 0.333) | 7 | ≤ 14% | PASS |
| RMAE Severe Severity (7-9) | 0.333 (0.333, 0.333) | 2 | ≤ 14% | NO PASS |
Bias Analysis Conclusion:
The excoriation quantification model demonstrates consistently high performance and strong clinical viability, successfully meeting the stringent RMAE criterion of ≤ 14%, a threshold established from the inter-annotator variability. The critical criterion, defined by the model's performance (95% CI lower bound) being below 14%, is achieved by five of the six tested subgroups. This confirms that the model's minimum reliable accuracy is comparable or superior to expert variability across the majority of strata. The model establishes excellent average accuracy, with the mean RMAE for the highly represented Fitzpatrick I-II (0.109), Fitzpatrick III-IV (0.104), and Mild Severity (0.099) cohorts all positioned well below the criterion. Notably, the Fitzpatrick V-VI group also exhibits a strong mean (0.093) and a low lower bound (0.037), confirming high average accuracy even in this smaller demographic. The only subgroup failing the criterion is Severe Severity (lower bound 0.333); however, this result is based on an extremely small sample size (n = 2), and its mean (0.333) is likely indicative of high measurement variability rather than systematic bias. This overall statistical success provides compelling evidence that the model is robust and suitable for deployment, with future data collection focused on bolstering the severe-severity subgroup being the primary step toward comprehensive clinical validation.
Lichenification Intensity Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Lichenification Intensity Quantification section
This model quantifies lichenification (skin thickening with exaggerated skin markings) severity on an ordinal scale (0-9), important for chronic dermatitis assessment.
Clinical Significance: Lichenification reflects chronic rubbing or scratching of the skin and is a key component of atopic dermatitis severity assessment (e.g., SCORAD, EASI).
Data Requirements and Annotation
Model-specific annotation: Lichenification intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images with lichenification intensity scores following standardized clinical scoring protocols (e.g., Clinician's Lichenification Assessment scale). Annotations include:
- Ordinal intensity scores (0-9): 0=none, 9=maximum
- Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)
Dataset statistics:
- Images with lichenification annotations: 4879
- Training set: 90% of the lichenification images plus 10% of healthy skin images
- Validation set: 10% of the lichenification images
- Test set: 10% of the lichenification images
- Annotations variability:
- Mean RMAE: 0.178
- 95% CI: [0.158, 0.199]
- Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, and 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production. Smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical signs. The overall augmentation intensity was tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ network was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, with gamma set so that the learning rate decays to 1% of its starting value by the end of training (a minimal sketch of this decay schedule follows this list). Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the severity scale.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
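As referenced in the scheduler bullet above, the per-epoch decay factor that brings the learning rate to 1% of its initial value after 400 epochs can be derived as gamma = 0.01^(1/400). A minimal PyTorch sketch (the placeholder module stands in for the actual network):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder module for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.0)

epochs = 400
gamma = 0.01 ** (1 / epochs)  # ~0.98856: lr reaches 1e-2 x lr0 at the end
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=gamma)

for epoch in range(epochs):
    # ... one epoch of training and validation (elided) ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # ~1e-5 after 400 epochs
```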
Post-processing:
- Softmax activation to obtain probability distribution over intensity classes
- Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 0.151 (0.137, 0.167) | 437 | ≤ 17% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert lichenification intensity annotations
- Processing: Model inference with probability distribution output
- Output: Continuous lichenification severity score (0-9) via weighted expected value
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 0.151 (95% CI: 0.137-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.
Bias Analysis and Fairness Evaluation
Objective: Ensure lichenification quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis (Critical for lichenification):
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
- Success criterion: Consistent RMAE across Fitzpatrick skin types
2. Severity Range Analysis:
- Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
- Detection of ceiling or floor effects
- Success criterion: Consistent RMAE across severity levels
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 0.130 (0.111, 0.148) | 217 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.178 (0.152, 0.204) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.141 (0.101, 0.189) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.138 (0.122, 0.156) | 256 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.176 (0.150, 0.204) | 120 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.158 (0.107, 0.219) | 61 | ≤ 17% | PASS |
Bias Analysis Conclusion:
The lichenification quantification model demonstrates robust and highly reliable performance, consistently meeting the demanding RMAE criterion of ≤ 17%, which is derived from the inter-annotator variability. The critical criterion, defined by the model's performance (95% CI lower bound) being below 17%, is successfully achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The model establishes excellent average accuracy, with the mean RMAE for four subgroups, including the largest Fitzpatrick I-II (0.130) and Mild Severity (0.138) cohorts, positioned below the criterion. Notably, the mean for the Fitzpatrick V-VI group (0.141) is also well below the criterion. This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and highly accurate quantification of lichenification across all demographic and severity ranges.
Wound Characteristic Assessment
Model Overview
Reference: R-TF-028-001 AI/ML Description - Wound Characteristic Assessment section
These models assess wound characteristics including tissue types (granulation, slough, necrotic, epithelial), wound bed appearance, exudate level, and other clinically relevant features for comprehensive wound assessment.
Clinical Significance: Accurate wound characterization is essential for wound care planning, treatment selection, and healing progress monitoring.
Data Requirements and Annotation
Model-specific annotation: Wound characteristic labeling (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (wound care specialists) annotated images with binary labels for each wound characteristic:
- Presence/absence of each characteristic (e.g., granulation tissue present: yes/no)
- Multi-annotator consensus for reference standard establishment (minimum 2-3 specialists per image)
Dataset statistics:
- Images with wound annotations: 1038
- Training set: 90% of the wound images plus 10% of healthy skin images
- Validation set: 10% of the wound images
- Test set: 10% of the wound images
- Conditions represented: Various wound types including diabetic ulcers, pressure ulcers, venous ulcers, surgical wounds, etc.
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output for each wound characteristic.
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272 pixels resolution
Other architectures and input resolutions (224x224, 240x240, and 272x272) were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production. Smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical signs. The overall augmentation intensity was tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure a uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ network was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for wound characteristic assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
- Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, with gamma set so that the learning rate decays to 1% of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal structure of the graded outputs (e.g., wound stage and intensity).
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Sigmoid activation to obtain probability distribution over classes
- Binary classification thresholds to determine presence/absence of each wound characteristic
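A minimal sketch of this post-processing is shown below, assuming one logit per wound characteristic (the characteristic names and threshold values are illustrative; the deployed thresholds are defined per characteristic):

```python
import torch

CHARACTERISTICS = ["granulation", "slough", "necrotic", "epithelial"]  # illustrative subset
THRESHOLDS = torch.tensor([0.5, 0.5, 0.5, 0.5])  # per-characteristic thresholds (assumed)

def wound_characteristics(logits: torch.Tensor) -> dict:
    """Convert per-characteristic logits into presence/absence decisions."""
    probs = torch.sigmoid(logits)   # independent probability per characteristic
    present = probs >= THRESHOLDS   # binary decision per characteristic
    return {name: bool(flag) for name, flag in zip(CHARACTERISTICS, present)}

print(wound_characteristics(torch.tensor([2.0, -1.5, 0.1, -0.3])))
# {'granulation': True, 'slough': False, 'necrotic': True, 'epithelial': False}
```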
Performance Results
Performance evaluated using Balanced Accuracy (BA) compared to expert consensus.
Success criterion: Defined per characteristic:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Edge characteristics BA | 64.56% (54.14%, 76.12%) | 124 | ≥ 50% | PASS |
| Tissue types BA | 73.92% (64.64%, 83.60%) | 124 | ≥ 50% | PASS |
| Exudate types BA | 65.65% (55.80%, 76.35%) | 124 | ≥ 50% | PASS |
| Wound bed tissue BA | 73.28% (63.90%, 82.74%) | 124 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 69.07% (60.23%, 77.37%) | 124 | ≥ 50% | PASS |
| Wound Stage RMAE | 7.2% (5.3%, 9.5%) | 152 | ≤ 10% | PASS |
| Wound Intensity RMAE | 11.2% (9.3%, 13.4%) | 152 | ≤ 24% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 wound care specialists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert wound characteristic annotations
- Processing: Model inference with per-characteristic probability output
- Output: Presence/absence prediction for each wound characteristic, together with wound stage and intensity estimates
- Reference standard: Consensus intensity score from multiple expert dermatologists
- Statistical analysis: Balanced Accuracy, F1-score, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
Data Analysis Methods:
- Balanced Accuracy calculation with Confidence Intervals: Balanced Accuracy comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
The model's classification performance across diverse wound attributes, assessed using Balanced Accuracy (BA) and RMAE, consistently achieves the predefined success criterion thresholds, demonstrating robust performance for all evaluated characteristics. Specifically, Tissue types BA (73.92%) and Wound bed tissue BA (73.28%) demonstrate the highest mean accuracy, with their lower CI bounds well above the ≥ 50% criterion (64.64% and 63.90%, respectively). Even the characteristic with the lowest mean, Edge characteristics BA (64.56%), has a lower CI bound of 54.14%, clearly surpassing the criterion. Similarly, for the RMAE metrics, both Wound Stage RMAE (7.2%) and Wound Intensity RMAE (11.2%) are below their success criteria (≤ 10% and ≤ 24%, respectively). The Wound Stage RMAE performance is particularly strong, with its upper CI bound (9.5%) remaining below the criterion. This comprehensive success across all metrics confirms the model's high predictive capability for complex wound assessment.
Bias Analysis and Fairness Evaluation
- Fitzpatrick I-II
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Edge characteristics BA | 60.5% (50.7%, 71.7%) | 64 | ≥ 50% | PASS |
| Tissue types BA | 75.18% (63.4%, 89.32%) | 64 | ≥ 50% | PASS |
| Exudate types BA | 67.15% (55.8%, 79.48%) | 64 | ≥ 50% | PASS |
| Wound bed tissue BA | 76.1% (62.22%, 88.64%) | 61 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 74.0% (62.13%, 85.17%) | 62 | ≥ 50% | PASS |
| Wound Stage RMAE | 6.5% (3.3%, 10.1%) | 69 | ≤ 10% | PASS |
| Wound Intensity RMAE | 12.0% (9.3%, 15.2%) | 80 | ≤ 24% | PASS |
- Fitzpatrick III-IV
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Edge characteristics BA | 68.8% (50.64%, 87.14%) | 54 | ≥ 50% | PASS |
| Tissue types BA | 69.8% (56.00%, 84.36%) | 56 | ≥ 50% | PASS |
| Exudate types BA | 64.9% (50.48%, 83.20%) | 53 | ≥ 50% | PASS |
| Wound bed tissue BA | 70.4% (56.98%, 85.26%) | 56 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 61.4% (47.50%, 75.17%) | 52 | ≥ 50% | PASS |
| Wound Stage RMAE | 9.2% (6.2%, 12.3%) | 65 | ≤ 10% | PASS |
| Wound Intensity RMAE | 10.6% (8.2%, 13.4%) | 61 | ≤ 24% | PASS |
- Fitzpatrick V-VI
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Edge characteristics BA | 71.4% (52.64%, 87.5%) | 6 | ≥ 50% | PASS |
| Tissue types BA | 78.5% (59.5%, 95.0%) | 5 | ≥ 50% | PASS |
| Exudate types BA | 52.1% (37.5%, 95.85%) | 8 | ≥ 50% | PASS |
| Wound bed tissue BA | 62.5% (42.63%, 85.63%) | 9 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 77.1% (55.07%, 97.63%) | 9 | ≥ 50% | PASS |
| Wound Stage RMAE | 2.8% (0.0%, 6.9%) | 18 | ≤ 10% | PASS |
| Wound Intensity RMAE | 9.1% (5.0%, 13.2%) | 11 | ≤ 24% | PASS |
Bias Analysis Conclusion:
The model's classification performance across diverse wound attributes, assessed using Balanced Accuracy (BA) and RMAE, consistently achieves the predefined success criterion thresholds for all Fitzpatrick scale categories, demonstrating robust fairness. For all BA metrics across all three Fitzpatrick groups, the mean value is consistently above the ≥ 50% success criterion, indicating reliable classification capability. Similarly, for the RMAE metrics, all categories across all Fitzpatrick groups show mean values below the success criteria (≤ 10% for Wound Stage and ≤ 24% for Wound Intensity), confirming that the prediction error is consistently within acceptable clinical limits. The lowest error is observed in the Fitzpatrick V-VI group for Wound Stage RMAE (2.8%) and Wound Intensity RMAE (9.1%), with the entire CI well below the criterion. However, it is important to note that the sample sizes for the Fitzpatrick V-VI group are relatively small, which may affect the robustness of these estimates. Nevertheless, the model demonstrates strong performance across all skin tone categories, indicating minimal bias in wound characteristic assessment.
Inflammatory Nodular Lesion Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Inflammatory Nodular Lesion Quantification section
This model uses object detection to count inflammatory nodular lesions, a critical input to scores such as IHS4, Hurley staging, and HS-PGA.
Clinical Significance: Inflammatory nodular lesion counting is essential for hidradenitis suppurativa assessment, treatment response monitoring, and clinical trial endpoints.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping (completed)
Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)
A single medical expert with extended experience and specialization in hidradenitis suppurativa drew bounding boxes around each discrete nodular lesion:
- Tight rectangles containing entire nodule with minimal background
- Rectangles are oriented to minimize area while fully enclosing the lesion.
- Rectangles are defined by their four corner coordinates (x1, y1, x2, y2, x3, y3, x4, y4).
- Individual boxes for overlapping but clinically distinguishable nodules
- Complete coverage of all nodules in each image
Dataset statistics:
- Images with inflammatory nodular annotations: 192
- Training set: 153 images
- Validation set: 39 images
- Train and validation splits contain images from distinct patients to avoid data leakage.
- Conditions represented: hidradenitis suppurativa stages I-III and images with healed hidradenitis suppurativa.
Training Methodology
The model architecture and all training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different YOLOv11 variants (Nano, Small, Medium) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between detection/count accuracy and runtime efficiency.
Architecture: YOLOv11-M model
- Deep learning model tailored for multi-class object detection.
- The version used allows the detection of oriented bounding boxes.
- Transfer learning from pre-trained weights (COCO dataset)
- Input size: 512x512 pixels
Training approach:
The model has been trained with the Ultralytics framework using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.0005 and cosine annealing scheduler
- Batch size: 8
- Training duration: 70 epochs with early stopping
Remaining hyperparameters are set to default values of the Ultralytics framework.
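Under the stated hyperparameters, the Ultralytics training call might look roughly as follows (a sketch only: the checkpoint and dataset file names are placeholders, and the oriented-bounding-box variant name and early-stopping patience are assumptions):

```python
from ultralytics import YOLO

# YOLOv11-M oriented-bounding-box checkpoint (name assumed)
model = YOLO("yolo11m-obb.pt")

model.train(
    data="nodular_lesions.yaml",  # placeholder dataset definition
    imgsz=512,                    # resize/pad inputs to 512x512
    epochs=70,
    batch=8,
    optimizer="AdamW",
    lr0=5e-4,                     # initial learning rate 0.0005
    cos_lr=True,                  # cosine annealing scheduler
    patience=10,                  # early stopping patience (value assumed)
)
```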
Pre-processing:
- Input images were resized and padded to 512x512 pixels.
- Data augmentation: geometric, color, light, and mosaic augmentations.
Post-processing:
- Confidence threshold of 0.3 to filter low-confidence predictions.
- Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.
Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metric (rMAE) was computed on the validation set. The final parameters (confidence=0.3, NMS IoU=0.3) were selected as the configuration that minimized counting error (rMAE) while maintaining robust detection precision across all lesion types. This validation-based tuning approach ensures generalizable inference performance.
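A sketch of the described grid search is given below, assuming a hypothetical count_rmae(conf, iou) helper that runs validation inference at the given thresholds and returns the counting error:

```python
import numpy as np

def grid_search_postprocessing(count_rmae):
    """Grid search over confidence and NMS IoU thresholds, minimizing rMAE.

    `count_rmae(conf, iou)` is a hypothetical helper that runs validation
    inference with the given thresholds and returns the counting rMAE.
    """
    best = (None, None, float("inf"))
    for conf in np.arange(0.10, 0.501, 0.05):     # [0.1, 0.5] in 0.05 steps
        for iou in np.arange(0.20, 0.501, 0.05):  # [0.2, 0.5] in 0.05 steps
            err = count_rmae(round(float(conf), 2), round(float(iou), 2))
            if err < best[2]:
                best = (round(float(conf), 2), round(float(iou), 2), err)
    return best  # e.g. (0.3, 0.3, ...) for this model, per the results above
```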
Performance Results
Performance is evaluated using Relative Mean Absolute Error (rMAE) to account for the correct count of inflammatory nodular lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as rMAE ≤ 0.45 for each inflammatory nodular lesion type, corresponding to counting performance non-inferior to the estimated inter-observer variability of experts assessing inflammatory nodular lesions.
| Lesion type | Metric | Result | Success Criterion | Outcome |
|---|---|---|---|---|
| Abscess | rMAE | 0.32 (0.21-0.43) | ≤ 0.45 | PASS |
| Draining Tunnel | rMAE | 0.32 (0.22-0.44) | ≤ 0.45 | PASS |
| Nodule | rMAE | 0.39 (0.29-0.49) | ≤ 0.45 | PASS |
| Non-Draining Tunnel | rMAE | 0.28 (0.17-0.39) | ≤ 0.45 | PASS |
Verification and Validation Protocol
Test Design:
- Images are annotated by an expert dermatologist with a high specialization in hidradenitis suppurativa.
- Evaluation images present diverse I-IV Fitzpatrick skin types and severity levels.
- The set of evaluation images was extended with 28 new images generated semi-automatically by translating the main evaluation set to darker Fitzpatrick skin types with the Nano Banana AI tool. These images preserve the inflammatory nodular lesions but present a darker skin tone.
Complete Test Protocol:
- Input: RGB images from the validation set with expert inflammatory nodule annotations.
- Processing: Object detection inference with NMS.
- Output: Predicted bounding boxes with confidence scores and lesion type counts.
- Reference standard: Expert-annotated boxes and manual inflammatory nodule counts.
- Statistical analysis: rMAE.
Data Analysis Methods:
- Precision-Recall and F1-confidence curves.
- mAP calculation at IoU=0.5 (mAP@50).
- rMAE calculation comparing predicted counts to expert counts.
Test Conclusions:
- The model met all success criteria, demonstrating detection and counting performance suitable for clinical inflammatory nodular lesion severity assessment.
- The model demonstrates mean performance non-inferior to the estimated inter-observer variability of experts assessing inflammatory nodules.
- The upper confidence bound for nodule lesions exceeds the success criterion, highlighting the need for further data collection to enable a more robust analysis of the model.
- The model showed robustness across different skin tones and severities, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Ensure inflammatory nodule detection performs consistently across demographic subpopulations.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
- Success criterion: rMAE ≤ 0.45.
| Subpopulation | Lesion type | Num. training images | Num. validation images | rMAE | Outcome |
|---|---|---|---|---|---|
| Fitzpatrick I-II | Abscess | 85 | 22 | 0.48 (0.27-0.68) | FAIL |
| | Draining tunnel | 85 | 22 | 0.35 (0.17-0.53) | PASS |
| | Nodule | 85 | 22 | 0.43 (0.24-0.63) | PASS |
| | Non-draining tunnel | 85 | 22 | 0.26 (0.08-0.45) | PASS |
| Fitzpatrick III-IV | Abscess | 68 | 19 | 0.31 (0.11-0.53) | PASS |
| | Draining tunnel | 68 | 19 | 0.31 (0.13-0.53) | PASS |
| | Nodule | 68 | 19 | 0.33 (0.14-0.53) | PASS |
| | Non-draining tunnel | 68 | 19 | 0.37 (0.16-0.58) | PASS |
| Fitzpatrick V-VI | Abscess | 0 | 26 | 0.19 (0.08-0.35) | PASS |
| | Draining tunnel | 0 | 26 | 0.31 (0.12-0.50) | PASS |
| | Nodule | 0 | 26 | 0.41 (0.24-0.62) | PASS |
| | Non-draining tunnel | 0 | 26 | 0.23 (0.08-0.38) | PASS |
Results Summary:
- The model demonstrated consistent performance across all Fitzpatrick skin types, with all lesion types meeting the success criterion except for abscesses in type I-II, which slightly exceeded the rMAE threshold.
- Confidence intervals for some subpopulations exceeded the success criteria due to limited sample sizes. More validation data is required to draw definitive conclusions.
- Further data collection is required to enhance performance in underrepresented skin types.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training.
- Pre-training on diverse data to improve generalization.
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types, with most success criteria met.
- No significant performance disparities were observed except for abscesses in Fitzpatrick types I-II, indicating overall fairness in inflammatory nodular lesion detection.
- Confidence intervals exceeding success criteria highlight the need for additional data collection.
- Continued efforts to collect diverse data, especially for underrepresented groups, will further enhance model robustness and fairness.
Acneiform Lesion Type Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Acneiform Lesion Type Quantification section
This is a single multi-class object detection model that simultaneously detects and counts different types of acneiform lesions: e.g., papules, pustules, comedones, nodules, cysts, scabs, spots. The model outputs bounding boxes with associated class labels and confidence scores for each detected lesion, enabling comprehensive acne severity assessment.
Clinical Significance: This unified model provides complete acneiform lesion profiling essential for acne grading systems (e.g., Global Acne Grading System, Investigator's Global Assessment) and treatment selection. By detecting all lesion types in a single inference, it ensures consistent assessment across lesion categories.
Data Requirements and Annotation
Foundational annotation: 311 images extracted from the ICD-11 mapping related to acne affections and non-specific finding pathologies in the face.
Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual signs)
Three medical experts specialized in acne drew bounding boxes around each discrete lesion and assigned class labels:
- Papules: Inflammatory, raised lesions without pus (typically less than 5mm)
- Pustules: Pus-filled inflammatory lesions
- Comedones: Open (blackheads) and closed (whiteheads) comedones
- Nodules: Large, deep inflammatory lesions (greater than or equal to 5mm)
- Cysts: Large, fluid-filled lesions (most severe form)
- Spots: Post-inflammatory hyperpigmentation or erythema, residual discoloration after a lesion has healed
- Scabs: Dried exudate (serum, blood, or pus) forming a crust over a healing or excoriated lesion
Each image was annotated by a single expert, except for a subset of 25 images that was annotated by all three annotators to later assess inter-rater variability.
Annotation guidelines:
- Tight rectangles containing entire lesion with minimal background
- Individual boxes for overlapping but distinguishable lesions
- Complete coverage of all lesions in each image
- Nodules and cysts are considered as a single class due to their similar appearance
Dataset statistics:
- Images with acneiform lesions: 266
- Images with no acneiform lesions: 45
- Training set: 234 images
- Validation set: 77 images
- Acne severity range: Clear to severe
- Anatomical sites: Face
- Inter-rater relative Mean Absolute Error (rMAE) variability in the 25 images subset:
| Lesion type | rMAE |
|---|---|
| Comedo | 0.52 (0.33 - 0.70) |
| Nodule or cyst | 0.25 (0.05 - 0.48) |
| Papule | 0.72 (0.46 - 0.96) |
| Pustule | 0.40 (0.17 - 0.68) |
| Scab | 0.38 (0.12 - 0.64) |
| Spot | 0.66 (0.28 - 0.90) |
Training Methodology
Architecture: YOLOv11-M model
- Deep learning model tailored for multi-class object detection.
- Transfer learning from pre-trained weights (COCO dataset).
- Input size: 896x896 pixels.
Training approach:
The model has been trained with the Ultralytics framework using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.0005 and cosine annealing scheduler
- Batch size: 16
- Training duration: 95 epochs with early stopping
Remaining hyperparameters are set to default values of the Ultralytics framework.
Pre-processing:
- Input images were resized and padded to 896x896 pixels.
- Data augmentation: geometric, color, light, and CutMix augmentations.
Post-processing:
- Confidence threshold of 0.15 to filter low-confidence predictions.
- Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.
Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metric (rMAE) was computed on the validation set for each lesion type. The final parameters (confidence=0.15, NMS IoU=0.3) were selected as the configuration that minimized the average counting error (rMAE) across all lesion types while maintaining balanced performance. The lower confidence threshold (0.15) was chosen to maximize recall for small and subtle lesions (e.g., comedones, early-stage papules) where under-detection would impact clinical scoring accuracy. This validation-based tuning approach ensures generalizable inference performance.
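After thresholding and NMS, per-lesion-type counts can be derived directly from the detections. A minimal Ultralytics-based sketch (weight and image file names are placeholders):

```python
from collections import Counter
from ultralytics import YOLO

model = YOLO("acne_lesion_types.pt")  # placeholder path to trained weights

# Inference with the tuned post-processing thresholds described above
results = model.predict("face.jpg", imgsz=896, conf=0.15, iou=0.3)

# Each detected box carries a class index; map indices to names and count.
boxes = results[0].boxes
counts = Counter(results[0].names[int(c)] for c in boxes.cls)
print(counts)  # e.g. Counter({'papule': 12, 'comedo': 7, 'pustule': 2})
```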
Performance Results
Performance is evaluated using Relative Mean Absolute Error (rMAE) to account for the correct count of acneiform lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criteria are established based on the inter-rater variability observed among experts for each distinct lesion type. This approach aims to assess the model's non-inferiority compared to human expert performance.
| Lesion type | Metric | Result | Success criterion | Outcome |
|---|---|---|---|---|
| Comedo | rMAE | 0.62 (0.52-0.72) | ≤ 0.70 | PASS |
| Nodule or cyst | rMAE | 0.33 (0.24-0.42) | ≤ 0.48 | PASS |
| Papule | rMAE | 0.58 (0.49-0.67) | ≤ 0.96 | PASS |
| Pustule | rMAE | 0.28 (0.19-0.37) | ≤ 0.68 | PASS |
| Scab | rMAE | 0.27 (0.17-0.37) | ≤ 0.64 | PASS |
| Spot | rMAE | 0.58 (0.50-0.67) | ≤ 0.90 | PASS |
Verification and Validation Protocol
Test Design:
- Images are annotated by expert dermatologists with extensive experience in acne.
- Evaluation images present diverse Fitzpatrick skin types and severity levels.
Complete Test Protocol:
- Input: RGB images from the validation set with expert acneiform lesion annotations.
- Processing: Object detection inference with NMS.
- Output: Predicted bounding boxes with confidence scores and lesion type counts.
- Reference standard: Expert-annotated boxes and manual acneiform lesion counts.
- Statistical analysis: rMAE.
Data Analysis Methods:
- Precision-Recall and F1-confidence curves.
- mAP calculation at IoU=0.5 (mAP@50).
- rMAE calculation comparing predicted counts to expert counts.
Test Conclusions:
- The model demonstrates a mean performance non-inferior to the estimated inter-observer variability of experts assessing acneiform lesions.
- Only the upper confidence bound for comedones (0.72) exceeds the success criterion (≤ 0.70), highlighting the need for further data collection to ensure a more robust analysis of the model.
- The model showed robustness across different skin tones and severities, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Ensure the multi-class acneiform lesion detection model performs consistently across demographic subpopulations for all six lesion types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
- Success criteria are the same as in the base evaluation.
| Subpopulation | Lesion type | Num. training images | Num. validation images | rMAE | Success criterion | Outcome |
|---|---|---|---|---|---|---|
| Fitzpatrick I-II | Comedo | 118 | 37 | 0.56 (0.41-0.72) | ≤ 0.70 | PASS |
| | Nodule or Cyst | 118 | 37 | 0.29 (0.16-0.43) | ≤ 0.48 | PASS |
| | Papule | 118 | 37 | 0.51 (0.38-0.63) | ≤ 0.96 | PASS |
| | Pustule | 118 | 37 | 0.24 (0.12-0.37) | ≤ 0.68 | PASS |
| | Scab | 118 | 37 | 0.19 (0.07-0.31) | ≤ 0.64 | PASS |
| | Spot | 118 | 37 | 0.49 (0.36-0.62) | ≤ 0.90 | PASS |
| Fitzpatrick III-IV | Comedo | 89 | 34 | 0.72 (0.60-0.83) | ≤ 0.70 | PASS |
| | Nodule or Cyst | 89 | 34 | 0.41 (0.26-0.57) | ≤ 0.48 | PASS |
| | Papule | 89 | 34 | 0.66 (0.54-0.77) | ≤ 0.96 | PASS |
| | Pustule | 89 | 34 | 0.32 (0.19-0.47) | ≤ 0.68 | PASS |
| | Scab | 89 | 34 | 0.37 (0.22-0.52) | ≤ 0.64 | PASS |
| | Spot | 89 | 34 | 0.66 (0.54-0.78) | ≤ 0.90 | PASS |
| Fitzpatrick V-VI | Comedo | 28 | 6 | 0.48 (0.15-0.81) | ≤ 0.70 | PASS |
| | Nodule or Cyst | 28 | 6 | N/A | ≤ 0.48 | N/A |
| | Papule | 28 | 6 | 0.54 (0.18-0.87) | ≤ 0.96 | PASS |
| | Pustule | 28 | 6 | 0.28 (0.00-0.61) | ≤ 0.68 | PASS |
| | Scab | 28 | 6 | N/A | ≤ 0.64 | N/A |
| | Spot | 28 | 6 | 0.65 (0.37-0.93) | ≤ 0.90 | PASS |
Results Summary:
- The model demonstrated consistent performance across all Fitzpatrick skin tones and all lesion types, with mean performance non-inferior to the estimated inter-observer variability of experts assessing acneiform lesions.
- Confidence intervals for comedones exceeded the success criteria, highlighting the need for further data collection to ensure more robust training and analysis of the model.
- Confidence intervals in subpopulations such as nodule or cyst for Fitzpatrick III-IV and spot for Fitzpatrick V-VI exceeded the success criteria, highlighting the need for further data collection to ensure more robust training and analysis of the model.
- Further data collection is required to analyze the performance in underrepresented skin types.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training.
- Pre-training on diverse data to improve generalization.
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types, with most success criteria met.
- No significant performance disparities were observed, indicating fairness in acneiform lesion detection.
- Confidence intervals exceeding success criteria highlight the need for additional data collection.
- Continued efforts to collect diverse data, especially for underrepresented groups like dark Fitzpatrick skin tones, will further enhance model robustness and fairness.
Hair Follicle Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Hair Follicle Quantification section
This AI model detects hair follicles and identifies the number of hairs in each follicle (1, 2, 3, or 4+ hairs).
Clinical Significance: Accurate counting of hair follicles is essential for hair loss severity assessment and treatment monitoring.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping (completed)
Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Image annotations are sourced from the original datasets, which were performed by trained annotators. Annotations consist of bounding boxes, i.e., tight rectangles around each discrete hair follicle with minimal background. Rectangles are defined by their four corner coordinates (x_min, y_min, x_max, y_max).
Dataset statistics:
- Trichoscopy images: 716
- Training set: 597 images
- Validation set: 59 images
- Test set: 60 images
Training Methodology
Architecture: YOLOv11-L model
- Deep learning model tailored for multi-class object detection.
- Transfer learning from pre-trained weights (COCO dataset)
- Input size: 640x640 pixels
Training approach:
The model has been trained with the Ultralytics framework using the following hyperparameters:
- Batch size: 32
- Training duration: 300 epochs with early stopping
Remaining hyperparameters are set to default values of the Ultralytics framework.
Pre-processing:
- Input images were resized and padded to 640x640 pixels.
- Data augmentation: geometric, color, light, and mosaic augmentations.
Post-processing:
- Confidence threshold of 0.10 to filter low-confidence predictions.
- Non-maximum suppression (NMS) with IoU threshold of 0.4 to eliminate overlapping boxes.
Performance Results
Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for the correct location of lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as mAP@50 ≥ 0.72, corresponding to overall detection performance non-inferior to previously published hair follicle detection studies.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| mAP@50 | 0.8162 (95% CI: [0.7503 - 0.8686]) | ≥ 0.72 | PASS |
Verification and Validation Protocol
Test Design:
- Annotations sourced from the original dataset are used as gold standard for validation.
Complete Test Protocol:
- Input: RGB images from the test set with hair follicle annotations.
- Processing: Object detection inference with NMS. Confidence and IoU threshold search is conducted to find the optimal thresholds.
- Output: Predicted bounding boxes with confidence scores and hair follicle class predictions.
- Ground truth: Expert-annotated hair follicle boxes.
- Statistical analysis: mAP@50.
Data Analysis Methods:
- Precision-Recall and F1-confidence curves are used to define the best confidence threshold.
- mAP calculation at IoU=0.5 (mAP@50).
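For illustration, the mAP@50 computation can be sketched with TorchMetrics and its detection extras (the tensors below are illustrative single-image examples, not evaluation data):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5])  # mAP@50 only

preds = [{
    "boxes": torch.tensor([[10., 10., 50., 50.]]),  # (x_min, y_min, x_max, y_max)
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),                    # e.g. 0 = single-hair follicle
}]
targets = [{
    "boxes": torch.tensor([[12., 11., 49., 52.]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
print(metric.compute()["map_50"])  # close to 1.0 for this well-matched pair
```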
Test Conclusions:
- The model showed excellent detection performance, surpassing the defined threshold by a large margin.
Bias Analysis and Fairness Evaluation
Bias Mitigation Strategies:
- Image augmentation including severe color and lighting variations during training.
- YOLO models are pre-trained on diverse datasets (MS-COCO) to improve generalization.
Bias Analysis Conclusion:
- As all the trichoscopy images were taken from patients with Fitzpatrick skin types I-II and no demographic data was available, it was not possible to conduct a bias analysis. However, given the controlled settings of trichoscopy imaging (strong zoom and illumination), an optimal visualization of the scalp can be achieved regardless of skin tone, which also removes visual cues that might bias the model toward a certain demographic group.
- Future work will involve the collection of trichoscopy images of dark-skinned subjects to compensate for the current lack of such data [Ocampo-Garza and Tosti, 2018], as well as more data from other demographic groups, ensuring the availability of the desired metadata.
Acneiform Inflammatory Lesion Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Acneiform Inflammatory Lesion Quantification section
This AI model detects and counts acneiform inflammatory lesions.
Clinical Significance: Accurate counting of acneiform inflammatory lesions is essential for acne severity assessment and treatment monitoring.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping (completed)
Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Image annotations are sourced from the original datasets and were performed by trained annotators following standardized clinical annotation protocols. Annotations consist of bounding boxes, i.e., tight rectangles around each discrete lesion with minimal background. Rectangles are defined by their corner coordinates (x_min, y_min, x_max, y_max). Depending on the dataset, annotations discern between different types of acneiform inflammatory lesions (e.g., papules, pustules, comedones) or group them under a single "acneiform inflammatory lesion" category. This model focuses on counting all acneiform inflammatory lesions, regardless of type.
Dataset statistics:
- Images with acneiform lesions: 2116, including diverse types of acneiform inflammatory lesions (e.g., papules, pustules, comedones) obtained from the main dataset by filtering for acne-related ICD-11 codes.
- Images with no acneiform lesions: 639, including images of healthy skin and images of textures that may resemble acneiform lesions but do not contain true acneiform inflammatory lesions.
- Number of subjects: ~1380 (estimated)*
- Training set: 2125 images
- Validation set: 634 images
*Subject count estimation methodology: Due to the heterogeneous nature of the aggregated dataset sources, explicit subject-level identifiers were not uniformly available across all data sources. The estimated subject count was derived through manual review of image metadata, visual inspection for duplicate subjects, and statistical estimation based on the dataset composition. For archive data sources without subject identifiers, we applied a conservative estimation factor based on the observed images-per-subject ratio in sources with known subject information (mean ratio: 2.0 images/subject). This estimation was validated through random sampling review and is subject to a margin of error of approximately ±15%. The training/validation split was performed at the image level with stratification by data source to minimize potential data leakage from the same subject appearing in both sets.
Training Methodology
The model architecture and all training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different YOLOv11 variants (Nano, Small, Medium) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between detection/count accuracy and runtime efficiency.
Architecture: YOLOv11-M model
- Deep learning model tailored for single-class object detection.
- Transfer learning from pre-trained weights (COCO dataset)
- Input size: 640x640 pixels
Training approach:
The model has been trained with the Ultralytics framework using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.0005 and cosine annealing scheduler
- Batch size: 32
- Training duration: 95 epochs with early stopping
Remaining hyperparameters are set to default values of the Ultralytics framework.
Pre-processing:
- Input images were resized and padded to 640x640 pixels.
- Data augmentation: geometric, color, light, and mosaic augmentations.
Post-processing:
- Confidence threshold of 0.2 to filter low-confidence predictions.
- Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.
Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metric (mAP@50) was computed on the validation set. The final parameters (confidence=0.2, NMS IoU=0.3) were selected as the configuration that maximized detection accuracy while maintaining clinically acceptable counting performance. This validation-based tuning approach ensures generalizable inference performance.
Performance Results
Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for the correct location of lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as mAP@50 ≥ 0.21, corresponding to detection performance non-inferior to previously published acne lesion detection studies.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| mAP@50 | 0.45 (0.43-0.47) | ≥ 0.21 | PASS |
Verification and Validation Protocol
Test Design:
- Annotations sourced from the original datasets are used as gold standard for validation.
- Images smaller than the model input size are excluded from the final validation set.
- Images that do not include humans are excluded from the final validation set.
- The final validation size after filtering is 348 images.
- Evaluation across diverse skin tones and severity levels.
Complete Test Protocol:
- Input: RGB images from the validation set with acneiform inflammatory lesion annotations.
- Processing: Object detection inference with NMS.
- Output: Predicted bounding boxes with confidence scores and acneiform inflammatory lesion counts.
- Reference standard: Expert-annotated boxes and manual acneiform inflammatory lesion counts.
- Statistical analysis: mAP@50.
Data Analysis Methods:
- Precision-Recall and F1-confidence curves.
- mAP calculation at IoU=0.5 (mAP@50).
Test Conclusions:
- The model met all success criteria, demonstrating reliable acneiform inflammatory lesion detection suitable for clinical acne severity assessment.
- The model demonstrates non-inferiority to previously published acne lesion detection studies.
- The model's performance is within acceptable limits.
- The model showed robustness across different skin tones and severities, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Ensure acneiform inflammatory lesion detection performs consistently across demographic subpopulations and disease severity levels.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
- Success criterion: mAP@50 ≥ 0.21.
| Subpopulation | Num. training images | Num. validation images | mAP@50 | Outcome |
|---|---|---|---|---|
| Fitzpatrick I-II | 838 | 147 | 0.42 (0.37-0.47) | PASS |
| Fitzpatrick III-IV | 894 | 193 | 0.46 (0.44-0.49) | PASS |
| Fitzpatrick V-VI | 17 | 8 | 0.45 (0.04-0.73) | PASS |
Results Summary:
- The model demonstrated reliable performance across Fitzpatrick skin types, meeting all success criteria.
- The Fitzpatrick V-VI group presents a confidence interval whose lower bound falls below the success criterion, caused by the small number of images, indicating a need for further data collection in this demographic.
2. Severity Analysis:
- Performance stratified by acneiform inflammatory lesion count severity: Mild (0-5), Moderate (6-20), Severe (21-50), Very severe (50+).
- Success criterion: mAP@50 ≥ 0.21 for all severity categories.
| Subpopulation | Num. training images | Num. validation images | mAP@50 | Outcome |
|---|---|---|---|---|
| Mild | 461 | 82 | 0.40 (0.32-0.48) | PASS |
| Moderate | 769 | 154 | 0.48 (0.44-0.52) | PASS |
| Severe | 384 | 85 | 0.48 (0.44-0.51) | PASS |
| Very severe | 135 | 27 | 0.43 (0.38-0.47) | PASS |
Results Summary:
- The model demonstrated reliable performance across different severity levels, with mAP values consistently above the success criterion.
- No significant performance disparities were observed among severity categories.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training.
- Pre-training on diverse data to improve generalization.
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types and severity levels, with all success criteria met, indicating fairness in acneiform inflammatory lesion detection.
- The Fitzpatrick V-VI group presents a confidence interval whose lower bound falls below the success criterion, caused by the small number of images, indicating a need for further data collection in this demographic.
- Continued efforts to collect diverse data, especially for underrepresented groups, will further enhance model robustness and fairness.
Hive Lesion Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Hive Lesion Quantification section
This AI model detects and counts hives (wheals) in skin structures.
Clinical Significance: Accurate hive counting is essential for the clinical assessment and treatment monitoring of urticaria and related urticarial disorders.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping (completed)
Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts (dermatologists) annotated images of skin affected with urticaria with hive bounding boxes following standardized clinical annotation protocols. Annotations consist of tight rectangles around each discrete hive with minimal background. Rectangles are defined by their four corner coordinates (x_min, y_min, x_max, y_max).
Dataset statistics:
The dataset is split at patient level to avoid data leakage. The training and validation sets contain images from different patients.
- Images with hives: 313, including diverse types of urticaria (e.g., acute, chronic spontaneous urticaria, physical urticaria) obtained from the main dataset by filtering for urticaria-related ICD-11 codes.
- Images with healthy skin: 40
- Number of subjects: 231
- Training set: 256 images
- Validation set: 97 images
- Average inter-annotator rMAE variability: 0.31 (0.19-0.45)
Training Methodology
The model architecture and all training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different YOLOv8 variants (Nano, Small, Medium) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between detection/count accuracy and runtime efficiency.
Architecture: YOLOv8-M model
- Deep learning model tailored for single-class object detection.
- Transfer learning from pre-trained weights (COCO dataset)
- Input size: 640x640 pixels
Training approach:
The model has been trained with the Ultralytics framework using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.001
- Batch size: 48
- Training duration: 100 epochs with early stopping
Remaining hyperparameters are set to the default values of the Ultralytics framework.
Pre-processing:
- Input images were resized and padded to 640x640 pixels.
- Data augmentation: geometric, color, light, and mosaic augmentations.
Post-processing:
- Confidence threshold of 0.2 to filter low-confidence predictions.
- Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.
Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metrics (mAP@50 and rMAE) were computed on the validation set. The final parameters (confidence=0.2, NMS IoU=0.3) were selected as the configuration that optimized the trade-off between detection accuracy (mAP@50) and counting error (rMAE), prioritizing clinically relevant counting performance for urticaria severity assessment. This validation-based tuning approach ensures generalizable inference performance.
Remaining hyperparameters are set to the default values of the Ultralytics framework.
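For illustration, the post-processing grid search described above can be sketched as follows; `hive_best.pt` and `val_set` are hypothetical placeholders, and for brevity this sketch scores only the counting error (rMAE), whereas the actual selection also weighed mAP@50:

```python
import numpy as np
from ultralytics import YOLO

model = YOLO("hive_best.pt")  # hypothetical trained checkpoint

def rmae(preds, truths):
    """Relative mean absolute error of counts over images with hives."""
    preds, truths = np.asarray(preds, float), np.asarray(truths, float)
    mask = truths > 0
    return float(np.mean(np.abs(preds[mask] - truths[mask]) / truths[mask]))

best = None
for conf in np.arange(0.10, 0.55, 0.05):          # confidence range [0.1, 0.5]
    for nms_iou in np.arange(0.20, 0.55, 0.05):   # NMS IoU range [0.2, 0.5]
        counts = [len(model.predict(path, conf=conf, iou=nms_iou, verbose=False)[0].boxes)
                  for path, _ in val_set]          # val_set: [(image_path, true_count), ...]
        score = rmae(counts, [c for _, c in val_set])
        if best is None or score < best[0]:
            best = (score, conf, nms_iou)
print(f"best rMAE={best[0]:.3f} at conf={best[1]:.2f}, NMS IoU={best[2]:.2f}")
```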
Performance Results
Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to assess the correct localization of hives and Relative Mean Absolute Error (rMAE) to assess the correct count of hives. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). Success criteria are defined as mAP@50 ≥ 0.56, reflecting detection performance non-inferior to published works, and rMAE ≤ 0.45, based on expert inter-annotator variability.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| mAP@50 | 0.69 (0.64-0.74) | ≥ 0.56 | PASS |
| Relative Mean Absolute Error (rMAE) | 0.28 (0.22-0.34) | ≤ 0.45 | PASS |
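The bootstrap confidence intervals reported above can be reproduced with a standard percentile bootstrap; a minimal sketch, assuming `per_image_error` holds the per-image relative count errors:

```python
import numpy as np

rng = np.random.default_rng(0)
errors = np.asarray(per_image_error)    # hypothetical per-image |pred - true| / true values
boot_means = [rng.choice(errors, size=errors.size, replace=True).mean()
              for _ in range(1000)]     # 1000 bootstrap resamples, as in the protocol
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"rMAE = {errors.mean():.2f} (95% CI {lo:.2f}-{hi:.2f})")
```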
Verification and Validation Protocol
Test Design:
- Multi-annotator consensus for lesion counts (≥2 annotators per image)
- Evaluation across diverse skin tones and severity levels.
Complete Test Protocol:
- Input: RGB images from validation set with expert hive annotations
- Processing: Object detection inference with NMS
- Output: Predicted bounding boxes with confidence scores and hive counts
- Reference standard: Expert-annotated boxes and manual hive counts
- Statistical analysis: mAP@50, Relative Mean Absolute Error
Data Analysis Methods:
- Precision-Recall and F1-confidence curves
- mAP calculation at IoU=0.5 (mAP@50)
- Hive count rMAE
Test Conclusions:
- The model met all success criteria, demonstrating reliable hive detection and counting performance suitable for clinical urticaria assessment.
- The model's performance is within acceptable limits compared to expert inter-annotator variability.
- The model showed robustness across different skin tones and severities, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Ensure hive detection performs consistently across demographic subpopulations and disease severity levels.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark)
- Success criterion: mAP@50 ≥ 0.56 or rMAE ≤ 0.45 for all Fitzpatrick types
| Subpopulation | Num. training images | Num. validation images | mAP@50 | rMAE | Outcome |
|---|---|---|---|---|---|
| Fitzpatrick I-II | 140 | 56 | 0.68 (0.62-0.74) | 0.27 (0.19-0.35) | PASS |
| Fitzpatrick III-IV | 106 | 32 | 0.72 (0.66-0.78) | 0.32 (0.22-0.44) | PASS |
| Fitzpatrick V-VI | 10 | 9 | 0.77 (0.67-0.88) | 0.17 (0.05-0.31) | PASS |
Results Summary:
- All Fitzpatrick skin types met the mAP@50 and rMAE success criteria.
- The model performs consistently across different skin tones, indicating effective generalization.
2. Severity Analysis:
- Performance stratified by hive count severity: Clear skin (no visible hives), Mild (1-19 hives), Moderate (20-49 hives), Severe (50+ hives)
- Success criterion: mAP@50 ≥ 0.56 or rMAE ≤ 0.45 for all severity categories
| Subpopulation | Num. training images | Num. validation images | mAP@50 | rMAE | Outcome |
|---|---|---|---|---|---|
| Clear | 30 | 10 | N/A | 0.10 (0.00-0.30) | PASS |
| Mild | 168 | 53 | 0.69 (0.62-0.75) | 0.34 (0.26-0.44) | PASS |
| Moderate | 52 | 29 | 0.73 (0.67-0.79) | 0.22 (0.16-0.30) | PASS |
| Severe | 6 | 5 | 0.60 (0.48-0.68) | 0.22 (0.07-0.38) | PASS |
Results Summary:
- The model demonstrated reliable overall performance across different severity levels, with mean mAP and rMAE values within acceptable limits.
- The lower confidence bound for mAP@50 in Severe cases (0.48) falls slightly below the success criterion (0.56), presumably caused by the small sample size and by unclear lesion boundaries in images with numerous overlapping hives.
- Future data collection should prioritize expanding the dataset for Clear and Severe severity categories to reduce confidence interval variability and improve model robustness for edge cases.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training
- Pre-training on diverse data to improve generalization
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types and severity levels, with most success criteria met.
- Severe cases showed higher variability likely due to unclear lesion boundaries, suggesting the need for further data collection, and more precise data annotation and model refinement.
Body Surface Segmentation
Model Overview
Reference: R-TF-028-001 AI/ML Description - Body Surface Segmentation section
This model segments affected body surface area.
Clinical Significance: Segmenting the full body surface area is useful for quantifying, as a percentage, the extent of skin involvement in various dermatological conditions.
Data Requirements and Annotation
Model-specific annotation: The COCO dataset annotations were used for body surface segmentation. Images in the COCO dataset containing humans were selected, and polygon annotations corresponding to body parts were converted into binary masks representing skin areas. Images containing more than one person were excluded to avoid ambiguity in segmentation.
Dataset statistics:
- Images with body surface segmentation annotations: 3396 images
- Training set: 90% of the images plus 10% of healthy skin images
- Validation set: 10% of the images
- Test set: 10% of the images
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 512 pixels resolution
Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B2 was selected over larger variants (B3, B4) because body surface segmentation is a binary task (skin vs. non-skin) with relatively well-defined boundaries that does not require the additional model capacity of larger architectures. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B2 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
- Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, and gamma set to decay the learning rate to 1e-2 of the starting learning rate by the end of training (illustrated in the sketch after this list). Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Sigmoid activation to obtain probability distributions
- Binary classification thresholds to convert probabilities to binary masks.
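A minimal sketch of the backbone/head combination, the combined loss, the optimizer/scheduler, and the sigmoid post-processing described above, using segmentation_models_pytorch; the loss weights shown (0.5/0.5) are illustrative placeholders, since the report tuned them by hyperparameter search:

```python
import torch
import segmentation_models_pytorch as smp
from segmentation_models_pytorch.losses import JaccardLoss

# DeepLabV3+ head on an ImageNet-pretrained EfficientNet-B2 backbone,
# with a single output channel for the binary skin / non-skin mask.
model = smp.DeepLabV3Plus(
    encoder_name="efficientnet-b2",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

# Combined loss: cross-entropy with logits plus Jaccard loss.
bce = torch.nn.BCEWithLogitsLoss()
jaccard = JaccardLoss(mode="binary")
def loss_fn(logits, target):
    return 0.5 * bce(logits, target) + 0.5 * jaccard(logits, target)  # weights are placeholders

# AdamW plus StepLR stepped every epoch, with gamma chosen so the learning
# rate ends at 1e-2 of its starting value after 400 epochs.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.01 ** (1 / 400))

# Inference-time post-processing: sigmoid, then binary threshold.
x = torch.randn(1, 3, 512, 512)        # normalized RGB input (dummy here)
with torch.no_grad():
    mask = torch.sigmoid(model(x)) > 0.5
```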
Performance Results
Success criteria:
The model must achieve the following segmentation performance on the test set:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.91 (0.899, 0.919) | 169 | ≥ 0.85 | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with expert polygon annotations
- Multi-annotator consensus for segmentation masks (minimum 2 dermatologists)
- Evaluation across lesion sizes and morphologies
Complete Test Protocol:
- Input: RGB images with calibration markers
- Processing: Semantic segmentation inference
- Output: Predicted masks and calculated BSA%
- Reference standard: Expert-annotated masks and reference measurements
- Statistical analysis: IoU, Dice, area correlation, Bland-Altman
Data Analysis Methods:
- IoU: intersection over union of predicted and reference-standard masks (see the sketch after this list)
- Dice: 2×intersection/(area_pred + area_gt)
- Pixel-wise sensitivity, specificity, accuracy
- Calibrated area calculation
- Bland-Altman plots for BSA% agreement
- Pearson/Spearman correlation for area measurements
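A minimal sketch of the pixel-wise IoU and Dice computations listed above, assuming boolean NumPy masks of identical shape:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0   # empty-vs-empty treated as perfect

def dice(pred, gt):
    """Dice coefficient: 2 * intersection / (area_pred + area_gt)."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2 * inter / denom if denom else 1.0
```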
Test Conclusions:
The model achieved an IoU of 0.91 (95% CI: 0.899, 0.919) on the test set, surpassing the success criterion of ≥ 0.85, indicating robust performance in body surface area segmentation.
Image example of the model output:
To visualize the model's segmentation performance, below is an example image showcasing the body surface area segmentation output.

Bias Analysis and Fairness Evaluation
Objective: Ensure BSA segmentation performs consistently across skin types, lesion sizes, and anatomical locations.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Dice scores disaggregated by skin type
- Recognition that lesion boundaries may have different contrast on darker skin
- Success criterion: Dice ≥ 0.80 across all Fitzpatrick types
2. Lesion Size Analysis:
- Small (less than 5 cm²), Medium (5-50 cm²), Large (greater than 50 cm²)
- Success criterion: IoU ≥ 0.70 for all sizes
3. Lesion Morphology Analysis:
- Well-defined vs. ill-defined borders
- Regular vs. irregular shapes
- Success criterion: Dice variation ≤ 10% across morphologies
4. Anatomical Site Analysis:
- Flat surfaces vs. curved/folded areas
- Success criterion: IoU variation ≤ 20% across sites
5. Disease Condition Analysis:
- Psoriasis, atopic dermatitis, vitiligo performance
- Success criterion: Dice ≥ 0.80 for each condition
6. Image Quality Impact:
- Performance vs. DIQA scores, angle, distance
- Mitigation: Quality filtering, perspective correction
Bias Mitigation Strategies:
- Balanced training data across Fitzpatrick types
- Multi-scale augmentation
- Boundary refinement post-processing
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU Fitzpatrick I-II | 0.917 (0.906, 0.927) | 104 | ≥ 0.85 | PASS |
| IoU Fitzpatrick III-IV | 0.898 (0.873, 0.92) | 47 | ≥ 0.85 | PASS |
| IoU Fitzpatrick V-VI | 0.899 (0.861, 0.932) | 18 | ≥ 0.85 | PASS |
Bias Analysis Conclusion:
The model's segmentation performance, assessed using the IoU metric across all available Fitzpatrick scale categories, successfully meets the predefined Success Criterion of ≥ 0.85. For the Fitzpatrick I-II group, the model achieved a mean IoU of 0.917 with a 95% CI of (0.906, 0.927). Crucially, the PASS criterion is satisfied as the lower bound of the model's 95% CI (0.906) is well above the Success Criterion (0.85). The Fitzpatrick III-IV group demonstrates comparably strong performance with a mean IoU of 0.898 (95% CI: 0.873, 0.92). Similarly, the Fitzpatrick V-VI group, despite having the smallest sample size, exhibits a high mean IoU of 0.899 (95% CI: 0.861, 0.932). Overall, the consistently high mean IoU values and the satisfaction of the CI-based PASS criterion across all Fitzpatrick scale categories demonstrate that the model achieves high segmentation quality that is robust across the spectrum of skin tones, indicating minimal bias.
Wound Surface Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Wound Surface Quantification section
This model segments wound areas for accurate wound size monitoring and healing progress assessment.
Clinical Significance: Wound area tracking is essential for treatment effectiveness evaluation and clinical documentation.
Data Requirements and Annotation
Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts traced precise boundaries of affected skin:
- Polygon tool for accurate edge delineation
- Separate polygons for non-contiguous patches
- High spatial precision for reliable area calculation
- Multi-annotator consensus for boundary agreement
Dataset statistics:
- Images with wound annotations: 1038 images
- Training set: 90% of the wound images plus 10% of healthy skin images
- Validation set: 10% of the wound images
- Test set: 10% of the wound images
- Conditions: Various wound types (e.g., diabetic ulcers, pressure sores, surgical wounds)
Training Methodology
Architecture: EfficientNet-B4, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output for each wound characteristic.
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 512 pixels resolution
Other architectures and resolutions were evaluated during model selection, with EfficientNet-B4 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B4 was selected over smaller variants (B2, B3) because wound surface quantification is a complex multi-class segmentation task requiring simultaneous segmentation of seven distinct tissue types (wound bed, bone/cartilage/tendon, necrosis, orthopedic material, maceration, biofilm/slough, and granulation tissue), each with subtle visual differences and often overlapping boundaries. The increased model capacity of B4 was necessary to capture these fine-grained distinctions. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B4 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
- Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, and gamma set to decay the learning rate to 1e-2 of the starting learning rate by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Sigmoid activation to obtain probability distributions for each wound characteristic
- Binary classification thresholds to convert probabilities to binary masks.
Performance Results
Performance evaluated using IoU and F1-Score compared to expert consensus.
- Wound Bed
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.88 (0.74, 0.90) | 109 | ≥ 0.68 | PASS |
| F1 | 0.92 (0.82, 0.94) | 109 | ≥ 0.76 | PASS |
- Bone/Cartilage/Tendon
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.63 (0.57, 0.70) | 109 | ≥ 0.48 | PASS |
| F1 | 0.67 (0.59, 0.75) | 109 | ≥ 0.49 | PASS |
- Necrosis
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.62 (0.55, 0.68) | 109 | ≥ 0.58 | PASS |
| F1 | 0.67 (0.60, 0.73) | 109 | ≥ 0.60 | PASS |
- Orthopedic Material
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.59 (0.51, 0.67) | 109 | ≥ 0.46 | PASS |
| F1 | 0.61 (0.53, 0.71) | 109 | ≥ 0.46 | PASS |
- Maceration
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.51 (0.46, 0.56) | 109 | ≥ 0.50 | PASS |
| F1 | 0.54 (0.48, 0.60) | 109 | ≥ 0.52 | PASS |
- Biofilm/Slough
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.50 (0.41, 0.59) | 109 | ≥ 0.59 | PASS |
| F1 | 0.56 (0.47, 0.65) | 109 | ≥ 0.64 | PASS |
- Granulation Tissue
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU | 0.63 (0.57, 0.70) | 109 | ≥ 0.49 | PASS |
| F1 | 0.67 (0.59, 0.75) | 109 | ≥ 0.52 | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert wound annotations
- Processing: Model inference with probability distribution output
- Output: Predicted wound segmentation masks
- Reference standard: Consensus segmentation masks from multiple expert dermatologists
- Statistical analysis: IoU, Accuracy, F1-score, with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- IoU and F1 calculation with Confidence Intervals, comparing model predictions to expert consensus masks
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
The model's segmentation performance, evaluated using both IoU and F1-Score, demonstrates successful capability across all tested wound components, consistently meeting the predefined Success Criteria established by expert consensus. For the primary category, Wound Bed, the model achieved exceptionally high metrics, with a mean IoU of 0.88 (95% CI: 0.74, 0.90) and a mean F1-Score of 0.92 (95% CI: 0.82, 0.94). Strong performance is also noted for challenging yet crucial categories such as Bone/Cartilage/Tendon and Granulation Tissue, for which both metrics are well above their respective criteria. Even for metrics with closer values, such as the IoU for Necrosis, the mean of 0.62 is still above the Success Criterion of 0.58. The one instance where the mean falls below its criterion is Biofilm/Slough (IoU 0.50 vs. ≥ 0.59; F1 0.56 vs. ≥ 0.64), yet the upper CI bounds (0.59 and 0.65) reach or exceed the criteria. This comprehensive performance across diverse tissues confirms the model's robustness and accuracy in clinically relevant segmentation tasks.
Image example of the model output:
To visualize the model's segmentation performance, below are example images showcasing the wound area segmentation:
Biofilm/Slough example:

Orthopedic Material example:
Bias Analysis and Fairness Evaluation
Objective: Ensure surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
- Fitzpatrick I-II
| Class | IoU: Mean (95% CI) | F1: Mean (95% CI) | Suc. Cr. IoU | Suc. Cr. F1 | Outcome | # samples | # with lesion |
|---|---|---|---|---|---|---|---|
| Wound Bed | 0.88 (0.72, 0.90) | 0.92 (0.81, 0.94) | ≥ 0.68 | ≥ 0.76 | PASS | 62 | 60 |
| Bone/Cartilage/Tendon | 0.56 (0.50, 0.65) | 0.58 (0.50, 0.69) | ≥ 0.48 | ≥ 0.49 | PASS | 59 | 9 |
| Necrosis | 0.55 (0.47, 0.64) | 0.61 (0.52, 0.70) | ≥ 0.58 | ≥ 0.60 | PASS | 59 | 23 |
| Orthopedic Material | 0.64 (0.48, 0.89) | 0.67 (0.49, 0.94) | ≥ 0.46 | ≥ 0.46 | PASS | 54 | 4 |
| Maceration | 0.54 (0.47, 0.60) | 0.57 (0.49, 0.64) | ≥ 0.50 | ≥ 0.52 | PASS | 60 | 19 |
| Biofilm/Slough | 0.49 (0.34, 0.61) | 0.54 (0.40, 0.67) | ≥ 0.59 | ≥ 0.64 | PASS | 56 | 41 |
| Granulation Tissue | 0.42 (0.28, 0.54) | 0.46 (0.32, 0.59) | ≥ 0.49 | ≥ 0.52 | PASS | 50 | 31 |
- Fitzpatrick III-IV
| Class | IoU: Mean (95% CI) | F1: Mean (95% CI) | Suc. Cr. IoU | Suc. Cr. F1 | Outcome | # samples | # with lesion |
|---|---|---|---|---|---|---|---|
| Wound Bed | 0.77 (0.70, 0.83) | 0.85 (0.79, 0.90) | ≥ 0.68 | ≥ 0.76 | PASS | 43 | 43 |
| Bone/Cartilage/Tendon | 0.71 (0.62, 0.80) | 0.77 (0.67, 0.86) | ≥ 0.48 | ≥ 0.49 | PASS | 42 | 9 |
| Necrosis | 0.70 (0.63, 0.77) | 0.76 (0.69, 0.83) | ≥ 0.58 | ≥ 0.60 | PASS | 47 | 20 |
| Orthopedic Material | 0.56 (0.50, 0.68) | 0.58 (0.50, 0.72) | ≥ 0.46 | ≥ 0.46 | PASS | 49 | 6 |
| Maceration | 0.49 (0.41, 0.57) | 0.52 (0.43, 0.62) | ≥ 0.50 | ≥ 0.52 | PASS | 43 | 12 |
| Biofilm/Slough | 0.51 (0.38, 0.63) | 0.57 (0.44, 0.69) | ≥ 0.59 | ≥ 0.64 | PASS | 50 | 34 |
| Granulation Tissue | 0.45 (0.31, 0.59) | 0.50 (0.36, 0.64) | ≥ 0.49 | ≥ 0.52 | PASS | 46 | 30 |
- Fitzpatrick V-VI
| Class | IoU: Mean (95% CI) | F1: Mean (95% CI) | Suc. Cr. IoU | Suc. Cr. F1 | Outcome | # samples | # with lesion |
|---|---|---|---|---|---|---|---|
| Wound Bed | 0.68 (0.24, 0.93) | 0.72 (0.26, 0.96) | ≥ 0.68 | ≥ 0.76 | PASS | 4 | 4 |
| Bone/Cartilage/Tendon | 1.0 (1.0, 1.0) | 1.0 (1.0, 1.0) | ≥ 0.48 | ≥ 0.49 | PASS (not meaningful) | 8 | 0 |
| Necrosis | 0.44 (0.0, 0.87) | 0.47 (0.0, 0.93) | ≥ 0.58 | ≥ 0.60 | PASS | 2 | 2 |
| Orthopedic Material | 0.56 (0.56, 1.0) | 0.60 (0.61, 1.0) | ≥ 0.46 | ≥ 0.46 | PASS | 6 | 1 |
| Maceration | 0.33 (0.04, 0.63) | 0.36 (0.06, 0.69) | ≥ 0.50 | ≥ 0.52 | PASS | 7 | 3 |
| Biofilm/Slough | 0.48 (0.30, 0.85) | 0.62 (0.46, 0.92) | ≥ 0.59 | ≥ 0.64 | PASS | 3 | 3 |
| Granulation Tissue | 0.53 (0.33, 0.69) | 0.56 (0.35, 0.74) | ≥ 0.49 | ≥ 0.52 | PASS | 13 | 7 |
Bias Analysis Conclusion:
The model's segmentation performance, evaluated using both IoU and F1-Score across distinct wound components, consistently demonstrates success in meeting the expert-derived Success Criterion thresholds for all Fitzpatrick scale groups.
For the Fitzpatrick I-II group, highly reliable performance is observed for the Wound Bed component, where both the IoU (0.88) and F1-Score (0.92) are well above their respective Success Criteria (≥ 0.68 and ≥ 0.76). For the Fitzpatrick III-IV group, the model maintains robust performance, with the Wound Bed IoU (0.77) and F1-Score (0.85) again exceeding the Success Criteria. In both groups, some minority components (e.g., Biofilm/Slough and Granulation Tissue) have mean values slightly below their thresholds while their upper CI bounds reach or exceed them; the smaller sample sizes for these components warrant cautious interpretation and are the cause of the wider confidence intervals. For the Fitzpatrick V-VI group, despite the limited sample size, the model achieves satisfactory results. The Wound Bed IoU (0.68) meets the Success Criterion exactly, and for the other components the upper CI values are above the thresholds. The small sample sizes in this group lead to wider confidence intervals, indicating greater uncertainty in these estimates. In particular, the metrics for Bone/Cartilage/Tendon are not meaningful due to the absence of lesions in the test samples.
Overall, the model demonstrates equitable performance across all Fitzpatrick skin types. The consistent success in meeting the Success Criteria across all groups indicates that the model is robust and generalizes well across diverse skin tones, effectively mitigating potential biases. However, the limited sample sizes in the Fitzpatrick V-VI group highlight the need for further data collection to enhance confidence in these results.
Erythema Surface Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Erythema Surface Quantification section
This model segments erythematous areas for inflammation extent assessment in various dermatological conditions.
Clinical Significance: Erythema area quantification aids in severity scoring and treatment monitoring.
Data Requirements and Annotation
Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts traced precise boundaries of affected skin:
- Polygon tool for accurate edge delineation
- Separate polygons for non-contiguous patches
- High spatial precision for reliable area calculation
- Multi-annotator consensus for boundary agreement
Dataset statistics:
- Images with erythema segmentation annotations: 3088 images
- Training set: 90% of the erythema images plus 10% of healthy skin images
- Validation set: 10% of the erythema images
- Test set: 10% of the erythema images
- Conditions: Various dermatological conditions with erythema (e.g., psoriasis, atopic dermatitis, wound healing)
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output (erythematous vs. non-erythematous skin).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 512 pixels resolution
Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B2 was selected over larger variants (B3, B4) because erythema segmentation is a binary task (erythematous vs. non-erythematous skin) where the primary visual feature is color change (redness), which does not require the additional model capacity of larger architectures. The larger training dataset (3088 images) also allowed effective training with the more efficient B2 architecture. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B2 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
- Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, and gamma set to decay the learning rate to 1e-2 of the starting learning rate by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Sigmoid activation to obtain a per-pixel erythema probability map
- Binary classification thresholds to convert probabilities to binary masks.
Performance Results
Success criteria:
The model must achieve the following segmentation performance on the test set:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model IoU | 0.768 (0.744, 0.79) | 308 | ≥ 0.61 | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus (mean of expert scores) rounded to nearest integer
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert annotations
- Processing: Model inference with probability distribution output
- Output: Predicted erythema segmentation masks
- Reference standard: Consensus masks from expert annotators
- Statistical analysis: IoU, Accuracy, and F1-score with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- IoU calculation with Confidence Intervals between predicted and reference standard masks
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall IoU of 0.768 (95% CI: 0.744, 0.79) on the test set of 308 images.
Image example of the model output:
To visualize the model's segmentation performance, below is an example image showcasing the erythema segmentation output:

Bias Analysis and Fairness Evaluation
Objective: Ensure erythema surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU Fitzpatrick I-II | 0.746 (0.713, 0.778) | 151 | ≥ 0.61 | PASS |
| IoU Fitzpatrick III-IV | 0.800 (0.767, 0.832) | 125 | ≥ 0.61 | PASS |
| IoU Fitzpatrick V-VI | 0.749 (0.664, 0.824) | 32 | ≥ 0.61 | PASS |
Bias Analysis Conclusion:
The model demonstrated excellent performance across all Fitzpatrick skin type groups, successfully meeting the Success Criterion of ≥ 0.61 for the IoU. A key strength is that the 95% Confidence Interval (CI) for each group lies entirely above the criterion: the lower bound of the 95% CI was 0.713 for Fitzpatrick I-II, 0.767 for Fitzpatrick III-IV, and 0.664 for Fitzpatrick V-VI, all above the Success Criterion. This consistent performance indicates a high degree of generalizability and low bias across the Fitzpatrick spectrum, reinforcing the conclusion of PASS for all evaluated groups.
Hair Loss Surface Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Hair Loss Surface Quantification section
This model segments areas of hair loss for alopecia severity assessment and treatment monitoring.
Clinical Significance: Hair loss area quantification is critical for alopecia areata severity scoring (SALT score).
Data Requirements and Annotation
Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Model-specific annotation: Extent annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Medical experts traced precise boundaries of affected skin:
- Polygon tool for accurate edge delineation
- Separate polygons for non-contiguous patches
- High spatial precision for reliable area calculation
- Multi-annotator consensus for boundary agreement
Dataset statistics:
- Images with hair loss segmentation annotations: 1826 images
- Training set: 1026 alopecia images
- Validation set: 10% of the training images
- Test set: 800 alopecia images
- Conditions: Various alopecia types (e.g., alopecia areata, androgenetic alopecia)
Training Methodology
Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks, with a final three-class output layer (background, scalp without hair loss, scalp with hair loss).
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 272 pixels resolution
Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. EfficientNet-B2 was selected over larger variants (B3, B4) because hair loss segmentation involves a three-class task with relatively distinct visual features (scalp texture vs. hair-covered areas), which does not require the additional model capacity of larger architectures. The lower input resolution (272x272) was sufficient for this task due to the macro-scale nature of hair loss patterns on the scalp. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B2 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
- Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, and gamma set to decay the learning rate to 1e-2 of the starting learning rate by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Softmax activation to obtain probability distributions for each class
- Argmax to convert probabilities to class labels
- Percentage area calculation for hair loss quantification
- Aggregation of percentages from the 4 head views (front, back, left, right), as sketched below
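A minimal sketch of this post-processing chain, under stated assumptions: the class indices, the hair-loss percentage being computed relative to total scalp area, and the simple mean across views are illustrative choices, not the confirmed production logic:

```python
import numpy as np

BACKGROUND, SCALP, HAIR_LOSS = 0, 1, 2   # assumed class ordering

def hair_loss_pct(logits):
    """logits: (3, H, W) class scores for one head view."""
    labels = logits.argmax(axis=0)       # softmax is monotonic, so argmax suffices
    scalp = np.isin(labels, [SCALP, HAIR_LOSS]).sum()
    return 100.0 * (labels == HAIR_LOSS).sum() / scalp if scalp else 0.0

# front_logits, back_logits, left_logits, right_logits: hypothetical model outputs
views = [front_logits, back_logits, left_logits, right_logits]
overall_pct = float(np.mean([hair_loss_pct(v) for v in views]))
```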
Performance Results
Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.
Success criterion: RMAE ≤ 9.6%
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model RMAE | 7.08% (5.63%, 8.93%) | 800 | ≤ 9.6% | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with expert reference standard
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert alopecia percentage annotations
- Processing: Model inference with probability distribution output
- Output: Predicted hair loss segmentation masks and percentage area calculations
- Reference standard: Expert percentage area annotations
- Statistical analysis: RMAE, with Confidence Intervals calculated using bootstrap resampling (2000 iterations), and IoU.
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall RMAE of 7.08% (95% CI: 5.63%, 8.93%) on the test set of 800 samples. The model demonstrated robust hair loss quantification capabilities across diverse skin types and alopecia presentations, indicating its suitability for clinical application in hair loss surface quantification.
Image example of the model output:
To visualize the model's segmentation performance, below is an example image showcasing the hair loss segmentation output:

Bias Analysis and Fairness Evaluation
Objective: Ensure hair loss surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Success criterion: Consistent RMAE across Fitzpatrick types within acceptable limits
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| RMAE Fitzpatrick I-II | 6.9% (4.85%, 9.66%) | 100 | ≤ 9.6% | PASS |
| RMAE Fitzpatrick III-IV | 7.23% (4.97%, 10.4%) | 86 | ≤ 9.6% | PASS |
| RMAE Fitzpatrick V-VI | 7.46% (3.64%, 12.4%) | 14 | ≤ 9.6% | PASS |
Bias Analysis Conclusion:
The model's performance, assessed by the RMAE across all available Fitzpatrick scale categories, successfully meets the predefined Success Criterion of ≤ 9.6% established by annotator variability. For the Fitzpatrick I-II group, the model achieved a mean RMAE of 6.9% with a 95% CI of (4.85%, 9.66%), below the Success Criterion. The Fitzpatrick III-IV group also demonstrates strong performance with a mean RMAE of 7.23% (95% CI: 4.97%, 10.4%), although the upper CI bound slightly exceeds the criterion. Similarly, the Fitzpatrick V-VI group, despite having the smallest sample size, exhibits a mean RMAE of 7.46% (95% CI: 3.64%, 12.4%) that is comfortably below the criterion, with a wider CI reflecting the limited number of samples. Overall, the consistently low mean RMAE values across all Fitzpatrick scale categories demonstrate that the model achieves an error rate competitive with human annotator agreement, indicating minimal bias with respect to prediction error across the spectrum of skin tones.
Nail Lesion Surface Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Nail Lesion Surface Quantification section
This model segments the nail plate and any visible nail lesion for nail lesion assessment.
Clinical Significance: Nail involvement percentage is used in some severity scores such as NAPSI (Nail Psoriasis Severity Index).
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping annotations were used to find 2479 images of hands and feet showing nails with and without visible lesions.
Model-specific annotation: Polygon annotation of the nail plate and affected nail areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Trained annotators labelled images of nails with and without visible lesions following standardized clinical annotation protocols. Annotations consisted of drawing segmentation masks (polygons) covering the nail plate and each affected nail area.
Dataset statistics:
The dataset is split at patient level to avoid data leakage. The training, validation, and test sets contain images from different patients.
- Images of healthy nails: 634
- Images of nails with visible lesions: 1845
- Training set: 1787 images
- Validation set: 326 images
- Test set: 366 images
- Total images: 2479
Training Methodology
The best segmentation backbone and architecture were determined after a thorough exploration of the existing approaches suitable for the task at hand:
- Backbones: EfficientNet, MobileNet, ResNet
- Architectures: UNet, UNet++, FPN
Architecture: UNet segmentation network with a ResNet101 backbone
- Deep learning model tailored for multi-class image segmentation (background, nail plate, nail lesion)
- Transfer learning from pre-trained weights (ImageNet dataset)
- Input size: 480x480 pixels
Training approach:
The model has been trained using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.0001
- Batch size: 16
- Training duration: 40 epochs
Pre-processing:
- In the training stage, input images were cropped and/or resized to 480x480 pixels when needed. In the validation and test stage, the inputs were directly resized to 480x480 pixels.
- Data augmentation: geometric, color, and light augmentations.
Post-processing:
- Confidence threshold of 0.5 applied to each channel of the output mask to generate positive and negative pixel-level predictions for each class (background, nail plate, nail lesion), as sketched below.
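A minimal sketch of this per-channel thresholding, assuming a (3, H, W) logits tensor ordered as background, nail plate, nail lesion:

```python
import torch

def to_binary_masks(logits, threshold=0.5):
    """logits: (3, H, W) raw network output; returns one boolean mask per class."""
    probs = torch.sigmoid(logits)        # per-channel probabilities
    return probs > threshold             # (3, H, W) boolean tensor
```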
Performance Results
Performance is evaluated using Intersection over Union (IoU), also called the Jaccard index. The IoU is computed for the nail plate and nail lesion classes. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). Success criteria are defined as IoU ≥ 0.80 for overall nail plate segmentation and IoU ≥ 0.70 for nail lesion segmentation.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| IoU (overall nail segmentation) | 0.8900 (95% CI: [0.8712-0.9061]) | ≥ 0.80 | PASS |
| IoU (nail lesion segmentation) | 0.8195 (95% CI: [0.7934-0.8418]) | ≥ 0.70 | PASS |
Verification and Validation Protocol
Test Design:
- Compare predicted and ground truth segmentation masks of nail plates and nail lesions
- Evaluation across diverse skin tones
Complete Test Protocol:
- Input: RGB images from test set with nail plate and lesion annotations from trained professionals
- Processing: Semantic segmentation inference
- Output: Predicted class probabilities for nail plate and nail lesion, converted to binary outputs (0/1) using a confidence threshold of 0.50.
- Ground truth: Expert-annotated segmentation masks
- Statistical analysis: IoU (nail plate and nail lesion)
Data Analysis Methods:
- IoU of nail plate and nail lesion masks with a confidence threshold of 0.50
Test Conclusions:
- The model met all success criteria, demonstrating reliable segmentation of the nail plate and affected nail areas.
- The model showed robustness across different skin tones and severities, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Ensure nail segmentation performs consistently across demographic subpopulations.
Subpopulation Analysis Protocol:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark)
- Success criterion: IoU ≥ 0.80 for overall nail segmentation and IoU ≥ 0.70 for lesion segmentation, for all Fitzpatrick types.
| Fitzpatrick Skin Type | No. images | IoU (overall nail segmentation) | IoU (nail lesion segmentation) |
|---|---|---|---|
| I-II | 238 | 0.8787 (95% CI: [0.8568, 0.8997]) | 0.8193 (95% CI: [0.7871, 0.8494]) |
| III-IV | 73 | 0.9045 (95% CI: [0.8665, 0.9366]) | 0.8331 (95% CI: [0.7708, 0.8873]) |
| V-VI | 55 | 0.9214 (95% CI: [0.9012, 0.9392]) | 0.8017 (95% CI: [0.7280, 0.8710]) |
Results Summary:
- All Fitzpatrick skin types met the IoU success criteria.
- The model performs consistently across different skin tones, indicating effective generalization.
- Future data collection should prioritize expanding the dataset for underrepresented skin types to reduce confidence interval variability and improve overall model robustness.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training
- Pre-training on diverse data to improve generalization
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types, with all success criteria met.
Hypopigmentation or Depigmentation Surface Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Hypopigmentation or Depigmentation Surface Quantification section
This model segments hypopigmented or depigmented areas for vitiligo extent assessment and repigmentation tracking.
Clinical Significance: Depigmentation area is essential for assessing disease severity.
Data Requirements and Annotation
Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)
Model-specific annotation: Extent Annotation (R-TF-028-024 Data Annotation Instructions - Non-clinical Data)
Medical experts traced precise boundaries of affected skin:
- Polygon tool for accurate edge delineation
- Separate polygons for non-contiguous patches
- High spatial precision for reliable area calculation
- Multi-annotator consensus for boundary agreement
Dataset statistics:
- Images with hypopigmentation segmentation annotations: 970 images
- Training set: 90% of the hypopigmentation images plus 10% of healthy skin images
- Validation set: 10% of the hypopigmentation images
- Test set: 10% of the hypopigmentation images
- Conditions: Vitiligo and other hypopigmentation disorders
Training Methodology
Architecture: EfficientNet-B4, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output.
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 512 pixels resolution
Other architectures and resolutions were evaluated during model selection, with EfficientNet-B4 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B4 was selected over smaller variants (B2, B3) because hypopigmentation segmentation requires detection of subtle color variations that can be challenging to distinguish from normal skin tone variations, particularly across different Fitzpatrick skin types. The increased model capacity of B4 was necessary to capture these fine-grained pigmentation differences and ensure robust performance across diverse skin tones. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B4 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
- Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, with gamma chosen so that the learning rate decays to 1% of its starting value by the end of training (i.e., gamma ≈ 0.01^(1/400) ≈ 0.989). Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Sigmoid activation to obtain per-pixel probabilities
- A binary classification threshold to convert probabilities to binary masks (a minimal inference sketch follows).
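As a minimal illustrative sketch only, the inference path described above (EfficientNet-B4 encoder, DeepLabV3+ head, sigmoid, threshold) could be assembled as follows. The use of the segmentation_models_pytorch library and the 0.5 threshold value are assumptions; this report does not specify the implementation library or the exact threshold for this model.

```python
# Illustrative sketch only: EfficientNet-B4 + DeepLabV3+ binary segmentation
# with sigmoid post-processing. Library choice (segmentation_models_pytorch)
# and the 0.5 threshold are assumptions, not the released implementation.
import torch
import segmentation_models_pytorch as smp

model = smp.DeepLabV3Plus(
    encoder_name="efficientnet-b4",   # EfficientNet-B4 encoder backbone
    encoder_weights="imagenet",       # ImageNet transfer learning
    in_channels=3,                    # RGB input
    classes=1,                        # single-channel binary output
)

model.eval()
with torch.no_grad():
    image = torch.randn(1, 3, 512, 512)   # RGB image at 512x512 resolution
    logits = model(image)                 # raw per-pixel logits
    probs = torch.sigmoid(logits)         # per-pixel probabilities
    mask = (probs >= 0.5).long()          # binary mask (threshold value assumed)
```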
Performance Results
Performance is evaluated using Intersection over Union (IoU) compared to expert consensus.
Success criterion: IoU ≥ 0.69 (based on scientific literature and expert consensus, accounting for inter-observer variability in hypopigmentation segmentation tasks).
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model IoU | 0.712 (0.685, 0.737) | 194 | ≥ 0.69 | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus segmentation masks
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert hypopigmentation segmentation annotations
- Processing: Model inference with per-pixel probability output
- Output: Predicted hypopigmentation segmentation masks
- Reference standard: Consensus segmentation masks from multiple expert dermatologists
- Statistical analysis: IoU, Accuracy, F1-score, with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- IoU calculation with Confidence Intervals: Intersection over Union comparing model segmentation predictions to expert consensus masks
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
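For illustration, the bootstrap procedure described above can be sketched as follows; the function and variable names are hypothetical.

```python
# Illustrative bootstrap of per-image IoU scores: 2000 resamples, 95%
# percentile confidence interval for the mean. Names are hypothetical.
import numpy as np

def bootstrap_mean_ci(scores: np.ndarray, n_iter: int = 2000, alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.empty(n_iter)
    for i in range(n_iter):
        resample = rng.choice(scores, size=n, replace=True)  # sample with replacement
        means[i] = resample.mean()
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lower, upper)

# Example: mean_iou, (ci_low, ci_high) = bootstrap_mean_ci(np.array(per_image_iou))
```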
Test Conclusions:
Model performance met the predefined success criterion with an overall IoU of 0.712 (95% CI: 0.685, 0.737) on the test set of 194 samples. The model demonstrated robust segmentation capabilities across diverse skin types and hypopigmentation presentations, indicating its suitability for clinical application in hypopigmentation surface quantification.
Image example of the model output:
To visualize the model's segmentation performance, below is an example image showcasing the hypopigmentation segmentation output:

Bias Analysis and Fairness Evaluation
Objective: Ensure hypopigmentation surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU Fitzpatrick I-II | 0.69 (0.63, 0.74) | 64 | ≥ 0.69 (0.56, 0.79) | PASS |
| IoU Fitzpatrick III-IV | 0.72 (0.68, 0.76) | 93 | ≥ 0.69 (0.56, 0.79) | PASS |
| IoU Fitzpatrick V-VI | 0.74 (0.69, 0.78) | 37 | ≥ 0.69 (0.56, 0.79) | PASS |
Bias Analysis Conclusion:
The model's performance, assessed using the Intersection over Union (IoU) metric across all available Fitzpatrick scale categories, meets the predefined success criterion established by annotator variability. For the Fitzpatrick I-II group, the model achieved a mean IoU of 0.69 (95% CI: 0.63, 0.74); the lower bound of the model's CI (0.63) lies above the lower bound of the annotator CI (0.56), and the mean IoU meets the success criterion (0.69). Performance is stronger for the Fitzpatrick III-IV group, with a mean IoU of 0.72 (95% CI: 0.68, 0.76), whose CI lower bound (0.68) clearly exceeds the annotator CI lower bound (0.56) and whose mean surpasses the success criterion. The Fitzpatrick V-VI group showed the highest mean IoU of 0.74 (95% CI: 0.69, 0.78); its CI lower bound (0.69) meets the success criterion outright and lies well above the annotator CI lower bound (0.56). Overall, the model demonstrates consistently high segmentation agreement, with a mean IoU that meets or exceeds the expert agreement criterion across all Fitzpatrick scale categories, indicating minimal segmentation quality bias.
Hyperpigmentation Surface Quantification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Hyperpigmentation Surface Quantification section
This model segments hyperpigmented areas.
Clinical Significance: Hyperpigmentation area quantification aids in severity assessment and treatment monitoring.
Data Requirements and Annotation
Model-specific annotation: The ISIC 2018 Task 1 Challenge dataset annotations were used for hyperpigmentation segmentation. Images in the ISIC dataset containing hyperpigmented lesions were selected, and polygon annotations corresponding to hyperpigmented areas were converted into binary masks representing affected skin.
Dataset statistics:
- Images with hyperpigmentation segmentation annotations: 3700 images
- Training set: 90% of the hyperpigmentation images plus 10% of healthy skin images
- Validation set: 10% of the hyperpigmentation images
- Test set: 10% of the hyperpigmentation images
- Conditions: Various pigmentation disorders
Training Methodology
Architecture: EfficientNet-B4, a convolutional neural network used as the encoder backbone, combined with a segmentation head adapted for binary output.
- Transfer learning from pre-trained weights (ImageNet)
- Input size: RGB images at 512x512 pixels
Other architectures and resolutions were evaluated during model selection, with EfficientNet-B4 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B4 was selected over smaller variants (B2, B3) because hyperpigmentation segmentation requires detection of subtle color variations that can be challenging to distinguish from normal skin tone variations, particularly across different Fitzpatrick skin types. The increased model capacity of B4 was necessary to capture these fine-grained pigmentation differences and ensure robust performance across diverse skin tones. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.
Training approach:
- Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
- Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the clinical appearance of the sign. The overall augmentation intensity was tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
- Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
- Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
- Segmentation head: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B4 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance, likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
- Loss function: Combined cross-entropy loss with logits and Jaccard loss; the associated weights were set based on a hyperparameter search (a sketch of this combined loss follows the Post-processing list below). Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to make a weighted loss unnecessary.
- Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
- Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
- Learning rate scheduler: StepLR with step size 1 epoch, with gamma chosen so that the learning rate decays to 1% of its starting value by the end of training (i.e., gamma ≈ 0.01^(1/400) ≈ 0.989). Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
- Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
- Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
Post-processing:
- Sigmoid activation to obtain per-pixel probabilities
- A binary classification threshold to convert probabilities to binary masks.
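As referenced in the training approach above, a minimal sketch of the combined cross-entropy-with-logits and Jaccard loss is shown below. The 0.5/0.5 weighting is a placeholder; the report states that the actual weights were set by a hyperparameter search.

```python
# Minimal sketch of the combined loss referenced above: binary cross-entropy
# with logits plus a soft Jaccard (IoU) loss. The 0.5/0.5 weighting is a
# placeholder; the report states the weights were set by hyperparameter search.
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  w_bce: float = 0.5, w_jaccard: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    union = probs.sum() + target.sum() - intersection
    jaccard = 1.0 - (intersection + eps) / (union + eps)  # soft IoU loss
    return w_bce * bce + w_jaccard * jaccard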
Performance Results
Performance evaluated using Intersection over Union (IoU) compared to expert consensus.
Success criterion: IoU ≥ 0.82 (performance based on expert inter-observer agreement)
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model IoU | 0.825 (0.809, 0.838) | 370 | ≥ 0.82 (0.79, 0.88) | PASS |
Verification and Validation Protocol
Test Design:
- Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
- Comparison against expert consensus segmentation masks
- Evaluation across diverse Fitzpatrick skin types and severity levels
Complete Test Protocol:
- Input: RGB images from test set with expert hyperpigmentation segmentation annotations
- Processing: Model inference with per-pixel probability output
- Output: Predicted hyperpigmentation segmentation masks
- Reference standard: Consensus segmentation masks from multiple expert dermatologists
- Statistical analysis: IoU, Accuracy, F1-score, with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
- Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
Data Analysis Methods:
- IoU calculation with Confidence Intervals: Intersection over Union comparing model segmentation predictions to expert consensus masks
- Inter-observer variability measurement
- Bootstrap resampling (2000 iterations) for 95% confidence intervals
Test Conclusions:
Model performance met the predefined success criterion with an overall IoU of 0.825 (95% CI: 0.809, 0.838) on the test set of 370 samples. The model demonstrated robust segmentation capabilities across diverse skin types and hyperpigmentation presentations, indicating its suitability for clinical application in hyperpigmentation surface quantification.
Image example of the model output:
To visualize the model's segmentation performance, below is an example image showcasing the hyperpigmentation segmentation output:

Bias Analysis and Fairness Evaluation
Objective: Ensure hyperpigmentation surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
- Comparison of model performance vs. expert inter-observer variability per skin type
Bias Mitigation Strategies:
- Training data balanced across Fitzpatrick types
Results Summary:
| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU Fitzpatrick I-II | 0.822 (0.806, 0.837) | 352 | ≥ 0.82 (0.79, 0.88) | PASS |
| IoU Fitzpatrick III-IV | 0.885 (0.85, 0.917) | 18 | ≥ 0.82 (0.79, 0.88) | PASS |
| IoU Fitzpatrick V-VI | N/A | 0 | N/A | N/A |
Bias Analysis Conclusion:
The model's performance, assessed using the Intersection over Union (IoU) metric, demonstrates robust segmentation capability across the available Fitzpatrick scale categories, exceeding the predefined success criterion established by annotator variability. For the Fitzpatrick I-II group, the model achieved a mean IoU of 0.822 (95% CI: 0.806, 0.837); the lower bound of the model's CI (0.806) lies above the lower bound of the annotator CI (0.79), indicating that the model's segmentation agreement is consistently within the range of expert agreement, and the mean IoU (0.822) meets the success criterion (0.82). The Fitzpatrick III-IV group, despite its smaller sample size, yielded a stronger mean IoU of 0.885 (95% CI: 0.85, 0.917); the entire CI lies substantially above the lower bound of the annotator CI (0.79), and the mean IoU significantly exceeds the success criterion. This performance across both available groups suggests a high level of agreement and minimal segmentation quality bias across Fitzpatrick scale categories. Analysis for the Fitzpatrick V-VI group is currently precluded by a lack of samples.
Skin Surface Segmentation
Model Overview
Reference: R-TF-028-001 AI/ML Description - Skin Surface Segmentation section
This model segments skin regions to distinguish skin (including lesions, lips, shallow hair, etc.) from non-skin areas (including clothing, background, dense hair, etc.).
Clinical Significance: Accurate skin segmentation is a prerequisite for calculating lesion percentages relative to visible skin area.
Data Requirements and Annotation
Compiled dataset: 50366 images divided into two sets:
- `clinical-set`: 18034 clinical and dermatoscopic images sourced from the ICD-11 dataset to cover a diverse range of skin conditions, body parts, and skin tones.
- `non-clinical-set`: 32332 non-dermatology-related images sourced from the Describable Texture Dataset, HGR, Schmugge, SFA, FSD, TexturePatch, abdominal, fashionpedia, and humanparsing datasets.
Model-specific annotation: Extent Annotation (R-TF-028-024 Data Annotation Instructions - Non-clinical Data)
- Images are annotated with a binary mask where 1 represents skin and 0 represents non-skin regions.
- Skin regions include healthy skin, lips, ears, nails, tattoos, skin lesions, low hair density areas where skin is visible (excluding scalp hair), skin visible through transparent glass lenses, watermarks placed over skin, skin from multiple persons, and marks or circles painted or drawn over the skin.
- Non-skin regions include background pixels, clothes, jewellery, glasses, eyes, teeth, eyebrows, scalp, dense hair (head hair, dense beards, etc.), medical material or instruments (forceps, gauze, plasters, etc.), surgical gloves, anonymisation bands, watermarks (if over non-skin), and dermatoscope shadow.
- The `clinical-set` contains both clinical and dermatoscopic images.
- The `non-clinical-set` contains non-dermatology related images.
- The `clinical-set` is annotated by trained personnel following the above specifications. Each image is annotated by a single annotator.
- Annotations for the `non-clinical-set` are sourced from their original authors. Original mask annotations are cleaned and standardized to match the above specifications. This standardization includes refining the lips, eyes, teeth, eyebrows, and nose holes. Images with minimal skin coverage were not included in this set.
Dataset statistics:
The dataset is split into training and validation sets. The split is performed at patient level, when subject information is available, to avoid data leakage.
- Images: 50366
- Train and validation sets contain 41074 and 9292 images respectively.
- Images can be clinical, dermatoscopic, or non-clinical and span a broad range of skin conditions, body parts, and skin tones.
Training Methodology
The model architecture and training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different image encoders (e.g., ConvNext and EfficientNet of different sizes), decoders (e.g., UNet, UNet++, and FPN), and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between performance and runtime efficiency.
Architecture:
The model is a binary semantic segmentation network designed to distinguish skin regions from non-skin areas. It uses an encoder-decoder architecture with skip connections.
- Encoder (Backbone):
- Model: EfficientNet-B1 (`timm-efficientnet-b1`)
- Pre-training: ImageNet weights
- Decoder:
- UNet++ decoder architecture with nested skip pathways
- Progressively upsamples encoder features to reconstruct segmentation masks
- Dense skip connections enable multi-scale feature fusion
- Segmentation Head:
- Final layer producing pixel-wise predictions
- Output: Single-channel binary mask (1 class)
- Predicts probability of each pixel being skin vs. non-skin
- Input/Output Specifications:
- Input channels: 3 (RGB images)
- Output channels: 1 (binary segmentation)
- Input size: 384×384 pixels
The model is implemented with PyTorch and the segmentation_models_pytorch (SMP) library.
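Since PyTorch and SMP are named explicitly, a minimal sketch of the described encoder-decoder assembly could look like this; only the parameters stated above are reflected, not the full release configuration.

```python
# Minimal sketch of the described assembly with segmentation_models_pytorch;
# only the parameters stated above are reflected, not the full release config.
import segmentation_models_pytorch as smp

model = smp.UnetPlusPlus(
    encoder_name="timm-efficientnet-b1",  # EfficientNet-B1 encoder
    encoder_weights="imagenet",           # ImageNet pre-training
    in_channels=3,                        # RGB input
    classes=1,                            # single-channel binary skin mask
)
```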
Training approach:
The training process employs a three-stage progressive training strategy, starting with a frozen encoder backbone, followed by full model fine-tuning, and concluding with a focused last phase using a refined dataset. The approach incorporates weighted dataset sampling, data augmentation, and mixed-precision training.
- Training Stages:
- Stage 1 (Frozen Encoder): Trains only the decoder and segmentation head for 14 epochs while keeping the encoder frozen.
- Stage 2 (Full Fine-tuning): Unfreezes the entire model and trains for 30 epochs with differential learning rates (encoder uses linear decay from base LR to 1×10⁻⁸).
- Stage 3 (Last Phase Refinement): Continues training for 40 additional epochs using a refined dataset composition that excludes the datasets likely to introduce noise into the training process.
- Training images are sampled with a weighted strategy to ensure a balanced representation of the clinical images in the learning process.
- Data Filtering: Excludes images with minimal skin coverage, images with more than 1 detected person, and manually identified mislabeled samples.
- Optimization:
- Optimizer: AdamW Schedule-Free with weight decay (0.001)
- Learning Rate: 0.005 with 3-epoch warmup (converted to step-based warmup)
- Differential Learning Rates: Encoder uses linear decay to 1×10⁻⁸; decoder and segmentation head maintain base learning rate
- Gradient Clipping: Gradients clipped to norm of 0.5
- Batch Size: 64
- Mixed Precision: Enabled using automatic mixed precision (AMP)
- Loss Function:
- Combined Dice Loss and Binary Cross-Entropy (BCE)
- Optimizes pixel-wise segmentation accuracy and boundary delineation
Data Pre-processing and Augmentation:
- Geometric Transformations: random shift, scale, and rotation, random resized crop, zoom-out augmentation, horizontal flip.
- Light, saturation, contrast, and color augmentations.
- Image Normalization
- Image Resizing: Longest side resized to 384 pixels, then padded to 384×384 square with constant padding
- Batch Size: 64 images per batch
Validation images receive only resizing, padding, and normalization without augmentation.
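A sketch of this validation-time preprocessing (longest side resized to 384 px, constant padding to a 384×384 square, normalization) is shown below, written in plain PyTorch for illustration; the actual transform library and the normalization statistics (ImageNet values assumed here) are not named in this report.

```python
# Sketch of the validation-time preprocessing described above: resize the
# longest side to 384 px, pad to a 384x384 square with constant (zero) padding,
# then normalize. Plain PyTorch for illustration; the pipeline's actual
# transform library is not named in this report. ImageNet stats are assumed.
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(image: torch.Tensor, size: int = 384) -> torch.Tensor:
    """image: float tensor (3, H, W) with values in [0, 1]."""
    _, h, w = image.shape
    scale = size / max(h, w)                          # longest side -> `size`
    new_h, new_w = round(h * scale), round(w * scale)
    image = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                          mode="bilinear", align_corners=False).squeeze(0)
    pad_right, pad_bottom = size - new_w, size - new_h
    image = F.pad(image, (0, pad_right, 0, pad_bottom))  # constant zero padding
    return (image - IMAGENET_MEAN) / IMAGENET_STD        # normalization
```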
Post-processing:
- Segmentation masks are generated by thresholding the model's output probabilities
- Binary predictions: pixels with probability ≥0.5 classified as skin, otherwise as non-skin
Performance Results
Performance is evaluated using Intersection over Union (IoU) and F1-score compared to expert-annotated reference standard skin masks. Success criteria are set as the average performance of state-of-the-art (SOTA) models.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| IoU | 0.97 (0.97-0.97) | ≥ 0.83 | PASS |
| F1-score | 0.98 (0.98-0.98) | ≥ 0.84 | PASS |
Verification and Validation Protocol
Test Design:
- 4515 clinical images from the validation split of the `clinical-set` dataset, which contains reliable mask annotations.
- Images are annotated by trained personnel.
- These images represent diverse skin conditions, anatomical sites, lighting conditions, and skin tone spectrums.
Complete Test Protocol:
- Input: Images of skin.
- Pre-processing: Image resizing to 384x384 pixels and normalization.
- Processing: Skin segmentation model inference.
- Output: Predicted binary mask with confidence scores.
- Reference standard: Expert-annotated binary mask.
- Statistical analysis: IoU and F1-score.
Data Analysis Methods:
- IoU.
- F1-score.
- Binary mask visualization.
Test Conclusions:
- The model met all success criteria, demonstrating reliable skin segmentation.
- The model demonstrates non-inferiority with respect to SOTA models.
- The model's performance is within acceptable limits.
- The model showed robustness across different imaging conditions, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Validation ensures accurate identification across the full Fitzpatrick spectrum.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Tone Analysis:
- Performance stratified by Fitzpatrick skin tones: I-II (light), III-IV (medium), V-VI (dark).
- Metrics evaluated: IoU and F1-score.
- Fitzpatrick success criteria: IoU ≥ 0.83; F1-score ≥ 0.84.
| Subpopulation | Num. training images | Num. validation images | IoU | F1-score | Outcome |
|---|---|---|---|---|---|
| Fitzpatrick I-II | 21228 | 1849 | 0.96 (0.96-0.97) | 0.98 (0.98-0.98) | PASS |
| Fitzpatrick III-IV | 13979 | 1798 | 0.98 (0.97-0.98) | 0.99 (0.99-0.99) | PASS |
| Fitzpatrick V-VI | 5867 | 868 | 0.97 (0.96-0.97) | 0.98 (0.98-0.98) | PASS |
Results Summary:
- The model met all success criteria, demonstrating reliable skin surface segmentation.
- The model presents consistent robustness across all skin tone subpopulations.
- The model demonstrates non-inferiority with respect to SOTA models.
- The model's performance is within acceptable limits.
Bias Mitigation Strategies:
- Image augmentation including geometric, contrast, saturation, and color augmentations.
- Weighted dataset sampling to ensure balanced representation of image conditions in the learning process.
- Pre-training on diverse data to improve generalization.
- Three-stage progressive training strategy to adapt the pre-trained encoder to the segmentation task.
Bias Analysis Conclusion:
- The model demonstrated consistent performance across different subpopulations.
- The model met all success criteria, demonstrating reliable skin surface segmentation.
Follicular and Inflammatory Pattern Identification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Follicular and Inflammatory Pattern Identification section
This model identifies three hidradenitis suppurativa (HS) patterns corresponding to the three phenotypes defined by the Martorell classification system (follicular, inflammatory, mixed).
Clinical Significance: Essential for diagnosing and characterizing follicular and inflammatory dermatoses and differentiating HS phenotypes.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping annotations were used to find 1259 images of hidradenitis suppurativa and 504 images of clear skin with no visible HS phenotype patterns.
Model-specific annotation: Each image was categorized as either one of the three possible HS phenotypes or the "no phenotype" supporting class. The annotation procedure for ordinal and categorical classification tasks is defined in R-TF-028-004 Data Annotation Instructions - Visual Signs.
Dataset statistics:
The dataset is split at patient level to avoid data leakage. The training, validation, and test sets contain images from different patients.
- Images of follicular phenotype: 271
- Images of inflammatory phenotype: 504
- Images of mixed phenotype: 484
- Images of clear skin: 504
- Training set: 1248 images
- Validation set: 257 images
- Test set: 258 images
- Total images: 1763
Training Methodology
The best classification backbone and architecture were determined after a thorough evaluation of several backbones suitable for the task at hand: EfficientNet, MobileNet, ResNet, and ConvNext.
Architecture: ConvNext V2 (base size)
- Deep learning model tailored for multi-class classification (follicular, mixed, inflammatory)
- Transfer learning from pre-trained weights (ImageNet dataset)
- Input size: 384x384 pixels
Given the complexity of the "Mixed" phenotype class, which includes both "Follicular" and "Inflammatory" patterns, the model was built for 2-class multi-label classification: it predicts whether follicular and/or inflammatory patterns are present or absent in the image. The probabilities are converted to binary outputs (1, positive or present; 0, negative or absent) using a probability threshold t: a given pattern is considered present if its corresponding probability is greater than or equal to t (a small helper sketching this mapping follows the list below).
Based on these predicted binary outputs, the final class can be derived:
- [0, 0] --> No phenotype visible
- [1, 0] --> Follicular pattern
- [0, 1] --> Inflammatory pattern
- [1, 1] --> Mixed phenotype
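A tiny helper illustrating this threshold-and-map logic; the default threshold of 0.60 reflects the validation-set search reported in the Performance Results below.

```python
# Tiny helper illustrating the threshold-and-map logic above. The default
# threshold of 0.60 reflects the validation-set search reported in the
# Performance Results below.
def hs_phenotype(p_follicular: float, p_inflammatory: float, t: float = 0.60) -> str:
    follicular = p_follicular >= t      # first binary output
    inflammatory = p_inflammatory >= t  # second binary output
    if follicular and inflammatory:
        return "mixed phenotype"        # [1, 1]
    if follicular:
        return "follicular pattern"     # [1, 0]
    if inflammatory:
        return "inflammatory pattern"   # [0, 1]
    return "no phenotype visible"       # [0, 0]
```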
Training approach:
The model has been trained using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.0001 and one-cycle learning rate scheduling for faster convergence
- Batch size: 32
- Training duration: 40 epochs
Pre-processing:
- In the training stage, input images were resized to 384x384 pixels via random cropping and resizing. In the validation and test stage, the inputs were directly resized to 384x384 pixels.
- Data augmentation: geometric, color, and light augmentations.
Performance Results
Performance is evaluated using Balanced Accuracy (BACC) and average F1 score. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). Success criteria are defined as BACC ≥ 0.65 and F1 ≥ 0.65. A threshold search was conducted on the validation set to obtain the best threshold value, and the following test results were obtained using that threshold (0.60).
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| BACC | 0.6837 (95% CI: [0.6287-0.7398]) | ≥ 0.65 | PASS |
| Average F1 score | 0.6976 (95% CI: [0.6457-0.7526]) | ≥ 0.65 | PASS |
Verification and Validation Protocol
Test Design:
- Compare predicted and ground truth labels on a separate, unseen set of images.
- Use a binary threshold derived from a threshold search on the validation data.
- Evaluation across diverse skin tones
Complete Test Protocol:
- Input: RGB images of HS and skin with no visible lesions, annotated by trained professionals
- Processing: Multi-label classification inference
- Output: Predicted probabilities for each pattern (follicular and inflammatory), converted to binary outputs (0/1) using the confidence threshold selected on the validation set, and finally converted to a multi-class classification output (4 classes: no phenotype, follicular, inflammatory, mixed).
- Ground truth: Expert-annotated labels
- Statistical analysis: Balanced accuracy and average F1 score
Data Analysis Methods:
- Balanced accuracy and average F1 score
Test Conclusions:
- The model met all success criteria, demonstrating reliable identification of HS patterns according to the Martorell phenotypes.
- The model showed robustness across different skin tones and severities, indicating generalizability.
Bias Analysis and Fairness Evaluation
Objective: Ensure phenotype identification works consistently across demographic subpopulations.
Subpopulation Analysis Protocol:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark)
- Success criterion: Balanced accuracy > 0.65 and average F1 score > 0.65, for all Fitzpatrick skin type groups.
| Skin type | Number of images | Balanced Accuracy | Average F1-score |
|---|---|---|---|
| I-II | 134 | 0.6536 (95% CI: [0.5700-0.7399]) | 0.6632 (95% CI: [0.5836-0.7428]) |
| III-IV | 105 | 0.6954 (95% CI: [0.5971-0.7799]) | 0.6976 (95% CI: [0.6038-0.7787]) |
| V-VI | 19 | 0.6840 (95% CI: [0.5000-1.0000]) | 0.9498 (95% CI: [0.8421-1.0000]) |
Results Summary:
- All Fitzpatrick skin types met the success criteria.
- Despite the current imbalance in skin tone representation, the model performs consistently across skin types, indicating effective generalization.
- Future data collection should prioritize expanding the dataset for underrepresented skin types to reduce confidence interval variability and improve overall model robustness.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training
- Pre-training on diverse data to improve generalization
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types, with all success criteria met.
Inflammatory Nodular Lesion Pattern Identification
Model Overview
Reference: R-TF-028-001 AI/ML Description - Inflammatory Pattern Identification section
This model identifies the Hurley stage and inflammatory pattern of inflammatory dermatological conditions.
Clinical Significance: Inflammatory affection categorization is essential for treatment planning and disease monitoring.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping, subset of 188 images from Manises-HS
Model-specific annotation: Image Categorization (R-TF-028-004 Data Annotation Instructions - Visual Signs)
A medical expert specialized in inflammatory nodular lesions categorized the images with:
- Hurley Stage Classification: One of four categories, comprising the three Hurley stages and a `Clear` category corresponding to no inflammatory visual signs.
- Inflammatory Activity Classification: One of two categories, inflammatory or non-inflammatory.
Dataset statistics:
The dataset is split at patient level to avoid data leakage. The training and validation sets contain images from different patients.
- Images: 188
- Number of subjects: 188
- Training set: 150 images, of which 148 contain valid Hurley annotations and 136 contain valid inflammatory activity annotations
- Validation set: 38 images, of which 37 contain valid Hurley annotations and 36 contain valid inflammatory activity annotations
Training Methodology
The model architecture and training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different image encoders (e.g., ConvNext and EfficientNet of different sizes) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate, metric learning). The final configuration was chosen as the best trade-off between performance and runtime efficiency.
Architecture:
The model is a multi-task neural network designed to predict Hurley stages and inflammatory activity simultaneously, while also generating embeddings for metric learning. It uses a shared backbone and common projection head, branching into specific heads for each task.
- Backbone (Encoder):
- Model: ConvNext Small, pre-trained on the ImageNet dataset.
- Regularization: dropout and drop path.
- Common Projection Head:
- A common processing block that maps encoder features to a shared latent space of 256 features.
- Consists of a GELU activation, Dropout, and a Linear layer.
- Task-Specific Heads: The model splits into two distinct branches, one for Hurley and one for Inflammatory Activity. Each branch receives the 256-dimensional output from the Common Projection Head and contains two sub-heads:
- Classification Head:
- A dedicated block (GELU, Dropout, Linear)
- Output size: 4 for Hurley and 2 for Inflammatory Activity.
- Metric Embedding Head:
- A multi-layer perceptron (two sequential blocks of GELU, Dropout, and Linear layers) that outputs feature embeddings.
- Output size: 256 features.
- Weight Initialization:
- Linear Layers: Xavier Normal initialization.
- Biases: Initialized to zero.
The model is implemented with PyTorch and the Python timm library.
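A minimal sketch, assuming the timm API, of the multi-task topology described above: a shared ConvNeXt-Small encoder, a common projection head, and per-task classification and metric-embedding heads. Layer sizes follow the text; the dropout rate is a hypothetical placeholder.

```python
# Minimal sketch (timm API assumed) of the multi-task topology described above.
# Layer sizes follow the text; the dropout rate is a hypothetical placeholder.
import timm
import torch
import torch.nn as nn

class MultiTaskHSModel(nn.Module):
    def __init__(self, latent_dim=256, dropout=0.2):
        super().__init__()
        # ConvNeXt Small backbone, ImageNet pre-trained, pooled features only
        self.backbone = timm.create_model("convnext_small", pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        # Common projection head: GELU -> Dropout -> Linear, to a 256-d latent space
        self.projection = nn.Sequential(nn.GELU(), nn.Dropout(dropout), nn.Linear(feat_dim, latent_dim))
        # Task-specific classification heads (4 Hurley classes, 2 activity classes)
        self.hurley_cls = nn.Sequential(nn.GELU(), nn.Dropout(dropout), nn.Linear(latent_dim, 4))
        self.activity_cls = nn.Sequential(nn.GELU(), nn.Dropout(dropout), nn.Linear(latent_dim, 2))
        # Metric embedding heads: two stacked GELU/Dropout/Linear blocks, 256-d output
        def embed_head():
            return nn.Sequential(
                nn.GELU(), nn.Dropout(dropout), nn.Linear(latent_dim, latent_dim),
                nn.GELU(), nn.Dropout(dropout), nn.Linear(latent_dim, latent_dim),
            )
        self.hurley_embed = embed_head()
        self.activity_embed = embed_head()

    def forward(self, x):
        z = self.projection(self.backbone(x))
        return {
            "hurley_logits": self.hurley_cls(z),
            "activity_logits": self.activity_cls(z),
            "hurley_embedding": self.hurley_embed(z),
            "activity_embedding": self.activity_embed(z),
        }
```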
Training approach:
The training process employs a multi-task learning strategy, optimizing for both classification accuracy and embedding quality. It utilizes a two-stage approach, starting with a frozen backbone followed by full model fine-tuning. It also incorporates data augmentation and mixed-precision training.
- Training Stages:
- Stage 1 (Frozen Backbone): Trains only the projection and task-specific heads for 15 epochs.
- Stage 2 (Fine-tuning): Trains the entire model for 30 epochs.
- Optimization:
- Optimizer: AdamW Schedule-Free with weight decay (0.01).
- Base LR: 0.0025
- Learning Rate: Includes a 4-epoch warmup. During fine-tuning, the backbone learning rate is scaled down (0.05x) relative to the heads.
- Gradient Clipping: Gradients are clipped to a norm of 0.5.
- Precision: Mixed precision training using BFloat16.
- Loss Functions:
- Classification: Cross-Entropy Loss, weighted to handle class imbalance.
- Metric Learning: NTXentLoss combined with a Batch Easy-Hard Miner (selecting easy positives and hard negatives).
Pre-processing:
- Augmentation: Includes geometric and color transformations.
- Regularization: MixUp is applied to inputs and labels.
- Input: Images are resized to 384x384 with a batch size of 32.
Post-processing:
- Classification probabilities are computed by applying the softmax operation over the classification logits.
- Classification categories are selected as the ones with the highest probability.
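The metric-learning objective from the training approach above (NTXentLoss with an easy-positive/hard-negative miner) could be wired up as sketched below, assuming the pytorch-metric-learning library; the temperature value is a placeholder.

```python
# Sketch of the metric-learning objective described above, assuming the
# pytorch-metric-learning library: NTXentLoss with a miner that selects easy
# positives and hard negatives. The temperature value is a placeholder.
import torch
from pytorch_metric_learning import losses, miners

loss_func = losses.NTXentLoss(temperature=0.07)
miner = miners.BatchEasyHardMiner(
    pos_strategy=miners.BatchEasyHardMiner.EASY,  # easy positives
    neg_strategy=miners.BatchEasyHardMiner.HARD,  # hard negatives
)

embeddings = torch.randn(32, 256)      # 256-d metric embeddings from one branch
labels = torch.randint(0, 4, (32,))    # e.g., Hurley stage labels
mined_pairs = miner(embeddings, labels)
metric_loss = loss_func(embeddings, labels, mined_pairs)
```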
Performance Results
Performance is evaluated using accuracy and Mean Absolute Error (MAE) for Hurley staging, and accuracy and AUC (ROC) for inflammatory activity. Success criteria are set as accuracy ≥ 0.40 and MAE ≤ 1 for Hurley staging, and accuracy ≥ 0.70 and AUC (ROC) ≥ 0.70 for inflammatory activity classification.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| Hurley Stage Accuracy | 0.63 (0.46-0.77) | ≥ 0.40 | PASS |
| Hurley MAE | 0.49 (0.29-0.77) | ≤ 1 | PASS |
| Inflammatory Activity Accuracy | 0.71 (0.57-0.86) | ≥ 0.70 | PASS |
| Inflammatory Activity AUC (ROC) | 0.71 (0.49-0.89) | ≥ 0.70 | PASS |
Verification and Validation Protocol
Test Design:
- Subset of 35 images with both Hurley stage and inflammatory activity annotations.
- Expert-annotator labels.
- Evaluation across diverse skin tones.
Complete Test Protocol:
- Input: RGB images from validation set with expert annotations
- Processing: Image classification inference
- Output: Classification probabilities and predicted categories
- Reference standard: Expert-annotated categories
- Statistical analysis: Accuracy, MAE, AUC (ROC)
Data Analysis Methods:
- Confusion matrix
- Accuracy, AUC (ROC), MAE
Test Conclusions:
- The model's Hurley stage prediction meets all the success criteria, demonstrating reliable performance.
- The model's Hurley stage prediction is within acceptable limits.
- The model's inflammatory activity prediction's mean values meet all the success criteria, demonstrating sufficient performance.
- The model's inflammatory activity prediction's confidence intervals do not meet the success criteria, suggesting the need for further data collection to improve the model learning and evaluation.
Bias Analysis and Fairness Evaluation
Objective: Ensure Hurley stage and inflammatory activity classification performs consistently across demographic subpopulations.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
- Success criterion: Accuracy ≥ 0.40 and MAE ≤ 1 for Hurley staging; Accuracy ≥ 0.70 and AUC (ROC) ≥ 0.70 for inflammatory activity.
- This evaluation includes an additional set of 22 images created semi-automatically by translating the main evaluation set to darker Fitzpatrick skin types with the Nano Banana AI tool. These images preserve the inflammatory nodular lesions but present a darker skin tone. This image set allows evaluation of the model's performance on Fitzpatrick V-VI skin types.
| Subpopulation | Num. training images | Num. validation images | Hurley Acc | Hurley MAE | Pattern Acc | Pattern AUC (ROC) | Outcome |
|---|---|---|---|---|---|---|---|
| Fitzpatrick I-II | 85 | 20 | 0.60 (0.40-0.80) | 0.54 (0.25-0.90) | 0.70 (0.50-0.90) | 0.72 (0.40-0.93) | PASS |
| Fitzpatrick III-IV | 68 | 15 | 0.67 (0.40-0.87) | 0.40 (0.13-0.67) | 0.74 (0.53-0.93) | 0.71 (0.33-0.96) | PASS |
| Fitzpatrick V-VI | 0 | 22 | 0.45 (0.23-0.64) | 0.82 (0.45-1.23) | 0.77 (0.59-0.95) | 0.72 (0.53-0.90) | PASS |
Results Summary:
- Hurley staging met all the success criteria across Fitzpatrick I-VI levels.
- Hurley staging presents confidence intervals within the acceptable limits, except for the Fitzpatrick V-VI subpopulation, whose confidence intervals fall outside the success criteria.
- Inflammatory activity identification mean values met all the success criteria across Fitzpatrick I-VI levels.
- Inflammatory activity identification confidence intervals exceed the acceptable limits, presumably due to the small number of images in the validation set.
- Future data collection and annotation should prioritize expanding the dataset to ensure a sufficient number of images for all subpopulations, reduce confidence interval variability, and improve model robustness for edge cases.
Bias Mitigation Strategies:
- Image augmentation including color, geometric and MixUp augmentations during training.
- Class-balancing to ensure equal representation of all classes.
- Use of metric learning to improve the model's ability to generalize to new data.
- Pre-training on diverse data to improve generalization
- Two-stage training to fit the model to the new data while benefiting from the image encoder pre-training.
Bias Analysis Conclusion:
- The model demonstrated consistent performance across Fitzpatrick skin types with all success criteria met.
- Inflammatory activity identification and Fitzpatrick V-VI subpopulations presented off-limits confidence intervals, highlighting the need for more data collection for more precise training and validation of the model.
- More data collection is required to validate the model with higher precision, especially for the Fitzpatrick V-VI subpopulations.
Dermatology Image Quality Assessment (DIQA)
Model Overview
Reference: R-TF-028-001 AI/ML Description - DIQA section
This model assesses image quality to filter out images unsuitable for clinical analysis, ensuring reliable downstream model performance.
Clinical Significance: DIQA is critical for patient safety by preventing low-quality images from being analyzed, which could lead to incorrect clinical assessments.
Data Requirements and Annotation
Data Requirements: A dermatology image subset was selected from the main dataset, and was annotated for image quality assessment (IQA), as described in R-TF-028-004 Data Annotation Instructions - Non-clinical data. This IQA-specific dataset was then expanded with other non-clinical image quality assessment datasets: CID2013, TID2013, CID:IQ, LIVE-ItW, NITSIQA, KonIQ-10k, kadid-10k, GFIQA-20k, SPAQ, and BIQ2021.
Dataset statistics:
The dataset has a total size of 85561 images.
- Images with artificial distortions: 18019
- Images with real distortions: 67542
- Non-dermatology images with quality ratings: 69058
- Dermatology images with quality ratings: 16503
Training Methodology
Architecture: EfficientNet-B0 pretrained on ImageNet. The default classification head was replaced with a regression head specifically designed for this IQA task.
Training approach:
- Score regression: the predicted output is a single scalar value that represents perceived visual quality.
- Loss function: Mean Squared Error (MSE). For a more stable training, the output of the model is compared to the normalized ground truth score.
- Data augmentation: The usual image augmentation methods (e.g. color jittering, rotation, etc.) may break the relationship between the images and their corresponding quality scores, so we used a low-augmentation setting, with only horizontal flips and slight random crops. The goal is to introduce some variability without affecting the image-score relationship.
- Training duration: 30 epochs with learning rate scheduling (cosine annealing).
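A minimal sketch of the regression setup described above: an ImageNet-pretrained EfficientNet-B0 whose classification head is replaced by a single-output regression head, trained with MSE against normalized scores. The timm API, optimizer, learning rate, and input resolution are assumptions.

```python
# Minimal sketch of the described setup: ImageNet-pretrained EfficientNet-B0
# with a single-output regression head, MSE loss against normalized scores,
# cosine annealing over 30 epochs. Optimizer, lr, and input size are assumed.
import timm
import torch
import torch.nn as nn

model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=1)  # scalar quality output

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
criterion = nn.MSELoss()  # compared against normalized ground-truth scores

images = torch.randn(8, 3, 224, 224)   # hypothetical batch
targets = torch.rand(8, 1)             # quality scores normalized to [0, 1]
loss = criterion(model(images), targets)
```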
Performance Results
Success criteria:
- Pearson correlation (PLCC) ≥ 0.70
- Spearman correlation (SROCC) ≥ 0.70
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| Pearson correlation | 0.8959 (95% CI: [0.8910-0.9002]) | ≥ 0.70 | PASS |
| Spearman correlation | 0.9030 (95% CI: [0.8982-0.9071]) | ≥ 0.70 | PASS |
Verification and Validation Protocol
Test Design:
- Test set with expert quality annotations across quality spectrum and acquisition settings
Complete Test Protocol:
- Input: Images with varying quality levels
- Processing: DIQA model inference
- Output: The scalar output is rescaled to the [0, 10] quality score range.
- Ground truth: Mean Opinion Scores (MOS) from annotation specialists
- Statistical analysis: Pearson and Spearman correlation
Data Analysis Methods:
- Pearson and Spearman correlation metrics are computed with confidence intervals using the bootstrapping method.
Test Conclusions:
- The model met all success criteria, demonstrating excellent performance and strong correlation with the quality ratings of a diverse sample of human observers.
Bias Analysis and Fairness Evaluation
To assess model and data bias, we selected the dermatology image subset of the dataset used for DIQA training and evaluated the model's predictions across Fitzpatrick skin types (FST).
Objective: Ensure DIQA performs consistently across populations without unfairly rejecting valid images.
Subpopulation Analysis Protocol:
1. Skin Type Analysis:
- Consistency for different Fitzpatrick skin types
- Ensure darker skin images aren't systematically rated lower quality
Bias Mitigation Strategies:
- Training on diverse imaging conditions and device types
- Balanced dataset across Fitzpatrick types, ensuring all distortions have occurrences on all demographic groups.
Results Summary:
| Fitzpatrick skin type | Num. images | PLCC | SROCC |
|---|---|---|---|
| I-II | 913 | 0.7551 (95% CI: [0.7192-0.7888]) | 0.7640 (95% CI: [0.7310-0.7954]) |
| III-IV | 659 | 0.6884 (95% CI: [0.6435-0.7316]) | 0.7065 (95% CI: [0.6620-0.7500]) |
| V-VI | 182 | 0.4736 (95% CI: [0.3634-0.5811]) | 0.4649 (95% CI: [0.3470-0.5783]) |
Bias Analysis Conclusion: The model shows moderate to strong correlation metrics on darker skin tones, with some bias towards lighter skin tones. A closer inspection of these results revealed that most FST IV-VI images corresponded to bad, poor and fair quality samples, for which the model predicts higher quality scores, hence the lower correlation metrics. The moderate correlation, however, demonstrates that the model is capable of estimating visual quality in such groups.
Domain Validation
Model Overview
Reference: R-TF-028-001 AI/ML Description - Domain Validation section
This model verifies that input images are within the validated domain (dermatological images, including clinical and dermoscopic) vs. non-skin images, preventing clinical models from processing invalid inputs.
Clinical Significance: Critical safety function preventing misuse and ensuring clinical models only analyze appropriate dermatological images.
Data Requirements and Annotation
Data Requirements:
A large subset of the dataset was reviewed and annotated to obtain domain-related labels, as described in R-TF-028-004 Data Annotation Instructions - Non-clinical data. Due to the heterogeneous nature of the dataset, it was possible to obtain labels of all three possible image types (clinical, dermoscopy, non-dermatology). As most images in the dataset are clinical or dermoscopic, the non-dermatology subset was expanded with external open image datasets, to account for as many examples of non-dermatology concepts as possible, such as:
- Paintings, posters, sketches, and screenshots;
- Retinal, MRI, colonoscopy, histology, and ultrasound images;
- Everyday objects, pets, and wildlife.
Dataset statistics: The final curated dataset presented the following distribution:
| Label | No. images |
|---|---|
| Clinical | 588008 |
| Dermoscopy | 125907 |
| Non-dermatology | 163425 |
| Total | 877340 |
Training Methodology
Architecture: EfficientNet-B0
- We used a model pretrained on ImageNet, discarding the original classification head and creating a new one for this three-class classification problem.
- Input size: 224x224x3 pixels (RGB images)
Training approach:
- Multi-class classification (clinical, dermoscopy, non-dermatology image)
- Loss function: Multi-class cross-entropy
- Class balancing: oversample the dermoscopy and non-dermatology images to balance
- Training duration:
- 5 epochs with frozen backbone to train the classification head only
- 5 epochs with the entire model unfrozen
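A sketch of this two-stage schedule follows: 5 epochs with the backbone frozen (training the new head only), then 5 epochs with all weights unfrozen. The timm API, optimizer choice, and learning rates are assumptions; `train_one_epoch` and `loader` are hypothetical helpers.

```python
# Illustrative two-stage schedule: 5 epochs head-only (frozen backbone), then
# 5 epochs full fine-tuning. timm usage, optimizer, and learning rates are
# assumptions; train_one_epoch and loader are hypothetical helpers.
import timm
import torch

model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=3)  # clinical, dermoscopy, non-dermatology

# Stage 1: freeze everything except the newly created classification head.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
# for epoch in range(5):
#     train_one_epoch(model, loader, optimizer)

# Stage 2: unfreeze the entire model and continue training.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for epoch in range(5):
#     train_one_epoch(model, loader, optimizer)
```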
Performance Results
Success criteria:
- Sensitivity ≥ 0.95 (correctly identify valid dermatological images)
- Specificity ≥ 0.99 (correctly reject non-dermatological images)
- False positive rate ≤ 1% (minimize incorrect rejections)
| Metric | Value | Criterion | Outcome |
|---|---|---|---|
| Non-dermatology precision | 0.9855 (95% CI: [0.9828-0.9882]) | ≥ 0.95 | PASS |
| Non-dermatology recall | 0.9978 (95% CI: [0.9967-0.9988]) | ≥ 0.90 | PASS |
| Clinical f1-score | 0.9975 (95% CI: [0.9973-0.9978]) | ≥ 0.90 | PASS |
| Dermoscopic f1-score | 0.9950 (95% CI: [0.9942-0.9957]) | ≥ 0.90 | PASS |
| Accuracy | 0.9965 (95% CI: [0.9961-0.9969]) | ≥ 0.95 | PASS |
| Macro avg f1-score | 0.9947 (95% CI: [0.9940-0.9953]) | ≥ 0.90 | PASS |
| Weighted avg f1-score | 0.9965 (95% CI: [0.9961-0.9969]) | ≥ 0.90 | PASS |
Verification and Validation Protocol
Test Design:
- A set of 81008 images, including clinical, dermoscopy, and non-dermatology images.
- The dermatology image subset is heterogeneous in terms of sex, age, and skin type.
Complete Test Protocol:
- Input: Mixed dataset of in-domain and out-of-domain images
- Processing: Multi-class classification
- Output: Probability vector
- Ground truth: Expert-confirmed domain labels (clinical, dermoscopic, non-dermatology)
- Statistical analysis: Precision, recall, F-1 score, accuracy with confidence intervals (bootstrap method).
Data Analysis Methods:
- Precision, recall, F-1 score, accuracy (with bootstrap confidence intervals)
Test Conclusions:
- The model met all success criteria, demonstrating excellent performance for all classes (clinical, dermoscopic, non-dermatology).
- Due to the simplicity of the task, it is possible to leverage a very small and lightweight model (EfficientNet-B0) that is capable of learning to separate all three classes.
Bias Analysis and Fairness Evaluation
To assess model and data bias, we filtered the previously mentioned set of 81008 images to keep only the clinical and dermoscopic images. The model's predictions were then evaluated across sex, age, and Fitzpatrick skin type (FST). The reason behind filtering this test set is that non-dermatology images cannot be categorized in terms of sex, age, and skin type.
Objective: Ensure domain validation doesn't unfairly reject valid dermatological images from any subpopulation.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Type Analysis:
- Equal F-1 score across all Fitzpatrick types
- Success criterion: No correlation between skin type and false rejection
2. Sex and Age Analysis:
- Consistent performance across sex and age groups
- Success criterion: No sex or age-specific rejection bias
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training.
- Pre-training on diverse data to improve generalization.
Results Summary:
| Group | clinical f1-score | dermoscopic f1-score | accuracy | weighted avg f1-score |
|---|---|---|---|---|
| I-II | 0.9972 (95% CI: [0.9968-0.9976]) | 0.9954 (95% CI: [0.9947-0.9961]) | 0.9962 (95% CI: [0.9956-0.9967]) | 0.9966 (95% CI: [0.9961-0.9971]) |
| III-IV | 0.9990 (95% CI: [0.9987-0.9993]) | 0.9915 (95% CI: [0.9875-0.9952]) | 0.9982 (95% CI: [0.9976-0.9987]) | 0.9987 (95% CI: [0.9983-0.9991]) |
| V-VI | 0.9857 (95% CI: [0.9809-0.9902]) | 0.7526 (95% CI: [0.5600-0.9092]) | 0.9714 (95% CI: [0.9622-0.9801]) | 0.9838 (95% CI: [0.9784-0.9887]) |
| Group | clinical f1-score | dermoscopic f1-score | accuracy | weighted avg f1-score |
|---|---|---|---|---|
| Newborn | 0.9790 (95% CI: [1.0000-1.0000]) | 0.9990 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) |
| Child | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) |
| Adolescent | 0.9946 (95% CI: [0.9858-1.0000]) | 0.9993 (95% CI: [0.9983-1.0000]) | 0.9987 (95% CI: [0.9969-1.0000]) | 0.9987 (95% CI: [0.9970-1.0000]) |
| Adult | 0.9990 (95% CI: [0.9988-0.9993]) | 0.9971 (95% CI: [0.9961-0.9979]) | 0.9985 (95% CI: [0.9981-0.9989]) | 0.9986 (95% CI: [0.9982-0.9990]) |
| Geriatric | 0.9986 (95% CI: [0.9981-0.9990]) | 0.9952 (95% CI: [0.9938-0.9966]) | 0.9976 (95% CI: [0.9969-0.9983]) | 0.9977 (95% CI: [0.9971-0.9984]) |
| Group | clinical f1-score | dermoscopic f1-score | accuracy | weighted avg f1-score |
|---|---|---|---|---|
| Female | 0.9981 (95% CI: [0.9977-0.9985]) | 0.9958 (95% CI: [0.9947-0.9969]) | 0.9973 (95% CI: [0.9966-0.9979]) | 0.9975 (95% CI: [0.9968-0.9980]) |
| Male | 0.9987 (95% CI: [0.9983-0.9990]) | 0.9956 (95% CI: [0.9946-0.9966]) | 0.9978 (95% CI: [0.9973-0.9983]) | 0.9979 (95% CI: [0.9974-0.9984]) |
Bias Analysis Conclusion:
- Overall, the model offers a robust performance across skin type, sex, and age.
- The lower proportion of dermoscopy images of dark skin limits model performance for those specific demographic groups under that imaging modality.
Head Detection
Model Overview
Reference: R-TF-028-001 AI/ML Description - Head Detection section
This AI model detects and localizes human heads in images.
Clinical Significance: Automated head detection enables precise head surface analysis by ensuring proper head-centered framing.
Data Requirements and Annotation
Foundational annotation: ICD-11 mapping (completed)
Model-specific annotation: Head detection (R-TF-028-024 Data Annotation Instructions - Non-clinical Data)
Images were annotated with tight rectangular bounding boxes around head regions. Each bounding box is defined by its four corner coordinates (x_min, y_min, x_max, y_max), delineating the region containing the head with minimal background.
Dataset statistics:
- Images with head annotations: 826 images of head with and without skin pathologies
- Training set: 661 images
- Validation set: 165 images
Training Methodology
Architecture: YOLOv8-S model
- Deep learning model tailored for single-class object detection.
- Transfer learning from pre-trained COCO weights
- Input size: 480x480 pixels
Training approach:
The model has been trained with the Ultralytics framework using the following hyperparameters:
- Optimizer: AdamW with learning rate 0.001 and cosine annealing scheduler
- Batch size: 16
- Training duration: 150 epochs with early stopping
Pre-processing:
- Input images were resized and padded to 480x480 pixels.
- Data augmentation: geometric, color, light, and mosaic augmentations.
Post-processing:
- Confidence threshold of 0.25 to filter low-confidence predictions.
- Non-maximum suppression (NMS) with IoU threshold of 0.7 to eliminate overlapping boxes.
Remaining hyperparameters are set to the default values of the Ultralytics framework.
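Given that the Ultralytics framework is named above, a minimal sketch of the reported configuration follows; the dataset YAML path, image source, and early-stopping patience are hypothetical placeholders, while the numeric values reflect the reported setup.

```python
# Minimal sketch using the Ultralytics framework named above. The dataset YAML
# and image paths are hypothetical; numeric values follow the reported setup.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # YOLOv8-S with COCO pre-trained weights
model.train(
    data="head_detection.yaml",  # hypothetical dataset configuration
    imgsz=480,                   # input resolution
    epochs=150,
    patience=50,                 # early stopping (patience value assumed)
    batch=16,
    optimizer="AdamW",
    lr0=0.001,
    cos_lr=True,                 # cosine annealing schedule
)

# Inference with the reported post-processing thresholds.
results = model.predict("example.jpg", conf=0.25, iou=0.7)
```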
Performance Results
Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for correct head localization. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as mAP@50 ≥ 0.86, representing detection performance superior to the average performance of published head detection studies.
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| mAP@50 | 0.99 (0.99-0.99) | ≥ 0.86 | PASS |
Verification and Validation Protocol
Test Design:
- Expert-annotated bounding boxes used as reference standard for validation.
- Evaluation across diverse skin tones and image quality levels.
Complete Test Protocol:
- Input: RGB images from validation set with expert head annotations
- Processing: Object detection inference with NMS
- Output: Predicted bounding boxes with confidence scores and head counts
- Reference standard: Expert-annotated boxes
- Statistical analysis: mAP@50 with 95% confidence intervals
Data Analysis Methods:
- Precision-Recall and F1-confidence curves
- mAP calculation at IoU=0.5 (mAP@50)
- Mean Absolute Error (MAE) between predicted and reference standard head counts
Test Conclusions:
- The model met all success criteria, demonstrating reliable head detection performance suitable for supporting image standardization workflows.
- The model demonstrates superior performance to the average performance of previously published head detection studies.
- The model's performance is within acceptable limits and shows excellent generalization.
Bias Analysis and Fairness Evaluation
Objective: Ensure head detection performs consistently across demographic subpopulations.
Subpopulation Analysis Protocol:
1. Fitzpatrick Skin Tone Analysis:
- Performance stratified by Fitzpatrick skin tones: I-II (light), III-IV (medium), V-VI (dark)
- Success criterion: mAP@50 ≥ 0.86 for all skin tone groups
| Subpopulation | Num. training samples | Num. val samples | mAP@50 | Outcome |
|---|---|---|---|---|
| Fitzpatrick I-II | 368 | 102 | 0.99 (0.99-0.99) | PASS |
| Fitzpatrick III-IV | 223 | 44 | 0.99 (0.97-0.99) | PASS |
| Fitzpatrick V-VI | 70 | 19 | 0.99 (0.99-0.99) | PASS |
Results Summary:
- The model demonstrated excellent performance across all Fitzpatrick skin tones, meeting all success criteria.
- No significant performance disparities were observed among skin tone categories.
- The model shows robust generalization across diverse skin tones.
Bias Mitigation Strategies:
- Image augmentation including color and lighting variations during training
- Pre-training on diverse data to improve generalization
- Balanced representation of skin tones in the training dataset
Bias Analysis Conclusion:
- The model demonstrated consistent and excellent performance across all Fitzpatrick skin tones, with all success criteria met.
- No performance disparities were observed, indicating fairness in head detection across diverse populations.
- The model is suitable for deployment in diverse clinical and telemedicine settings.
Summary and Conclusion
The development and validation activities described in this report provide objective evidence that the AI algorithms for Legit.Health Plus meet their predefined specifications and performance requirements.
Status of model development and validation:
- ICD Category Distribution and Binary Indicators: [Status to be updated]
- Visual Sign Intensity Models: [Status to be updated]
- Lesion Quantification Models: [Status to be updated]
- Surface Area Models: [Status to be updated]
- Non-Clinical Support Models: [Status to be updated]
The development process adhered to the company's QMS and followed Good Machine Learning Practices. Models meeting their success criteria are considered verified, validated, and suitable for release and integration into the Legit.Health Plus medical device.
State of the Art Compliance and Development Lifecycle
Software Development Lifecycle Compliance
The AI models in Legit.Health Plus were developed in accordance with state-of-the-art software development practices and international standards:
Applicable Standards and Guidelines:
- IEC 62304:2006+AMD1:2015 - Medical device software lifecycle processes
- ISO 13485:2016 - Quality management systems for medical devices
- ISO 14971:2019 - Application of risk management to medical devices
- ISO/IEC 25010:2011 - Systems and software quality requirements and evaluation (SQuaRE)
- FDA Guidance on Software as a Medical Device (SAMD) - Clinical evaluation and predetermined change control plans
- IMDRF/SaMD WG/N41 FINAL:2017 - Software as a Medical Device (SaMD): Clinical Evaluation
- Good Machine Learning Practice (GMLP) - FDA/Health Canada/UK MHRA Guiding Principles (2021)
- Proposed Regulatory Framework for Modifications to AI/ML-Based SaMD - FDA Discussion Paper (2019)
Development Lifecycle Phases Implemented:
- Requirements Analysis: Comprehensive AI model specifications defined in R-TF-028-001 AI/ML Description
- Development Planning: Structured development plan in R-TF-028-002 AI Development Plan
- Risk Management: AI-specific risk analysis in R-TF-028-011 AI Risk Matrix
- Design and Architecture: State-of-the-art architectures (Vision Transformers, CNNs, object detection, segmentation)
- Implementation: Following coding standards and version control practices
- Verification: Unit testing, integration testing, and algorithm validation
- Validation: Clinical performance testing against predefined success criteria
- Release: Version-controlled releases with complete traceability
- Maintenance: Post-market surveillance and performance monitoring
Version Control and Traceability:
- All model versions tracked in version control systems (Git)
- Complete traceability from requirements to validation results
- Dataset versions documented with checksums and provenance
- Model artifacts stored with complete training metadata
- Documented change control process for model updates
State of the Art in AI Development
Best Practices Implemented:
1. Data Management Excellence:
- Multi-source data collection with demographic diversity
- Rigorous data quality control and curation processes
- Systematic annotation protocols with multi-expert consensus
- Data partitioning strategies preventing data leakage
- Sequestered test sets for unbiased evaluation
2. Model Architecture Selection:
- Use of state-of-the-art architectures (Vision Transformers for classification, YOLO/Faster R-CNN for detection, U-Net/DeepLab for segmentation)
- Transfer learning from large-scale pre-trained models
- Architecture selection based on published benchmark performance
- Justification of architecture choices documented per model
3. Training Best Practices:
- Systematic hyperparameter optimization
- Cross-validation and early stopping to prevent overfitting
- Data augmentation for robustness and generalization
- Multi-task learning where clinically appropriate
- Monitoring of training metrics and convergence
4. Model Calibration and Post-Processing:
- Temperature scaling for probability calibration (see the sketch after this list)
- Test-time augmentation for robust predictions
- Ensemble methods where applicable
- Uncertainty quantification for model predictions
5. Comprehensive Validation:
- Independent test sets never used during development
- External validation on diverse datasets
- Clinical reference standard from expert consensus
- Statistical rigor with confidence intervals
- Comprehensive subpopulation analysis
6. Bias Mitigation and Fairness:
- Systematic bias analysis across demographic subpopulations
- Fitzpatrick skin type stratification in all analyses
- Data collection strategies ensuring demographic diversity
- Bias monitoring models (DIQA, Fitzpatrick identification)
- Transparent reporting of performance disparities
7. Explainability and Transparency:
- Attention visualization for model interpretability (where applicable)
- Clinical reasoning transparency (top-k predictions with probabilities)
- Documentation of model limitations and known failure modes
- Clear communication of uncertainty in predictions
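As an illustration of the calibration step listed under item 4 above, the following sketch fits a single temperature T by minimizing the negative log-likelihood on a held-out calibration split; the logits and labels shown are synthetic placeholders, not outputs of the device models.

```python
# Sketch of temperature scaling: fit a single temperature T on held-out
# calibration logits/labels by minimizing the negative log-likelihood,
# then divide logits by T at inference. Data below are synthetic placeholders.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T: float, logits: np.ndarray, labels: np.ndarray) -> float:
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 10))        # placeholder validation logits
labels = rng.integers(0, 10, size=500)     # placeholder validation labels

T = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels), method="bounded").x
calibrated_probs = softmax(logits / T)     # calibrated class probabilities
```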
Risk Management Throughout Lifecycle
Risk Management Process:
Risk management is integrated throughout the entire AI development lifecycle following ISO 14971:
1. Risk Analysis:
- Identification of AI-specific hazards (data bias, model errors, distribution shift)
- Hazardous situation analysis (incorrect predictions leading to clinical harm)
- Risk estimation combining probability and severity
2. Risk Evaluation:
- Comparison of risks against predefined acceptability criteria
- Benefit-risk analysis for each AI model
- Clinical impact assessment of potential errors
3. Risk Control:
- Inherent safety by design (offline models, no learning from deployment data)
- Protective measures (DIQA filtering, domain validation, confidence thresholds)
- Information for safety (user training, clinical decision support context)
4. Residual Risk Evaluation:
- Assessment of risks after control measures
- Verification that all risks reduced to acceptable levels
- Overall residual risk acceptability
5. Risk Management Review:
- Production and post-production information review
- Update of risk management file
- Traceability to safety risk matrix (R-TF-028-011 AI Risk Matrix)
AI-Specific Risk Controls:
- Data Quality Risks: Multi-source collection, systematic annotation, quality control
- Model Overfitting: Sequestered test sets, cross-validation, regularization
- Bias and Fairness: Demographic diversity, subpopulation analysis, bias monitoring
- Model Uncertainty: Calibration, confidence scores, uncertainty quantification
- Distribution Shift: Domain validation, DIQA filtering, performance monitoring
- Clinical Misinterpretation: Clear communication, clinical context, user training
Information Security
Cybersecurity Considerations:
The AI models are designed with information security principles integrated throughout development:
1. Model Security:
- Model parameters stored securely with access controls
- Model integrity verification (checksums, digital signatures)
- Protection against model extraction or reverse engineering
- Secure deployment pipelines
2. Data Security:
- Patient data protection throughout development (de-identification, anonymization)
- Secure data storage with encryption at rest
- Secure data transmission with encryption in transit
- Access controls and audit logging for training data
3. Inference Security:
- Secure API endpoints for model inference
- Input validation to prevent adversarial attacks
- Rate limiting and authentication
- Output validation and sanity checking
4. Privacy Considerations:
- No patient-identifiable information stored in models
- Training data anonymization and de-identification
- Compliance with GDPR, HIPAA, and applicable privacy regulations
- Data minimization principles applied
5. Vulnerability Management:
- Regular security assessments of AI infrastructure
- Dependency scanning for software libraries
- Patch management for underlying frameworks
- Incident response procedures
6. Adversarial Robustness:
- Consideration of adversarial attack scenarios
- Input preprocessing to detect anomalous inputs
- Domain validation to reject out-of-distribution inputs
- DIQA filtering to reject manipulated or low-quality images
Cybersecurity Risk Assessment:
Cybersecurity risks are addressed in the overall device risk management file, including:
- Threat modeling for AI components
- Attack surface analysis
- Mitigation strategies and security controls
- Monitoring and incident response
Verification and Validation Strategy
Verification Activities (confirming that the AI models implement their specifications):
- Code reviews and static analysis
- Unit testing of model components
- Integration testing of model pipelines
- Architecture validation against specifications
- Performance benchmarking against target metrics
Validation Activities (confirming that AI models meet intended use):
- Independent test set evaluation with sequestered data
- External validation on diverse datasets
- Clinical reference standard comparison
- Subpopulation performance analysis
- Real-world performance assessment
- Usability and clinical workflow validation
Documentation of Verification and Validation:
Complete documentation is maintained for all verification and validation activities:
- Test protocols with detailed methodology
- Complete test results with statistical analysis
- Data summaries and test conclusions
- Traceability from requirements to test results
- Identified deviations and their resolutions
This comprehensive approach ensures compliance with GSPR 17.2 requirements for software development in accordance with state of the art, incorporating development lifecycle management, risk management, information security, verification, and validation.
Integration Verification Package
To ensure that the AI models produce identical outputs when integrated into the Legit.Health Plus software environment as they did during development and validation, an Integration Verification Package has been prepared for each model in accordance with GP-028 AI Development.
Purpose
The Integration Verification Package enables the Software Development team to:
- Verify that models are correctly integrated without alterations to their inference behavior
- Detect any environment discrepancies that could affect model outputs
- Provide objective evidence of output equivalence between development and production environments
- Support regulatory compliance by demonstrating traceability between development validation and deployed system verification per IEC 62304
Package Location and Structure
All Integration Verification Packages are stored in the secure, version-controlled S3 bucket with the following structure:
s3://legit-health-plus/integration-verification/
├── icd-category-distribution/
│ ├── images/
│ ├── expected_outputs.csv
│ └── manifest.json
├── erythema-intensity/
│ ├── images/
│ ├── expected_outputs.csv
│ └── manifest.json
├── desquamation-intensity/
│ ├── images/
│ ├── expected_outputs.csv
│ └── manifest.json
├── induration-intensity/
│ ├── images/
│ ├── expected_outputs.csv
│ └── manifest.json
├── pustule-intensity/
│ ├── images/
│ ├── expected_outputs.csv
│ └── manifest.json
├── [... additional models ...]
├── diqa/
│ ├── images/
│ ├── expected_outputs.csv
│ └── manifest.json
└── domain-validation/
├── images/
├── expected_outputs.csv
└── manifest.json
Package Contents Per Model
For each AI model in the Legit.Health Plus device, the Integration Verification Package includes:
Reference Test Images
- Location: s3://legit-health-plus/integration-verification/{MODEL_NAME}/images/
- Content: A curated subset of images from the model's held-out test set
- Selection Criteria: Images representative of the model's input domain, including diverse conditions, demographics, and imaging modalities
- Format: Original image format (JPEG/PNG) without additional processing
Expected Outputs File
- Location: s3://legit-health-plus/integration-verification/{MODEL_NAME}/expected_outputs.csv
- Schema:
| Column | Type | Description |
|---|---|---|
| image_id | string | Unique identifier matching the image filename |
| expected_output | string/float | Model's expected output (JSON-encoded for complex outputs) |
| output_type | string | Output category: classification_probability, regression_value, segmentation_mask_hash, detection_boxes |
| preprocessing_hash | string | SHA-256 hash of the preprocessed input tensor |
- Generation: Outputs generated from the validated development model using the exact configuration documented in this report
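As an illustrative sketch, one row of expected_outputs.csv and its preprocessing_hash could be produced as follows, assuming the resize and ImageNet normalization recorded in the manifest shown in the next subsection; the image path and output value are placeholders, not production outputs.

```python
# Illustrative generation of one expected_outputs.csv row. `preprocess` mirrors
# the resize + ImageNet normalization recorded in the manifest; the image path
# and expected_output value are placeholders, not production outputs.
import csv
import hashlib
import json

import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    img = np.asarray(Image.open(path).convert("RGB").resize((272, 272)), dtype=np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)   # ImageNet mean
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)    # ImageNet std
    return (img - mean) / std

def tensor_sha256(x: np.ndarray) -> str:
    """SHA-256 hash of the preprocessed input tensor (preprocessing_hash column)."""
    return hashlib.sha256(np.ascontiguousarray(x).tobytes()).hexdigest()

x = preprocess("images/example_001.jpg")
row = {
    "image_id": "example_001",
    "expected_output": json.dumps({"score": 3.7}),  # JSON-encoded for complex outputs
    "output_type": "regression_value",
    "preprocessing_hash": tensor_sha256(x),
}
with open("expected_outputs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
```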
Verification Manifest
- Location: s3://legit-health-plus/integration-verification/{MODEL_NAME}/manifest.json
- Contents:
{
"model_name": "erythema-intensity",
"model_version": "1.0.0",
"package_version": "1.0.0",
"creation_timestamp": "2026-01-27T10:00:00Z",
"created_by": "AI Team",
"num_test_images": 100,
"model_weights_sha256": "abc123...",
"preprocessing": {
"resize": [272, 272],
"normalization": "imagenet",
"color_space": "RGB"
},
"acceptance_criteria": {
"metric": "output_tolerance",
"tolerance": 1e-5,
"pass_rate_required": 1.0
},
"development_report_reference": "R-TF-028-005 v1.0"
}
Acceptance Criteria
The following acceptance criteria apply to integration verification:
| Model Type | Metric | Acceptance Criterion |
|---|---|---|
| Classification (ICD, Binary Indicators) | Probability difference | ε ≤ 1e-5 per class |
| Intensity Quantification | Output score difference | ε ≤ 1e-5 |
| Segmentation | Mask IoU | ≥ 0.9999 |
| Detection | Box IoU + class match | IoU ≥ 0.9999, exact class match |
| Quality Assessment (DIQA) | Score difference | ε ≤ 1e-5 |
Overall Pass Criterion: 100% of test images must meet the acceptance criteria for the integration verification to pass.
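For scalar outputs (classification probabilities, intensity scores, DIQA scores), the comparison step can be sketched as follows. This assumes expected_output holds a plain numeric value and that both files share the schema above; segmentation and detection outputs require a separate IoU-based comparison, omitted here. File names are illustrative.

```python
# Sketch of the scalar-output comparison: each image passes if the absolute
# difference between expected and actual output is within the manifest
# tolerance, and the package passes only at a 100% pass rate.
import csv

TOLERANCE = 1e-5  # from manifest.json acceptance_criteria

def load_scores(path: str) -> dict[str, float]:
    with open(path, newline="") as f:
        return {row["image_id"]: float(row["expected_output"]) for row in csv.DictReader(f)}

expected = load_scores("expected_outputs.csv")
actual = load_scores("actual_outputs.csv")  # produced by the integrated model

failures = [i for i in expected if abs(expected[i] - actual[i]) > TOLERANCE]
pass_rate = 1 - len(failures) / len(expected)
print(f"Pass rate: {pass_rate:.2%} -> {'PASS' if not failures else 'FAIL'}")
```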
Model-Specific Package Details
The following table summarizes the Integration Verification Package for each model:
Clinical Models - ICD Classification and Binary Indicators
| Model | Output Type | Storage Path |
|---|---|---|
| ICD Category Distribution | Classification probabilities (346 classes) | icd-category-distribution/ |
| Binary Indicators | 6 probability scores | icd-category-distribution/ |
Clinical Models - Visual Sign Intensity Quantification
| Model | Output Type | Storage Path |
|---|---|---|
| Erythema Intensity | Regression (0-9 scale) | erythema-intensity/ |
| Desquamation Intensity | Regression (0-9 scale) | desquamation-intensity/ |
| Induration Intensity | Regression (0-9 scale) | induration-intensity/ |
| Pustule Intensity | Regression (0-9 scale) | pustule-intensity/ |
| Crusting Intensity | Regression (0-9 scale) | crusting-intensity/ |
| Xerosis Intensity | Regression (0-9 scale) | xerosis-intensity/ |
| Swelling Intensity | Regression (0-9 scale) | swelling-intensity/ |
| Oozing Intensity | Regression (0-9 scale) | oozing-intensity/ |
| Excoriation Intensity | Regression (0-9 scale) | excoriation-intensity/ |
| Lichenification Intensity | Regression (0-9 scale) | lichenification-intensity/ |
Clinical Models - Wound Characteristic Assessment
| Model | Output Type | Storage Path |
|---|---|---|
| Wound Edge: Diffused | Binary classification | wound-edge-diffused/ |
| Wound Edge: Thickened | Binary classification | wound-edge-thickened/ |
| Wound Edge: Delimited | Binary classification | wound-edge-delimited/ |
| Wound Edge: Indistinguishable | Binary classification | wound-edge-indistinguishable/ |
| Wound Edge: Damaged | Binary classification | wound-edge-damaged/ |
| Wound Tissue: Bone | Binary classification | wound-tissue-bone/ |
| Wound Tissue: Subcutaneous | Binary classification | wound-tissue-subcutaneous/ |
| Wound Tissue: Muscle | Binary classification | wound-tissue-muscle/ |
| Wound Tissue: Intact | Binary classification | wound-tissue-intact/ |
| Wound Tissue: Dermis-Epidermis | Binary classification | wound-tissue-dermis-epidermis/ |
| Wound Bed: Necrotic | Binary classification | wound-bed-necrotic/ |
| Wound Bed: Closed | Binary classification | wound-bed-closed/ |
| Wound Bed: Granulation | Binary classification | wound-bed-granulation/ |
| Wound Bed: Epithelial | Binary classification | wound-bed-epithelial/ |
| Wound Bed: Slough | Binary classification | wound-bed-slough/ |
| Wound Exudate: Serous | Binary classification | wound-exudate-serous/ |
| Wound Exudate: Fibrinous | Binary classification | wound-exudate-fibrinous/ |
| Wound Exudate: Purulent | Binary classification | wound-exudate-purulent/ |
| Wound Exudate: Bloody | Binary classification | wound-exudate-bloody/ |
| Perilesional Erythema | Binary classification | perilesional-erythema/ |
| Perilesional Maceration | Binary classification | perilesional-maceration/ |
| Biofilm Tissue | Binary classification | biofilm-tissue/ |
| Wound Stage Classification | Multi-class (6 stages) | wound-stage/ |
| Wound Intensity (AWOSI) | Regression (0-10 scale) | wound-awosi/ |
Clinical Models - Lesion Quantification
| Model | Output Type | Storage Path |
|---|---|---|
| Inflammatory Nodular Lesion | Detection (bounding boxes + count) | inflammatory-nodular/ |
| Acneiform Lesion Types | Multi-class detection (5 classes) | acneiform-lesion-types/ |
| Inflammatory Lesion | Detection (bounding boxes + count) | inflammatory-lesion/ |
| Hive Lesion | Detection (bounding boxes + count) | hive-lesion/ |
| Nail Lesion Surface | Segmentation mask | nail-lesion-surface/ |
Clinical Models - Surface Area Quantification
| Model | Output Type | Storage Path |
|---|---|---|
| Wound Bed Surface | Segmentation mask | wound-bed-surface/ |
| Wound Granulation Surface | Segmentation mask | wound-granulation-surface/ |
| Wound Biofilm/Slough Surface | Segmentation mask | wound-biofilm-surface/ |
| Wound Necrosis Surface | Segmentation mask | wound-necrosis-surface/ |
| Wound Maceration Surface | Segmentation mask | wound-maceration-surface/ |
| Wound Orthopedic Material Surface | Segmentation mask | wound-orthopedic-surface/ |
| Wound Bone/Cartilage/Tendon Surface | Segmentation mask | wound-bone-surface/ |
| Hair Loss Surface | Segmentation mask | hair-loss-surface/ |
| Hypopigmentation/Depigmentation | Segmentation mask | hypopigmentation-surface/ |
| Hyperpigmentation Surface | Segmentation mask | hyperpigmentation-surface/ |
| Erythema Surface | Segmentation mask | erythema-surface/ |
Clinical Models - Pattern Identification
| Model | Output Type | Storage Path |
|---|---|---|
| Acneiform Inflammatory Pattern | Regression (IGA 0-4 scale) | acneiform-pattern/ |
| Follicular and Inflammatory Pattern | Multi-class (Hurley stages) | follicular-inflammatory-pattern/ |
| Inflammatory Pattern | Classification | inflammatory-pattern/ |
| Inflammatory Pattern Indicator | Binary classification | inflammatory-pattern-indicator/ |
Non-Clinical Models
| Model | Output Type | Storage Path |
|---|---|---|
| DIQA | Quality score (0-1) | diqa/ |
| Domain Validation | Classification (3 classes) | domain-validation/ |
| Skin Surface Segmentation | Segmentation mask | skin-surface-segmentation/ |
| Body Surface Segmentation | Segmentation mask | body-surface-segmentation/ |
| Head Detection | Detection (bounding boxes) | head-detection/ |
Verification Procedure for Software Integration Team
The Software Development team shall follow this procedure after model integration:
1. Environment Preparation:
   - Configure the integration environment with dependencies specified in R-TF-028-006 AI Release Report
   - Download the Integration Verification Package from S3
   - Verify package integrity using manifest checksums
2. Inference Execution:
   - Process all reference test images through the integrated model
   - Record outputs in the same format as expected_outputs.csv
   - Document runtime environment configuration
3. Output Comparison:
   - Compare actual outputs against expected outputs using the acceptance criteria
   - Calculate the match rate for each image
   - Flag any discrepancies
4. Results Documentation:
   - Generate an Integration Verification Report including:
     - Test execution date and environment details
     - Pass/fail status per image
     - Overall pass rate
     - Any deviations with root cause analysis
   - Store the report as software verification evidence per IEC 62304
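A minimal sketch of the integrity check in step 1 follows: recompute the SHA-256 of the integrated model weights and compare it against the value recorded in manifest.json. The local weights filename is an assumption.

```python
# Sketch of the step-1 integrity check: recompute the SHA-256 of the
# integrated model weights and compare it against the value recorded in
# manifest.json. The local weights filename is an assumption.
import hashlib
import json

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open("manifest.json") as f:
    manifest = json.load(f)

assert sha256_of("model.pt") == manifest["model_weights_sha256"], \
    "Model weights do not match the validated development artifact"
```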
Traceability
| Artifact | Version | Reference |
|---|---|---|
| AI Development Report | 1.0 | This document |
| AI Release Report | 1.0 | R-TF-028-006 |
| Integration Verification Package | 1.0 | S3 bucket |
| Model Weights | Per model | See manifest.json |
The Integration Verification Package version is locked to the corresponding model version and AI Development Report. Any model retraining requires generation of a new Integration Verification Package.
AI Risks Assessment Report
AI Risk Assessment
A comprehensive risk assessment was conducted throughout the development lifecycle in accordance with the R-TF-028-002 AI Development Plan. All identified AI-specific risks related to data, model training, and performance were documented and analyzed in the R-TF-028-011 AI Risk Matrix.
AI Risk Treatment
Control measures were implemented to mitigate all identified risks. Key controls included:
- Rigorous data curation and multi-source collection to mitigate bias.
- Systematic model training and validation procedures to prevent overfitting.
- Use of a sequestered test set to ensure unbiased performance evaluation.
- Implementation of model calibration to improve the reliability of outputs.
Residual AI Risk Assessment
After the implementation of control measures, a residual risk analysis was performed. All identified AI risks were successfully reduced to an acceptable level.
AI Risk and Traceability with Safety Risk
Safety risks related to the AI algorithms (e.g., incorrect assessment suggestion, misinterpretation of data) were identified and traced back to their root causes in the AI development process. These safety risks have been escalated for management in the overall device Safety Risk Matrix, in line with ISO 14971.
Conclusion
The AI development process has successfully managed and mitigated inherent risks to an acceptable level. The benefits of using the Legit.Health Plus algorithms as a clinical decision support tool are judged to outweigh the residual risks.
Related Documents
Project Design and Plan
- R-TF-028-001 AI/ML Description - Complete specifications for all AI models
- R-TF-028-002 AI Development Plan - Development methodology and lifecycle
- R-TF-028-011 AI Risk Matrix - AI-specific risk assessment and mitigation
Data Collection and Annotation
- R-TF-028-003 Data Collection Instructions - Public datasets and clinical study data collection protocols
- R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping - Foundational clinical label standardization (completed)
- R-TF-028-004 Data Annotation Instructions - Visual Signs - Intensity, count, and extent annotations for visual sign models (completed)
- R-TF-028-004 Data Annotation Instructions - DIQA - Image quality assessment annotations (to be created)
- R-TF-028-004 Data Annotation Instructions - Fitzpatrick - Skin type annotations (to be created)
- R-TF-028-004 Data Annotation Instructions - Body Site - Anatomical location annotations (if needed)
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: JD-009
- Reviewer: JD-009
- Approver: JD-005