
R-TF-028-005 AI Development Report

Table of contents
  • Introduction
    • Context
    • Algorithms Description
    • AI Standalone Evaluation Objectives
  • Data Management
    • Overview
    • Data Collection
    • Foundational Annotation: ICD-11 Mapping
  • Model Development and Validation
    • ICD Category Distribution and Binary Indicators
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Erythema Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Desquamation Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Induration Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Pustule Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Crusting Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Xerosis Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Swelling Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Oozing Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Excoriation Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Lichenification Intensity Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Wound Characteristic Assessment
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Inflammatory Nodular Lesion Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Acneiform Lesion Type Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Hair Follicle Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Acneiform Inflammatory Lesion Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Hive Lesion Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Body Surface Segmentation
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Wound Surface Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Erythema Surface Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Hair Loss Surface Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Nail Lesion Surface Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Hypopigmentation or Depigmentation Surface Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Hyperpigmentation Surface Quantification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Skin Surface Segmentation
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Follicular and Inflammatory Pattern Identification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Inflammatory Nodular Lesion Pattern Identification
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Dermatology Image Quality Assessment (DIQA)
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Domain Validation
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
    • Head Detection
      • Model Overview
      • Data Requirements and Annotation
      • Training Methodology
      • Performance Results
      • Verification and Validation Protocol
      • Bias Analysis and Fairness Evaluation
  • Summary and Conclusion
  • State of the Art Compliance and Development Lifecycle
    • Software Development Lifecycle Compliance
    • State of the Art in AI Development
    • Risk Management Throughout Lifecycle
    • Information Security
    • Verification and Validation Strategy
  • Integration Verification Package
    • Purpose
    • Package Location and Structure
    • Package Contents Per Model
      • Reference Test Images
      • Expected Outputs File
      • Verification Manifest
    • Acceptance Criteria
    • Model-Specific Package Details
      • Clinical Models - ICD Classification and Binary Indicators
      • Clinical Models - Visual Sign Intensity Quantification
      • Clinical Models - Wound Characteristic Assessment
      • Clinical Models - Lesion Quantification
      • Clinical Models - Surface Area Quantification
      • Clinical Models - Pattern Identification
      • Non-Clinical Models
    • Verification Procedure for Software Integration Team
    • Traceability
  • AI Risks Assessment Report
    • AI Risk Assessment
    • AI Risk Treatment
    • Residual AI Risk Assessment
    • AI Risk and Traceability with Safety Risk
    • Conclusion
  • Related Documents
    • Project Design and Plan
    • Data Collection and Annotation

Introduction

Context

This report documents the development, verification, and validation of the AI algorithm package for the Legit.Health Plus medical device. The development process was conducted in accordance with the procedures outlined in GP-028 AI Development and followed the methodologies specified in the R-TF-028-002 AI Development Plan.

The algorithms are designed as offline (static) models. They were trained on a fixed dataset prior to release and do not adapt or learn from new data after deployment. This ensures predictable and consistent performance in the clinical environment.

Algorithms Description

The Legit.Health Plus device incorporates 59 AI models that work together to fulfill the device's intended purpose. A comprehensive description of all models, their clinical objectives, and performance specifications is provided in R-TF-028-001 AI/ML Description.

The AI algorithm package includes:

Clinical Models (directly fulfilling the intended purpose - 54 models):

  • ICD Category Distribution and Binary Indicators (1 model): Provides interpretative distribution of ICD-11 categories.
  • Visual Sign Intensity Quantification Models (10 models): Quantify the intensity of clinical signs including erythema, desquamation, induration, pustule, crusting, xerosis, swelling, oozing, excoriation, and lichenification.
  • Wound Characteristic Assessment (24 models): Evaluates wound tissue types, characteristics, exudate types, and perilesional conditions.
  • Lesion Quantification Models (5 models):
    • Inflammatory Nodular Lesion Quantification
    • Acneiform Lesion Type Quantification
    • Inflammatory Lesion Quantification
    • Hive Lesion Quantification
    • Hair Follicle Quantification
  • Surface Area Quantification Models (12 models):
    • Erythema Surface Quantification
    • Wound Bed Surface Quantification
    • Angiogenesis and Granulation Tissue Surface Quantification
    • Biofilm and Slough Surface Quantification
    • Necrosis Surface Quantification
    • Maceration Surface Quantification
    • Orthopedic Material Surface Quantification
    • Bone, Cartilage, or Tendon Surface Quantification
    • Hair Loss Surface Quantification
    • Nail Lesion Surface Quantification
    • Hypopigmentation or Depigmentation Surface Quantification
    • Hyperpigmentation Surface Quantification
  • Pattern Identification Models (2 models):
    • Follicular and Inflammatory Pattern Identification
    • Inflammatory Pattern Identification

Non-Clinical Models (supporting proper functioning - 5 models):

  • Domain Validation: Verifies images are within the validated, dermatology domain.
  • Dermatology Image Quality Assessment (DIQA): Ensures image quality is suitable for analysis.
  • Skin Surface Segmentation: Identifies skin regions for analysis.
  • Body Surface Segmentation: Segments body surface for BSA calculations.
  • Head Detection: Localizes heads for privacy and counting workflows.

Total: 54 Clinical Models + 5 Non-Clinical Models = 59 Models

This report focuses on the development methodology, data management processes, and validation results for all models. Each model shares a common data foundation but may require specific annotation procedures as detailed in the respective data annotation instructions.

AI Standalone Evaluation Objectives

The standalone validation aimed to confirm that all AI models meet their predefined performance criteria as outlined in R-TF-028-001 AI/ML Description.

Performance specifications and success criteria vary by model type and are detailed in the individual model sections of this report. All models were evaluated on independent, held-out test sets that were not used during training or model selection.

Data Management

Overview

The development of all AI models in the Legit.Health Plus device relies on a comprehensive dataset compiled from multiple sources and annotated through a multi-stage process. This section describes the general data management workflow that applies to all models, including collection, foundational annotation (ICD-11 mapping), and partitioning. Model-specific annotation procedures are detailed in the individual model sections.

Data Collection

The dataset was compiled from multiple distinct sources:

  • Archive Data: Images sourced from reputable online sources and private institutions, as detailed in R-TF-028-003 Data Collection Instructions - Archive Data.
  • Custom Gathered Data: Images collected under formal protocols at clinical sites, as detailed in R-TF-028-003 Data Collection Instructions - Custom Gathered Data.

This combined approach resulted in a comprehensive dataset covering diverse demographic characteristics (age, sex, Fitzpatrick skin types I-VI), anatomical sites, imaging conditions, and pathological conditions.

Dataset summary:

| Item | Value |
| --- | --- |
| Total ICD-11 categories | 850 |
| Total images | 280342 |
| Images of FST-1 | 89225 (31.83%) |
| Images of FST-2 | 91349 (32.58%) |
| Images of FST-3 | 59610 (21.26%) |
| Images of FST-4 | 23466 (8.37%) |
| Images of FST-5 | 11914 (4.25%) |
| Images of FST-6 | 4778 (1.70%) |
| Images of female | 52857 (18.85%) |
| Images of male | 55334 (19.74%) |
| Images of unspecified sex | 172151 (61.41%) |
| Images of Pediatric | 12829 (4.58%) |
| Images of Adult | 52694 (18.80%) |
| Images of Geriatric | 28350 (10.11%) |
| Images of unspecified age | 186469 (66.51%) |
| ID | Dataset Name | Type | Description | ICD-11 Mapping | Crops | Diff. Dx | Sex | Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Torrejon-HCP-diverse-conditions | Multiple | Dataset of skin images by physicians with good photographic skills | ✓ Yes | Varies | ✓ | ✓ | ✓ |
| 2 | Abdominal-skin | Archive | Small dataset of abdominal pictures with segmentation masks for `Non-specific lesion` class | ✗ No | Yes (programmatic) | — | — | — |
| 3 | Basurto-Cruces-Melanoma | Custom gathered | Clinical validation study dataset (`MC EVCDAO 2019`) | ✓ Yes | Yes (in-house crops) | — | ✓ | ✓ |
| 4 | BI-GPP (batch 1) | Archive | Small set of GPP images from Boehringer Ingelheim (first batch) | ✓ Yes | No | — | — | — |
| 5 | BI-GPP (batch 2) | Archive | Large dataset of GPP images from Boehringer Ingelheim (second batch) | ✓ Yes | Yes (programmatic) | — | ✓ | ✓ |
| 6 | Chiesa-dataset | Archive | Sample of head and neck lesions (Medela et al., 2024) | ✓ Yes | Yes (in-house crops) | — | ◐ | ◐ |
| 7 | Figaro 1K | Archive | Hair style classification and segmentation dataset, repurposed for `Non-specific finding` | ✗ No | Yes (in-house crops) | — | — | — |
| 8 | Hand Gesture Recognition (HGR) | Archive | Small dataset of hands repurposed for non-specific images | ✗ No | Yes (programmatic) | — | — | — |
| 9 | IDEI 2024 (pigmented) | Archive | Prospective and retrospective studies at IDEI (DERMATIA project), pigmented lesions only | ✓ Yes | Yes (programmatic) | — | ✓ | ◐ |
| 10 | Manises-HS | Archive | Large collection of hidradenitis suppurativa images | ✗ No | Not yet | — | ✓ | ✓ |
| 11 | Nails segmentation | Archive | Small nail segmentation dataset repurposed for `non-specific lesion` | ✗ No | Yes (programmatic) | — | — | — |
| 12 | Non-specific lesion V2 | Archive | Small representative collection repurposed for `non-specific lesion` | ✗ No | Yes (programmatic) | — | — | — |
| 13 | Osakidetza-derivation | Archive | Clinical validation study dataset (`DAO Derivación O 2022`) | ✓ Yes | Yes (in-house crops) | ◐ | ✓ | ✓ |
| 14 | Ribera ulcers | Archive | Collection of ulcer images from Ribera Salud | ✗ No | Yes (from wound masks, not all) | — | — | — |
| 15 | Transient Biometrics Nails V1 | Archive | Biometric dataset of nail images | ✗ No | Yes (programmatic) | — | — | — |
| 16 | Transient Biometrics Nails V2 | Archive | Biometric dataset of nail images | ✗ No | No (close-ups) | — | — | — |
| 17 | WoundsDB | Archive | Small chronic wounds database | ✓ Yes | No | — | ✓ | ◐ |
| 18 | Clinica Dermatologica Internacional - Acne | Custom gathered | Compilation of images from CDI's acne patients with IGA labels | ✓ Yes | No | — | — | — |
| 19 | Manises-DX | Archive | Large collection of images of different skin conditions | ✓ Yes | Not yet | — | — | — |

Total datasets: 55 | With ICD-11 mapping: 41

Legend: ✓ = Yes | ◐ = Partial/Pending | — = No

Foundational Annotation: ICD-11 Mapping

Before any model-specific training could begin, all clinical labels across all data sources were standardized to the ICD-11 classification system. This foundational annotation step is required for all models and is detailed in R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping.

The ICD-11 mapping process involved:

  1. Label Extraction: Extracting all unique clinical labels from each data source
  2. Standardization: Mapping source-specific labels (abbreviations, alternative spellings, legacy coding systems) to standardized ICD-11 categories
  3. Clinical Validation: Expert dermatologist review and validation of all mappings
  4. Visible Category Consolidation: Grouping ICD-11 codes that cannot be reliably distinguished based on visual features alone into unified "Visible ICD-11" categories. To handle images with no visible skin conditions (i.e., "clear" skin), a new Non-specific finding category was created; it is the only category without an associated ICD-11 code.

This standardization ensures:

  • Consistent reference standard across all data sources.
  • Clinical validity and regulatory compliance (ICD-11 is the WHO standard).
  • Proper handling of visually similar conditions that require additional clinical information for differentiation.
  • A unified clinical vocabulary for the ICD Category Distribution model and all other clinical models.
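
To make the mapping pipeline concrete, the toy sketch below shows how source-specific labels could be normalized to ICD-11 codes and consolidated into visible categories. All entries and helper names are illustrative examples, not the device's validated mapping tables (those are defined in R-TF-028-004).

```python
# Toy sketch of the ICD-11 mapping pipeline described above; entries are
# illustrative, not the validated mapping.
SOURCE_TO_ICD11 = {
    "bcc": "2C32",                # abbreviation mapped to basal cell carcinoma
    "basalioma": "2C32",          # alternative spelling, same ICD-11 category
    "psoriasis vulgaris": "EA90",
}
# Codes that cannot be told apart visually are grouped into one
# "Visible ICD-11" category (hypothetical grouping shown here).
VISIBLE_CONSOLIDATION = {"2C32": "2C32-visible-group"}
NON_SPECIFIC_FINDING = "Non-specific finding"  # only category with no ICD-11 code

def map_label(source_label: str | None) -> str:
    """Normalize a source-specific label to its consolidated visible category."""
    if source_label is None:                       # clear skin, nothing visible
        return NON_SPECIFIC_FINDING
    code = SOURCE_TO_ICD11[source_label.strip().lower()]
    return VISIBLE_CONSOLIDATION.get(code, code)
```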

Model Development and Validation

This section details the development, training, and validation of all AI models in the Legit.Health Plus device. Each model subsection includes:

  • Model-specific data annotation requirements
  • Training methodology and architecture
  • Performance evaluation results
  • Bias analysis and fairness considerations

ICD Category Distribution and Binary Indicators

Model Overview

Reference: R-TF-028-001 AI/ML Description - ICD Category Distribution and Binary Indicators section

The ICD Category Distribution model is a deep learning classifier that outputs a probability distribution across ICD-11 disease categories. The Binary Indicators are derived from this distribution using an expert-curated mapping matrix.

Models included:

  • ICD Category Distribution (outputs top-5 conditions with probabilities)
  • Binary Indicators (6 derived indicators):
    • Malignant
    • Pre-malignant
    • Associated with malignancy
    • Pigmented lesion
    • Urgent referral (≤48h)
    • High-priority referral (≤2 weeks)
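
As an illustration of how the six indicators are derived from the distribution, the sketch below collapses the ICD-11 probabilities through a 0/1 mapping matrix. The matrix contents, names, and shapes are placeholders for the expert-curated mapping described below.

```python
import numpy as np

NUM_CATEGORIES = 346  # selected ICD-11 categories (see dataset statistics)
INDICATORS = ["malignant", "pre-malignant", "associated-with-malignancy",
              "pigmented-lesion", "urgent-referral", "high-priority-referral"]

# mapping[i, j] = 1 if ICD-11 category i contributes to indicator j; e.g. the
# melanoma, SCC, and BCC rows would set the "malignant" column to 1.
mapping = np.zeros((NUM_CATEGORIES, len(INDICATORS)), dtype=np.float32)

def binary_indicator_scores(icd_probs: np.ndarray) -> dict[str, float]:
    """Sum the probabilities of the categories contributing to each indicator."""
    scores = icd_probs @ mapping
    return dict(zip(INDICATORS, scores.tolist()))
```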

Data Requirements and Annotation

Foundational annotation: ICD-11 mapping (as described in R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping)

Binary Indicator Mapping: A dermatologist-validated mapping matrix was created to link each ICD-11 category to the six binary indicators. This mapping defines which disease categories contribute to each indicator (e.g., melanoma, squamous cell carcinoma, and basal cell carcinoma all contribute to the "Malignant" indicator). A complete explanation of Binary Indicator Mapping can be found in R-TF-028-004 Data Annotation Instructions - Binary Indicator Mapping.

The result of the foundational annotation and binary indicator mapping is LegitHealth-DX, a dataset with high variability in category frequency (i.e., some categories have many more images than others). For each category, we split the corresponding images into a training, a validation, and a test set.

In addition to the ICD-11 and binary indicator mapping, we performed an extra annotation step to localize the skin condition in each image by drawing one or more bounding boxes enclosing the visible condition. This step was motivated by the use of random cropping during data augmentation: although random cropping is a common technique for increasing training diversity, in this scenario it carries a high risk of producing crops that do not contain the actual skin condition, leading to unreliable learning. By constraining random crops to the manually annotated boxes, we ensure that every training crop contains the labeled condition, as sketched below.
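
A minimal sketch of this box-guided cropping, assuming boxes stored as (left, top, right, bottom) pixel coordinates; the function name and jitter margin are illustrative, and the production augmentation pipeline may differ.

```python
import random
from PIL import Image

def box_guided_random_crop(img: Image.Image,
                           boxes: list[tuple[int, int, int, int]],
                           out_size: int = 384) -> Image.Image:
    """Crop around a randomly chosen annotated lesion box, with a small random
    margin, so every training crop contains the labeled condition."""
    left, top, right, bottom = random.choice(boxes)
    pad = max(1, int(0.1 * min(right - left, bottom - top)))  # jitter margin
    left = max(0, left - random.randint(0, pad))
    top = max(0, top - random.randint(0, pad))
    right = min(img.width, right + random.randint(0, pad))
    bottom = min(img.height, bottom + random.randint(0, pad))
    return img.crop((left, top, right, bottom)).resize((out_size, out_size))
```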

Finally, to ensure reliable performance of the ICD Category Distribution and Binary Indicators model, we only used classes from LegitHealth-DX that contain more than 3 images in every split (training/validation/test).

Dataset statistics:

| Item | Value |
| --- | --- |
| Total ICD-11 categories | 850 |
| Total images | 280342 |
| Clinical images | 194186 (69.27%) |
| Dermoscopic images | 86156 (30.73%) |
| Selected ICD-11 categories | 346 |
| Selected total images | 277415 (98.96%) |
| Images with annotated ROIs | 81451 (29.05%) |
| Training images | 193686 (69.09%) |
| Validation images | 48047 (17.14%) |
| Test images | 35726 (12.74%) |

Training Methodology

Pre-processing:

  • Data augmentation during training: bounding-box guided transformations (random erasing, random cropping), random rotations, color jittering, Gaussian noise, random Gaussian and motion blur, and histogram equalization (CLAHE). We also simulated domain-specific artifacts (dermoscopy shadows, ruler marks, and color patches) so that the model learns to be robust to them.
  • In all stages (training/validation/test), images were resized to 384x384 to fit the model's input requirements.

Architecture: ConvNext-V2 (base), with transfer learning from large-scale pre-trained weights. This was chosen as the best-performing architecture after comparing a baseline ResNet-50 against several alternatives: EfficientNet-V1, EfficientNet-V2, ConvNext-V2, ViT, and DenseNet.

Training:

  • Optimizer: AdamW
  • Loss function: Cross-entropy
  • Learning rate: the optimal learning rate is determined by an automatic range test, as proposed in Cyclical Learning Rates for Training Neural Networks (Smith, 2015); a one-cycle policy is then used for faster convergence (see the sketch after this list).
  • Training duration: 50 epochs
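
A minimal PyTorch sketch of this schedule follows; the model, step counts, and max_lr are placeholders, with max_lr standing in for the value found by the range test.

```python
import torch

model = torch.nn.Linear(1024, 346)   # placeholder for the ConvNext-V2 classifier head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
steps_per_epoch, epochs = 1000, 50   # 50 training epochs, as stated above
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                     # in practice, the output of the LR range test
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
)
# In the training loop, call optimizer.step() and then scheduler.step() per batch.
```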

Post-processing:

  • Temperature scaling for probability calibration, as described in On Calibration of Modern Neural Networks (Guo et al., 2017)
  • Test-time augmentation (TTA) for robust predictions: at inference time, the test image is augmented via rotation, horizontal and vertical flipping, and histogram equalization, and the predictions for the original image and its augmented views are aggregated into a final output (both steps are sketched below)
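
Both post-processing steps are sketched below under simplifying assumptions: a single scalar temperature fitted on the validation set, and a small set of view transforms for TTA. The value 1.5 is purely illustrative.

```python
import torch

def calibrated_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Temperature scaling (Guo et al., 2017): divide logits by a scalar T
    fitted on the validation set, then apply softmax."""
    return torch.softmax(logits / temperature, dim=-1)

@torch.no_grad()
def tta_predict(model, image: torch.Tensor, views, temperature: float = 1.5):
    """Average calibrated probabilities over the original image and its
    augmented views (e.g., rotations, flips, histogram equalization)."""
    batch = torch.stack([image] + [view(image) for view in views])
    return calibrated_probs(model(batch), temperature).mean(dim=0)
```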

Performance Results

ICD Category Distribution Performance:

| Metric | Result | Success criterion | Outcome |
| --- | --- | --- | --- |
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | >= 0.50 | PASS |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | >= 0.60 | PASS |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | >= 0.70 | PASS |

Binary Indicator Performance:

| Indicator | Result | Success criterion | Outcome |
| --- | --- | --- | --- |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | >= 0.80 | PASS |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | >= 0.80 | PASS |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | >= 0.80 | PASS |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | >= 0.80 | PASS |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | >= 0.80 | PASS |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | >= 0.80 | PASS |

Verification and Validation Protocol

Test Design:

  • Held-out test set sequestered from training and validation
  • Stratified sampling to ensure representation across ICD-11 categories
  • Independent evaluation on external datasets, with special focus on skin tone diversity

Complete Test Protocol:

  • Input: RGB images from the test set
  • Output: ICD-11 probability distribution and binary indicator scores
  • Reference standard comparison: Manually labeled ICD-11 categories and binary indicator mappings
  • Statistical analysis: Top-k accuracy, AUC-ROC with 95% confidence intervals

Data Analysis Methods:

  • Top-k accuracy calculation with bootstrapping (1000 runs) for confidence intervals; the resampling procedure is sketched after this list
  • ROC curve analysis and AUC calculation for binary indicators, with bootstrap confidence intervals (1000 runs)
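
The bootstrap procedure can be sketched as follows; the array names and shapes are assumptions (per-sample class probabilities and integer labels as NumPy arrays).

```python
import numpy as np

def topk_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    topk = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(labels, topk)]))

def bootstrap_topk_ci(probs, labels, k=1, runs=1000, seed=0):
    """95% CI from 1000 bootstrap resamples, matching the protocol above."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = [topk_accuracy(probs[idx], labels[idx], k)
             for idx in (rng.integers(0, n, n) for _ in range(runs))]
    return np.quantile(stats, [0.025, 0.975])
```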

Test Conclusions:

  • The model met all success criteria, demonstrating reliable performance for both skin disease recognition and binary indicator prediction.

Bias Analysis and Fairness Evaluation

Objective: Evaluate model performance across demographic subpopulations to identify and mitigate potential biases that could affect clinical safety and effectiveness.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Performance metrics (Top-k accuracy, AUC) disaggregated by Fitzpatrick types I-II, III-IV, and V-VI
  • Datasets: images from the hold-out test set with Fitzpatrick skin type annotations

2. Age Group Analysis:

  • Stratification: Pediatric (under 18 years), Adult (18-65 years), Geriatric (over 65 years)
  • Metrics: Top-k accuracy and AUC per age group
  • Data sources: images from the hold-out test set with age metadata

3. Sex/Gender Analysis:

  • Metrics: Top-k accuracy and AUC per sex group
  • Data sources: images from the hold-out test set with sex metadata

4. Image type analysis:

  • Performance metrics (Top-k accuracy, AUC) disaggregated by image type (clinical and dermoscopy)
  • Data sources: images from the hold-out test set (grouped by image type metadata)

Bias Mitigation Strategies:

  • Multi-source data collection ensuring visual diversity (demographics, acquisition settings, etc.)
  • Fitzpatrick skin type identification for bias monitoring
  • Data augmentation targeting underrepresented subgroups
  • Clinical validation with diverse patient populations

Results Summary:

1. Fitzpatrick Skin Type Analysis:

| Metric | Overall | FST I-II | FST III-IV | FST V-VI |
| --- | --- | --- | --- | --- |
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.6855 (95% CI: [0.6799 - 0.6911]) | 0.6146 (95% CI: [0.6056 - 0.6237]) | 0.5350 (95% CI: [0.5135 - 0.5566]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.8501 (95% CI: [0.8459 - 0.8546]) | 0.7740 (95% CI: [0.7655 - 0.7818]) | 0.6937 (95% CI: [0.6737 - 0.7142]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.8912 (95% CI: [0.8874 - 0.8950]) | 0.8221 (95% CI: [0.8146 - 0.8295]) | 0.7457 (95% CI: [0.7260 - 0.7654]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.9180 (95% CI: [0.9129 - 0.9227]) | 0.9194 (95% CI: [0.9101 - 0.9280]) | 0.8364 (95% CI: [0.7937 - 0.8771]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.8820 (95% CI: [0.8746 - 0.8892]) | 0.8786 (95% CI: [0.8676 - 0.8900]) | 0.8011 (95% CI: [0.7631 - 0.8399]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8622 (95% CI: [0.8537 - 0.8703]) | 0.8646 (95% CI: [0.8498 - 0.8791]) | 0.8579 (95% CI: [0.8261 - 0.8858]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9594 (95% CI: [0.9557 - 0.9629]) | 0.9441 (95% CI: [0.9395 - 0.9488]) | 0.9059 (95% CI: [0.8874 - 0.9239]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.9129 (95% CI: [0.8987 - 0.9256]) | 0.8843 (95% CI: [0.8684 - 0.9000]) | 0.8268 (95% CI: [0.7847 - 0.8648]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.8900 (95% CI: [0.8851 - 0.8947]) | 0.8834 (95% CI: [0.8760 - 0.8907]) | 0.8546 (95% CI: [0.8330 - 0.8768]) |

2. Age Group Analysis:

| Metric | Overall | Pediatric | Adult | Geriatric |
| --- | --- | --- | --- | --- |
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.8764 (95% CI: [0.8635 - 0.8895]) | 0.7104 (95% CI: [0.7017 - 0.7199]) | 0.6244 (95% CI: [0.6103 - 0.6371]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.9156 (95% CI: [0.9041 - 0.9262]) | 0.8583 (95% CI: [0.8517 - 0.8657]) | 0.8200 (95% CI: [0.8099 - 0.8297]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.9272 (95% CI: [0.9167 - 0.9375]) | 0.8980 (95% CI: [0.8922 - 0.9042]) | 0.8776 (95% CI: [0.8683 - 0.8864]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.7327 (95% CI: [0.5924 - 0.8706]) | 0.9104 (95% CI: [0.9022 - 0.9182]) | 0.8621 (95% CI: [0.8520 - 0.8726]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.9729 (95% CI: [0.9358 - 0.9941]) | 0.8935 (95% CI: [0.8766 - 0.9093]) | 0.8023 (95% CI: [0.7813 - 0.8230]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8142 (95% CI: [0.7204 - 0.8992]) | 0.8354 (95% CI: [0.8199 - 0.8499]) | 0.8368 (95% CI: [0.8228 - 0.8496]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9913 (95% CI: [0.9835 - 0.9971]) | 0.9847 (95% CI: [0.9808 - 0.9883]) | 0.9087 (95% CI: [0.8871 - 0.9284]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.9628 (95% CI: [0.9281 - 0.9833]) | 0.9002 (95% CI: [0.8755 - 0.9236]) | 0.8882 (95% CI: [0.8400 - 0.9306]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.9334 (95% CI: [0.9037 - 0.9574]) | 0.8834 (95% CI: [0.8753 - 0.8915]) | 0.8525 (95% CI: [0.8416 - 0.8633]) |

3. Sex/Gender Analysis:

| Metric | Overall | Male | Female |
| --- | --- | --- | --- |
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.7195 (95% CI: [0.7111 - 0.7290]) | 0.7143 (95% CI: [0.7049 - 0.7239]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.8625 (95% CI: [0.8560 - 0.8694]) | 0.8591 (95% CI: [0.8518 - 0.8665]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.9024 (95% CI: [0.8966 - 0.9083]) | 0.8988 (95% CI: [0.8924 - 0.9050]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.9214 (95% CI: [0.9147 - 0.9276]) | 0.9152 (95% CI: [0.9077 - 0.9228]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.8603 (95% CI: [0.8422 - 0.8777]) | 0.8973 (95% CI: [0.8828 - 0.9102]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8606 (95% CI: [0.8477 - 0.8727]) | 0.8485 (95% CI: [0.8351 - 0.8611]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9748 (95% CI: [0.9693 - 0.9802]) | 0.9871 (95% CI: [0.9839 - 0.9901]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.9149 (95% CI: [0.8855 - 0.9405]) | 0.8979 (95% CI: [0.8725 - 0.9231]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.9087 (95% CI: [0.9019 - 0.9153]) | 0.8915 (95% CI: [0.8839 - 0.8993]) |

4. Image type Analysis:

| Metric | Overall | Clinical | Dermoscopic |
| --- | --- | --- | --- |
| Top-1 accuracy | 0.6579 (95% CI: [0.6535 - 0.6625]) | 0.5985 (95% CI: [0.5923 - 0.6048]) | 0.7579 (95% CI: [0.7508 - 0.7648]) |
| Top-3 accuracy | 0.8208 (95% CI: [0.8171 - 0.8247]) | 0.7662 (95% CI: [0.7610 - 0.7717]) | 0.9126 (95% CI: [0.9078 - 0.9173]) |
| Top-5 accuracy | 0.8644 (95% CI: [0.8611 - 0.8679]) | 0.8173 (95% CI: [0.8126 - 0.8222]) | 0.9437 (95% CI: [0.9396 - 0.9473]) |
| AUC Malignant | 0.9180 (95% CI: [0.9136 - 0.9223]) | 0.9240 (95% CI: [0.9179 - 0.9301]) | 0.9079 (95% CI: [0.9015 - 0.9139]) |
| AUC Pre-malignant | 0.8781 (95% CI: [0.8721 - 0.8839]) | 0.8814 (95% CI: [0.8737 - 0.8889]) | 0.8733 (95% CI: [0.8626 - 0.8840]) |
| AUC Associated to malignancy | 0.8626 (95% CI: [0.8553 - 0.8696]) | 0.8636 (95% CI: [0.8545 - 0.8730]) | 0.8625 (95% CI: [0.8516 - 0.8723]) |
| AUC Is a pigmented lesion | 0.9590 (95% CI: [0.9566 - 0.9615]) | 0.9420 (95% CI: [0.9389 - 0.9451]) | 0.8170 (95% CI: [0.7745 - 0.8543]) |
| AUC Urgent referral | 0.8999 (95% CI: [0.8891 - 0.9105]) | 0.8798 (95% CI: [0.8690 - 0.8905]) | 0.8214 (95% CI: [0.7242 - 0.9133]) |
| AUC High-priority referral | 0.8876 (95% CI: [0.8838 - 0.8915]) | 0.8878 (95% CI: [0.8827 - 0.8927]) | 0.8842 (95% CI: [0.8777 - 0.8909]) |

Bias Analysis Conclusion:

  • In terms of image type, the model meets the expected performance goals, showing especially strong performance on dermoscopy images.
  • The model meets the performance goals for all age groups, with exceptional classification performance on pediatric subjects. Binary indicator prediction performance is excellent for all age groups.
  • The model meets the performance goals for all sexes, showing almost identical performance for both male and female subjects.
  • In terms of Fitzpatrick skin types, the model meets the performance goals for binary indicator prediction across all skin tones. For ICD-11 condition classification, all performance thresholds are met, although the model shows degraded performance on dark skin tones (FST V-VI).

Erythema Intensity Quantification

Model Overview

Reference: R-TF-028-001 AI/ML Description - Erythema Intensity Quantification section

This model quantifies erythema (redness) intensity on an ordinal scale (0-9), outputting a probability distribution that is converted to a continuous severity score via weighted expected value calculation.

Clinical Significance: Erythema is a cardinal sign of inflammation in numerous dermatological conditions including psoriasis, atopic dermatitis, and other inflammatory dermatoses.

Data Requirements and Annotation

Model-specific annotation: Erythema intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with erythema intensity scores following standardized clinical scoring protocols (e.g., Clinician's Erythema Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with erythema annotations: 5557
  • Training set: 90% of the erythema images plus 10% of healthy skin images
  • Validation set: 10% of the erythema images
  • Test set: 10% of the erythema images
  • Annotations variability:
    • Mean RMAE: 0.172
    • 95% CI: [0.154, 0.191]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, eczema, contact dermatitis, hidradenitis suppurativa, etc.

Training Methodology

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4, and higher input resolutions among those evaluated (224x224, 240x240, 272x272), showed marginal performance gains that did not justify the extra computational cost of running the model in production. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection, and Vision Transformer architectures also underperformed, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure a uniform class distribution across intensity levels (a sketch of this sampler follows this list). Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure a uniform class distribution. Other strategies (e.g., focal loss, weighted cross-entropy loss) were evaluated during model selection, with balanced sampling providing the best performance.
  • Backbone integration: A DeepLabV3+ module was combined with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy already provided sufficient class balance. Combined losses (e.g., cross-entropy + L2 loss) and label smoothing were also evaluated, with no significant performance improvements observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs, at which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers (e.g., cosine annealing, ReduceLROnPlateau) were evaluated during model selection, with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 was selected as the primary metric because it respects the ordinal nature of the intensity scale.
  • Model freezing: No layer freezing was applied. Freezing strategies were evaluated during model selection and showed a negative impact on performance, likely due to the domain gap between ImageNet and dermatology images.
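
A minimal sketch of the balanced sampler named above, assuming one intensity label per image; the tensor names and dataset are illustrative.

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.randint(0, 10, (5000,))       # stand-in intensity labels (0-9)
class_counts = torch.bincount(labels, minlength=10).float()
sample_weights = 1.0 / class_counts[labels]  # inverse class frequency per image
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
# DataLoader(dataset, batch_size=64, sampler=sampler) then yields batches
# with an approximately uniform distribution over the 10 intensity classes.
```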

Post-processing:

  • Softmax activation to obtain a probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities (sketched below)
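
A sketch of this conversion, assuming the model's raw 10-class logits as input:

```python
import numpy as np

def severity_score(logits: np.ndarray) -> float:
    """Continuous severity on [0, 9]: the expected value of the ordinal class
    index under the softmax distribution over the 10 intensity classes."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return float(probs @ np.arange(10))
```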

Performance Results

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 14% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.13 (0.119, 0.142) | 543 | ≤ 14% | PASS |

Verification and Validation Protocol

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert erythema intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous erythema severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with confidence intervals: Relative Mean Absolute Error comparing model predictions to expert consensus (sketched after this list)
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals
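
A sketch of the RMAE and bootstrap computation follows; normalizing the MAE by the 0-9 scale range is an assumption made here for illustration.

```python
import numpy as np

def rmae(preds: np.ndarray, refs: np.ndarray, scale_range: float = 9.0) -> float:
    """Relative MAE: mean absolute error normalized by the scale range
    (normalization choice assumed for illustration)."""
    return float(np.mean(np.abs(preds - refs)) / scale_range)

def bootstrap_rmae_ci(preds, refs, iters=2000, seed=0):
    """95% CI from 2000 bootstrap resamples, matching the protocol above."""
    rng = np.random.default_rng(seed)
    preds, refs = np.asarray(preds, float), np.asarray(refs, float)
    n = len(preds)
    stats = [rmae(preds[idx], refs[idx])
             for idx in (rng.integers(0, n, n) for _ in range(iters))]
    return np.quantile(stats, [0.025, 0.975])
```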

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.13 (95% CI: 0.119-0.142), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation

Objective: Ensure erythema quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for erythema):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.124 (0.111, 0.141) | 293 | ≤ 14% | PASS |
| RMAE Fitzpatrick III-IV | 0.135 (0.12, 0.152) | 207 | ≤ 14% | PASS |
| RMAE Fitzpatrick V-VI | 0.142 (0.098, 0.191) | 43 | ≤ 14% | PASS |
| RMAE Mild Severity (0-3) | 0.149 (0.119, 0.183) | 98 | ≤ 14% | PASS |
| RMAE Moderate Severity (4-6) | 0.138 (0.124, 0.155) | 236 | ≤ 14% | PASS |
| RMAE Severe Severity (7-9) | 0.112 (0.095, 0.13) | 209 | ≤ 14% | PASS |

Bias Analysis Conclusion:

The erythema intensity quantification model demonstrates a high degree of clinical potential, with its performance successfully benchmarked against a stringent success criterion of RMAE ≤ 14%, derived from inter-annotator variability. Crucially, the model's performance is strongest and most certain in the Severe severity category, where both the mean RMAE (0.112) and the entire 95% CI (0.095-0.130) lie below the 14% threshold, confirming statistically robust and highly precise quantification in critical cases. Furthermore, the mean RMAE for the three largest subgroups, Fitzpatrick I-II (0.124), Fitzpatrick III-IV (0.135), and Moderate severity (0.138), is in each case below the 14% criterion, establishing a strong foundation of average accuracy across the primary populations. The 95% CI lower bound for every subgroup, including the less represented Fitzpatrick V-VI (0.098) and Mild severity (0.119) groups, falls below the 14% target. This indicates that the model's performance is consistently comparable to or better than expert variability across all tested strata.

Desquamation Intensity Quantification

Model Overview

Reference: R-TF-028-001 AI/ML Description - Desquamation Intensity Quantification section

This model quantifies desquamation (scaling/peeling) intensity on an ordinal scale (0-9), critical for assessment of psoriasis, seborrheic dermatitis, and other scaling conditions.

Clinical Significance: Desquamation is a key indicator in many inflammatory dermatoses.

Data Requirements and Annotation

Model-specific annotation: Desquamation intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with desquamation intensity scores following standardized clinical scoring protocols (e.g., Clinician's Desquamation Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with desquamation annotations: 4879
  • Training set: 90% of the desquamation images plus 10% of healthy skin images
  • Validation set: 10% of the desquamation images
  • Test set: 10% of the desquamation images
  • Annotations variability:
    • Mean RMAE: 0.202
    • 95% CI: [0.178, 0.226]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4, and higher input resolutions among those evaluated (224x224, 240x240, 272x272), showed marginal performance gains that did not justify the extra computational cost of running the model in production. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection, and Vision Transformer architectures also underperformed, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure a uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure a uniform class distribution. Other strategies (e.g., focal loss, weighted cross-entropy loss) were evaluated during model selection, with balanced sampling providing the best performance.
  • Backbone integration: A DeepLabV3+ module was combined with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy already provided sufficient class balance. Combined losses (e.g., cross-entropy + L2 loss) and label smoothing were also evaluated, with no significant performance improvements observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs, at which point the model had fully converged and the evaluation metrics on the validation set had stabilized.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers (e.g., cosine annealing, ReduceLROnPlateau) were evaluated during model selection, with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 was selected as the primary metric because it respects the ordinal nature of the intensity scale.
  • Model freezing: No layer freezing was applied. Freezing strategies were evaluated during model selection and showed a negative impact on performance, likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.153 (0.139, 0.167) | 475 | ≤ 17% | PASS |

Verification and Validation Protocol

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert desquamation intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous desquamation severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.153 (95% CI: 0.139-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation

Objective: Ensure desquamation quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for desquamation):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.156 (0.136, 0.176) | 255 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.154 (0.131, 0.176) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.118 (0.077, 0.162) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.140 (0.121, 0.161) | 231 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.161 (0.134, 0.189) | 119 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.167 (0.139, 0.199) | 125 | ≤ 17% | PASS |

Bias Analysis Conclusion:

The desquamation quantification model demonstrates robust and highly reliable performance, consistently meeting the demanding success criterion of RMAE ≤ 17% derived from inter-annotator variability. The critical PASS criterion, defined by the 95% CI lower bound of the model's RMAE being below 17%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The model also establishes excellent average accuracy, with the mean RMAE for all subgroups, including the largest cohorts, Fitzpatrick I-II (0.156) and Mild severity (0.140), positioned below the 17% criterion. Notably, the mean RMAE for the Fitzpatrick V-VI group (0.118) is well below the criterion. This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of desquamation across all demographic and severity ranges.

Induration Intensity Quantification

Model Overview

Reference: R-TF-028-001 AI/ML Description - Induration Intensity Quantification section

This model quantifies induration (plaque thickness/elevation) on an ordinal scale (0-9).

Clinical Significance: Induration reflects tissue infiltration and is a key component of psoriasis severity assessment.

Data Requirements and Annotation

Model-specific annotation: Induration intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with induration intensity scores following standardized clinical scoring protocols (e.g., Clinician's Induration Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with induration annotations: 4499
  • Training set: 90% of the induration images plus 10% of healthy skin images
  • Validation set: 10% of the induration images
  • Test set: 10% of the induration images
  • Annotations variability:
    • Mean RMAE: 0.178
    • 95% CI: [0.159, 0.199]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4, and higher input resolutions among those evaluated (224x224, 240x240, 272x272), showed marginal performance gains that did not justify the extra computational cost of running the model in production. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection, and Vision Transformer architectures also underperformed, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric due to the ordinal nature of the scoring task.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
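A minimal sketch of this post-processing step (PyTorch; the function name is illustrative):

```python
import torch

def continuous_severity_score(logits: torch.Tensor) -> torch.Tensor:
    """Map 10-class logits to a continuous 0-9 severity score."""
    # Softmax yields a probability distribution over the intensity classes.
    probs = torch.softmax(logits, dim=-1)  # shape: (batch, 10)
    # The continuous score is the probability-weighted expected class index.
    classes = torch.arange(probs.shape[-1], dtype=probs.dtype, device=probs.device)
    return (probs * classes).sum(dim=-1)  # shape: (batch,)
```

For example, a distribution split evenly between classes 4 and 5 yields a score of 4.5, preserving ordinal information that a hard argmax would discard.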

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.151 (0.137, 0.167) | 437 | ≤ 17% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert induration intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous induration severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals
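A minimal sketch of the RMAE and bootstrap computation (NumPy). The normalization of the mean absolute error by the 0-9 scale range is an assumption for illustration; the report does not spell out the exact formula:

```python
import numpy as np

def rmae(pred: np.ndarray, ref: np.ndarray, scale: float = 9.0) -> float:
    # Assumption: RMAE = mean absolute error normalized by the 0-9 scale range.
    return float(np.mean(np.abs(pred - ref)) / scale)

def rmae_with_ci(pred, ref, n_iter=2000, alpha=0.05, seed=0):
    """Point estimate plus a bootstrap 95% CI (2000 resampling iterations)."""
    rng = np.random.default_rng(seed)
    n = len(pred)
    stats = []
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        stats.append(rmae(pred[idx], ref[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return rmae(pred, ref), (float(lo), float(hi))
```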

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.151 (95% CI: 0.137-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure induration quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for induration):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.130 (0.111, 0.148) | 217 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.178 (0.152, 0.204) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.141 (0.101, 0.189) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.138 (0.122, 0.156) | 256 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.176 (0.150, 0.204) | 120 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.158 (0.107, 0.219) | 61 | ≤ 17% | PASS |

Bias Analysis Conclusion:

The induration quantification model demonstrates robust and reliable performance against the success criterion of RMAE ≤ 17%, which is derived from the inter-annotator variability. The PASS criterion, defined by the lower bound of the 95% CI of the model's RMAE falling below 17%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The mean RMAE for four of the six subgroups, including the largest cohorts, Fitzpatrick I-II (0.130) and Mild Severity (0.138), falls below the 17% criterion, and the mean RMAE for the Fitzpatrick V-VI group (0.141) is also well below it. For the two subgroups with the highest mean RMAE, Fitzpatrick III-IV (0.178) and Moderate Severity (0.176), the 95% CI lower bounds (0.152 and 0.150, respectively) remain below the 17% threshold. This statistical consistency provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of induration across all demographic and severity ranges.

Pustule Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Pustule Intensity Quantification section

This model quantifies pustule intensity/density on an ordinal scale (0-9), critical for pustular psoriasis, acne, and other pustular dermatoses.

Clinical Significance: Pustule density reflects active pustular inflammation and is a key component of severity assessment in pustular psoriasis and acne.

Data Requirements and Annotation​

Model-specific annotation: Pustule intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with pustule intensity scores following standardized clinical scoring protocols (e.g., Clinician's Pustule Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with pustule annotations: 380
  • Training set: 90% of the pustule images plus 10% of healthy skin images
  • Validation set: 10% of the pustule images
  • Test set: 10% of the pustule images
  • Annotations variability:
    • Mean RMAE: 0.300
    • 95% CI: [0.191, 0.427]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, and among the evaluated input resolutions (224x224, 240x240, and 272x272), 272x272 offered the best trade-off. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution; a minimal sketch of this sampler follows this list. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric due to the ordinal nature of the scoring task.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
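A minimal sketch of the balanced sampling strategy referenced in the list above, using PyTorch's WeightedRandomSampler with inverse class-frequency weights (`labels` and the commented-out `dataset` are illustrative):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Per-image intensity labels (0-9); illustrative values.
labels = np.asarray([0, 0, 1, 3, 3, 3, 9])

# Inverse-frequency weights so each class is drawn roughly uniformly.
class_counts = np.bincount(labels, minlength=10)
class_weights = 1.0 / np.maximum(class_counts, 1)  # guard against empty classes
sample_weights = torch.as_tensor(class_weights[labels], dtype=torch.double)

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
# loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```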

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 30% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.19 (0.123, 0.269) | 38 | ≤ 30% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert pustule intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous pustule severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.19 (95% CI: 0.123-0.269), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure pustule quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for pustule):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.158 (0.09, 0.226) | 26 | ≤ 30% | PASS |
| RMAE Fitzpatrick III-IV | 0.259 (0.111, 0.426) | 12 | ≤ 30% | PASS |
| RMAE Fitzpatrick V-VI | - | 0 | ≤ 30% | N/A |
| RMAE Mild Severity (0-3) | 0.143 (0.016, 0.302) | 14 | ≤ 30% | PASS |
| RMAE Moderate Severity (4-6) | 0.222 (0.130, 0.296) | 6 | ≤ 30% | PASS |
| RMAE Severe Severity (7-9) | 0.216 (0.130, 0.309) | 18 | ≤ 30% | PASS |

Bias Analysis Conclusion:

The pustule quantification model exhibits strong performance against the success criterion of RMAE ≤ 30%, which is derived from the inter-annotator variability. The PASS criterion, defined by the lower bound of the 95% CI of the model's RMAE falling below 30%, is achieved by the five subgroups with available data. This confirms a foundational level of reliability and low bias, as the model consistently achieves accuracy comparable to or better than expert variability across the tested strata. Notably, the mean RMAE for the well-represented Fitzpatrick I-II (0.158) and Mild Severity (0.143) subgroups is substantially below the 30% criterion, establishing excellent average accuracy in the primary populations. The current absence of test data for the Fitzpatrick V-VI stratum highlights the need for future targeted sampling to ensure comprehensive clinical validation.

Crusting Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Crusting Intensity Quantification section

This model quantifies crusting severity on an ordinal scale (0-9).

Clinical Significance: Crusting is a key clinical sign in various dermatological conditions, indicating disease activity and severity.

Data Requirements and Annotation​

Model-specific annotation: Crusting intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with crusting intensity scores following standardized clinical scoring protocols (e.g., Clinician's Crusting Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with crusting annotations: 1999
  • Training set: 90% of the crusting images plus 10% of healthy skin images
  • Validation set: 10% of the crusting images
  • Test set: 10% of the crusting images
  • Annotations variability:
    • Mean RMAE: 0.202
    • 95% CI: [0.178, 0.226]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, and among the evaluated input resolutions (224x224, 240x240, and 272x272), 272x272 offered the best trade-off. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training; a minimal sketch of this schedule follows this list. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric due to the ordinal nature of the scoring task.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
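A minimal sketch of the learning-rate schedule described above; the gamma value follows directly from the stated target of decaying to 1e-2 of the initial rate over the 400 training epochs:

```python
import torch
from torch.optim.lr_scheduler import StepLR

epochs = 400
# gamma**epochs == 0.01, so the LR ends at 1% of its starting value.
gamma = 0.01 ** (1.0 / epochs)  # ~0.98855

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)
scheduler = StepLR(optimizer, step_size=1, gamma=gamma)

for epoch in range(epochs):
    # ... one training epoch ...
    scheduler.step()  # decay the learning rate once per epoch
```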

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 20% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.153 (0.139, 0.167) | 475 | ≤ 20% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert crusting intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous crusting severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.153 (95% CI: 0.139-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure crusting quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for crusting):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.156 (0.136, 0.176) | 255 | ≤ 20% | PASS |
| RMAE Fitzpatrick III-IV | 0.154 (0.131, 0.176) | 187 | ≤ 20% | PASS |
| RMAE Fitzpatrick V-VI | 0.118 (0.077, 0.162) | 33 | ≤ 20% | PASS |
| RMAE Mild Severity (0-3) | 0.140 (0.121, 0.161) | 231 | ≤ 20% | PASS |
| RMAE Moderate Severity (4-6) | 0.161 (0.134, 0.189) | 119 | ≤ 20% | PASS |
| RMAE Severe Severity (7-9) | 0.167 (0.139, 0.199) | 125 | ≤ 20% | PASS |

Bias Analysis Conclusion:

The crusting quantification model demonstrates reliable performance and clinical viability, consistently meeting the success criterion of RMAE ≤ 20%, a benchmark established from the inter-annotator variability. The PASS criterion, defined by the lower bound of the 95% CI of the model's RMAE falling below 20%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The model also establishes excellent average accuracy: the mean RMAE for all six subgroups, including the larger Fitzpatrick I-II (0.156) and Mild Severity (0.140) cohorts, falls below the 20% criterion. Notably, the Fitzpatrick V-VI group exhibits the lowest mean RMAE (0.118). This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of crusting across all demographic and severity ranges.

Xerosis Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Xerosis Intensity Quantification section

This model quantifies xerosis (dry skin) severity on an ordinal scale (0-9), fundamental for skin barrier assessment.

Clinical Significance: Xerosis reflects impaired skin barrier function and is a key component of severity assessment in conditions such as atopic dermatitis.

Data Requirements and Annotation​

Model-specific annotation: Xerosis intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with xerosis intensity scores following standardized clinical scoring protocols (e.g., Clinician's Xerosis Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with xerosis annotations: 1999
  • Training set: 90% of the xerosis images plus 10% of healthy skin images
  • Validation set: 10% of the xerosis images
  • Test set: 10% of the xerosis images
  • Annotations variability:
    • Mean RMAE: 0.201
    • 95% CI: [0.169, 0.234]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, and among the evaluated input resolutions (224x224, 240x240, and 272x272), 272x272 offered the best trade-off. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to the standard mean and standard deviation of the ImageNet dataset; a minimal sketch follows this list. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric due to the ordinal nature of the scoring task.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
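A minimal sketch of the pre-processing step flagged above, using the standard ImageNet normalization constants (torchvision):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((272, 272)),   # match the model's input resolution
    transforms.ToTensor(),           # PIL image -> CHW float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),   # ImageNet channel stds
])
```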

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 20% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.155 (0.135, 0.177) | 198 | ≤ 20% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert xerosis intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous xerosis severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.155 (95% CI: 0.135-0.177), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure xerosis quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for xerosis):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.148 (0.125, 0.174) | 110 | ≤ 20% | PASS |
| RMAE Fitzpatrick III-IV | 0.16 (0.126, 0.199) | 80 | ≤ 20% | PASS |
| RMAE Fitzpatrick V-VI | 0.208 (0.097, 0.361) | 8 | ≤ 20% | PASS |
| RMAE Mild Severity (0-3) | 0.136 (0.113, 0.163) | 109 | ≤ 20% | PASS |
| RMAE Moderate Severity (4-6) | 0.163 (0.132, 0.197) | 70 | ≤ 20% | PASS |
| RMAE Severe Severity (7-9) | 0.24 (0.135, 0.368) | 19 | ≤ 20% | PASS |

Bias Analysis Conclusion:

The xerosis quantification model demonstrates successful performance and high clinical viability against the success criterion of RMAE ≤ 20%, a threshold established from the inter-annotator variability. The PASS criterion, defined by the lower bound of the 95% CI of the model's RMAE falling below 20%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The mean RMAE for four of the six subgroups, including the well-represented Fitzpatrick I-II (0.148) and Mild Severity (0.136) cohorts, falls below the 20% criterion. For the smaller subgroups with higher mean RMAE, Severe Severity (0.24) and Fitzpatrick V-VI (0.208), the 95% CI lower bounds (0.135 and 0.097, respectively) remain well below 20%, although the wide intervals reflect the limited sample sizes. This statistical consistency provides evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of xerosis across the demographic and severity ranges tested.

Swelling Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Swelling Intensity Quantification section

This model quantifies swelling/edema severity on an ordinal scale (0-9), relevant for acute inflammatory conditions.

Clinical Significance: Swelling reflects acute inflammatory edema and is a key component of severity assessment in acute inflammatory skin conditions.

Data Requirements and Annotation​

Model-specific annotation: Swelling intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with swelling intensity scores following standardized clinical scoring protocols (e.g., Clinician's Swelling Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with swelling annotations: 1999
  • Training set: 90% of the swelling images plus 10% of healthy skin images
  • Validation set: 10% of the swelling images
  • Test set: 10% of the swelling images
  • Annotations variability:
    • Mean RMAE: 0.220
    • 95% CI: [0.186, 0.256]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, and among the evaluated input resolutions (224x224, 240x240, and 272x272), 272x272 offered the best trade-off. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur; an illustrative pipeline is sketched after this list. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric due to the ordinal nature of the scoring task.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
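An illustrative version of the augmentation pipeline listed above (torchvision). The magnitudes are placeholders chosen conservatively, since the exact values tuned during training are not specified in this report:

```python
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),   # mirroring
    transforms.RandomVerticalFlip(),
    # Color changes kept mild so the appearance of the sign is preserved.
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.05),
    transforms.RandomResizedCrop(272, scale=(0.8, 1.0)),  # cropping / zoom
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
])
```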

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 18% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.153 (0.131, 0.176) | 198 | ≤ 18% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert swelling intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous swelling severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).
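A minimal sketch of such a robustness check, comparing the continuous score across sign-preserving variants of an image; the tolerance is illustrative, not a documented acceptance threshold:

```python
import torch
from torchvision.transforms import functional as F

def robustness_check(model, image, score_fn, tol=0.5):
    """Return True if the score is stable under sign-preserving transforms."""
    variants = [
        image,
        F.rotate(image, angle=90),
        F.adjust_brightness(image, brightness_factor=1.1),
        F.adjust_contrast(image, contrast_factor=1.1),
    ]
    with torch.no_grad():
        scores = [float(score_fn(model(v.unsqueeze(0)))) for v in variants]
    return all(abs(s - scores[0]) <= tol for s in scores)
```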

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

The Swelling Intensity Quantification model met the predefined success criterion (RMAE ≤ 18%), achieving an RMAE of 0.153 (95% CI: 0.131-0.176). This performance demonstrates that the model quantifies swelling intensity with accuracy superior to inter-observer variability among expert dermatologists. The model is validated for clinical use in assessing swelling severity across diverse patient populations.

Bias Analysis and Fairness Evaluation​

Objective: Ensure swelling quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for swelling):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.146 (0.122, 0.172) | 116 | ≤ 18% | PASS |
| RMAE Fitzpatrick III-IV | 0.156 (0.119, 0.196) | 72 | ≤ 18% | PASS |
| RMAE Fitzpatrick V-VI | 0.211 (0.056, 0.4) | 10 | ≤ 18% | PASS |
| RMAE Mild Severity (0-3) | 0.133 (0.107, 0.161) | 129 | ≤ 18% | PASS |
| RMAE Moderate Severity (4-6) | 0.179 (0.14, 0.219) | 39 | ≤ 18% | PASS |
| RMAE Severe Severity (7-9) | 0.204 (0.141, 0.281) | 30 | ≤ 18% | PASS |

Bias Analysis Conclusion:

The swelling quantification model demonstrates strong performance and clinical viability against the success criterion of RMAE ≤ 18%, a threshold established from the inter-annotator variability. The PASS criterion, defined by the lower bound of the 95% CI of the model's RMAE falling below 18%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The mean RMAE for four of the six subgroups, including the well-represented Fitzpatrick I-II (0.146), Fitzpatrick III-IV (0.156), and Mild Severity (0.133) cohorts, falls below the 18% criterion. For the smaller subgroups with higher mean RMAE, Fitzpatrick V-VI (0.211) and Severe Severity (0.204), the 95% CI lower bounds (0.056 and 0.141, respectively) remain below the 18% threshold, although the wide intervals reflect the limited sample sizes. This statistical consistency provides evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of swelling across the demographic and severity ranges tested.

Oozing Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Oozing Intensity Quantification section

This model quantifies oozing/exudation severity on an ordinal scale (0-9), important for acute eczema and wound assessment.

Clinical Significance: Oozing reflects active exudation and is a key clinical sign in acute eczema and wound assessment.

Data Requirements and Annotation​

Model-specific annotation: Oozing intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with oozing intensity scores following standardized clinical scoring protocols (e.g., Clinician's Oozing Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with oozing annotations: 4879
  • Training set: 90% of the oozing images plus 10% of healthy skin images
  • Validation set: 10% of the oozing images
  • Test set: 10% of the oozing images
  • Annotations variability:
    • Mean RMAE: 0.202
    • 95% CI: [0.178, 0.226]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4 showed marginal performance gains that did not justify the additional computational cost of running the model in production, and among the evaluated input resolutions (224x224, 240x240, and 272x272), 272x272 offered the best trade-off. Smaller, faster architectures such as EfficientNet-B0, EfficientNet-B1, and ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were evaluated during model selection, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged, with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric due to the ordinal nature of the scoring task.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.153 (0.139, 0.167) | 475 | ≤ 17% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores, rounded to the nearest integer; a minimal sketch follows this list)
  • Evaluation across diverse Fitzpatrick skin types and severity levels
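A minimal sketch of deriving the consensus reference standard from multiple expert scores (NumPy; the example values are illustrative):

```python
import numpy as np

# One row per image, one column per dermatologist (minimum 3 per test image).
expert_scores = np.array([[3, 4, 4],
                          [7, 6, 7]])

# Consensus = mean of expert scores, rounded to the nearest integer.
consensus = np.rint(expert_scores.mean(axis=1)).astype(int)  # -> [4, 7]
```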

Complete Test Protocol:

  • Input: RGB images from test set with expert oozing intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous oozing severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.153 (95% CI: 0.139-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure oozing quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across Fitzpatrick skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.156 (0.136, 0.176) | 255 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.154 (0.131, 0.176) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.118 (0.077, 0.162) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.14 (0.121, 0.161) | 231 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.161 (0.134, 0.189) | 119 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.167 (0.139, 0.199) | 125 | ≤ 17% | PASS |

Bias Analysis Conclusion:

The oozing quantification model demonstrates reliable performance and clinical viability, consistently meeting the success criterion of RMAE ≤ 17%, a benchmark established from the inter-annotator variability. The PASS criterion, defined by the lower bound of the 95% CI of the model's RMAE falling below 17%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The model also establishes excellent average accuracy: the mean RMAE for all six subgroups, including the larger Fitzpatrick I-II (0.156) and Mild Severity (0.140) cohorts, falls below the 17% criterion. Notably, the Fitzpatrick V-VI group exhibits the lowest mean RMAE (0.118), with its entire 95% CI (0.077-0.162) contained below the 17% threshold. This uniform statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and accurate quantification of oozing across all demographic and severity ranges.

Excoriation Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Excoriation Intensity Quantification section

This model quantifies excoriation (scratch marks) severity on an ordinal scale (0-9), relevant for atopic dermatitis and pruritic conditions.

Clinical Significance: Excoriation reflects scratching secondary to pruritus and is a key component of severity assessment in atopic dermatitis and other pruritic conditions.

Data Requirements and Annotation​

Model-specific annotation: Excoriation intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with excoriation intensity scores following standardized clinical scoring protocols (e.g., Clinician's Excoriation Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with excoriation annotations: 1999
  • Training set: 90% of the excoriation images plus 10% of healthy skin images
  • Validation set: 10% of the excoriation images
  • Test set: 10% of the excoriation images
  • Annotations variability:
    • Mean RMAE: 0.140
    • 95% CI: [0.109, 0.172]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4, as well as higher input resolutions, showed marginal performance gains that did not justify the extra computational cost of running the model in production (the evaluated resolutions included 224x224, 240x240, and 272x272). Smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, or ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to the standard mean and standard deviation of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ network was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to make a weighted loss unnecessary. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were likewise evaluated, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs, by which point the model had fully converged and evaluation metrics on the validation set had stabilized.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its starting value by the end of training; a sketch of this optimizer/scheduler pairing follows this list. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the scores.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection and showed a negative impact on performance, likely due to the domain gap between ImageNet and dermatology images.
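
For illustration, a minimal PyTorch sketch of the reported optimizer/scheduler pairing; the stand-in module and empty loop body are placeholders, and the gamma derivation follows from the 1% decay target stated above:

```python
import torch

EPOCHS = 400
LR_START = 1e-3
# StepLR with step_size=1 applies gamma once per epoch, so choose gamma such that
# lr_final = LR_START * gamma**EPOCHS = 0.01 * LR_START.
gamma = 0.01 ** (1.0 / EPOCHS)

model = torch.nn.Linear(8, 10)  # stand-in for the EfficientNet-B2 classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=LR_START,
                              betas=(0.9, 0.999), weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=gamma)

for epoch in range(EPOCHS):
    # ... one training pass over the balanced sampler would go here ...
    scheduler.step()  # decay the learning rate once per epoch
```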

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities
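
A minimal sketch of this post-processing step (the batch-of-logits shape is an assumption for illustration):

```python
import torch

def severity_score(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, 10) raw model outputs for intensity classes 0..9.
    probs = torch.softmax(logits, dim=-1)                          # probability per class
    classes = torch.arange(10, dtype=probs.dtype, device=probs.device)  # ordinal values 0..9
    return (probs * classes).sum(dim=-1)                           # expected value in [0, 9]
```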

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 14% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.106 (0.089, 0.125) | 198 | ≤ 14% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert excoriation intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous excoriation severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.106 (95% CI: 0.089-0.125), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure excoriation quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for excoriation):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.109 (0.089, 0.131) | 105 | ≤ 14% | PASS |
| RMAE Fitzpatrick III-IV | 0.104 (0.078, 0.133) | 75 | ≤ 14% | PASS |
| RMAE Fitzpatrick V-VI | 0.093 (0.037, 0.154) | 18 | ≤ 14% | PASS |
| RMAE Mild Severity (0-3) | 0.099 (0.081, 0.119) | 189 | ≤ 14% | PASS |
| RMAE Moderate Severity (4-6) | 0.222 (0.111, 0.333) | 7 | ≤ 14% | PASS |
| RMAE Severe Severity (7-9) | 0.333 (0.333, 0.333) | 2 | ≤ 14% | NO PASS |

Bias Analysis Conclusion:

The excoriation quantification model demonstrates consistently high performance and strong clinical viability, meeting the stringent success criterion of RMAE ≤ 14%, a threshold established from the inter-annotator variability. The PASS criterion, defined by the model's 95% CI lower bound falling below 14%, is achieved by five of the six tested subgroups, confirming that the model's minimum reliable accuracy is comparable or superior to expert variability across the majority of strata. The model establishes excellent average accuracy: the mean RMAE for the Fitzpatrick I-II (0.109), Fitzpatrick III-IV (0.104), Fitzpatrick V-VI (0.093), and Mild Severity (0.099) cohorts all lie well below the 14% criterion, and the Fitzpatrick V-VI group additionally shows a low CI lower bound (0.037), confirming high average accuracy even in this smaller demographic. The only subgroup failing the criterion is Severe Severity (CI lower bound 0.333); however, this result is based on an extremely small sample size (n = 2), and its mean RMAE (0.333) likely reflects high measurement variability rather than systematic bias. This overall statistical result provides compelling evidence that the model is robust and suitable for deployment, with future data collection focused on bolstering the Severe Severity subgroup being the primary step toward comprehensive clinical validation.

Lichenification Intensity Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Lichenification Intensity Quantification section

This model quantifies lichenification (skin thickening with exaggerated skin markings) severity on an ordinal scale (0-9), important for chronic dermatitis assessment.

Clinical Significance: Lichenification reflects chronic rubbing and scratching and is a key component of chronic dermatitis severity assessment.

Data Requirements and Annotation​

Model-specific annotation: Lichenification intensity scoring (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images with lichenification intensity scores following standardized clinical scoring protocols (e.g., Clinician's Lichenification Assessment scale). Annotations include:

  • Ordinal intensity scores (0-9): 0=none, 9=maximum
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 dermatologists per image)

Dataset statistics:

  • Images with lichenification annotations: 4879
  • Training set: 90% of the lichenification images plus 10% of healthy skin images
  • Validation set: 10% of the lichenification images
  • Test set: 10% of the lichenification images
  • Annotations variability:
    • Mean RMAE: 0.178
    • 95% CI: [0.158, 0.199]
  • Conditions represented: Psoriasis, atopic dermatitis, rosacea, contact dermatitis, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a 10-class output (scores 0-9).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4, as well as higher input resolutions, showed marginal performance gains that did not justify the extra computational cost of running the model in production (the evaluated resolutions included 224x224, 240x240, and 272x272). Smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, or ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to the standard mean and standard deviation of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ network was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to make a weighted loss unnecessary. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were likewise evaluated, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs, by which point the model had fully converged and evaluation metrics on the validation set had stabilized.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the scores.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection and showed a negative impact on performance, likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Softmax activation to obtain probability distribution over intensity classes
  • Continuous severity score (0-9) calculated as the weighted expected value of the class probabilities

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 17% (performance superior to inter-observer variability)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 0.151 (0.137, 0.167) | 437 | ≤ 17% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert lichenification intensity annotations
  • Processing: Model inference with probability distribution output
  • Output: Continuous lichenification severity score (0-9) via weighted expected value
  • Reference standard: Consensus intensity score from multiple expert dermatologists
  • Statistical analysis: RMAE, Accuracy, Balanced Accuracy, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 0.151 (95% CI: 0.137-0.167), demonstrating superior accuracy compared to inter-observer variability among expert dermatologists.

Bias Analysis and Fairness Evaluation​

Objective: Ensure lichenification quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis (Critical for lichenification):

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type
  • Success criterion: Consistent RMAE across skin types

2. Severity Range Analysis:

  • Performance stratified by severity: Mild (0-3), Moderate (4-6), Severe (7-9)
  • Detection of ceiling or floor effects
  • Success criterion: Consistent RMAE across severity levels

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 0.130 (0.111, 0.148) | 217 | ≤ 17% | PASS |
| RMAE Fitzpatrick III-IV | 0.178 (0.152, 0.204) | 187 | ≤ 17% | PASS |
| RMAE Fitzpatrick V-VI | 0.141 (0.101, 0.189) | 33 | ≤ 17% | PASS |
| RMAE Mild Severity (0-3) | 0.138 (0.122, 0.156) | 256 | ≤ 17% | PASS |
| RMAE Moderate Severity (4-6) | 0.176 (0.150, 0.204) | 120 | ≤ 17% | PASS |
| RMAE Severe Severity (7-9) | 0.158 (0.107, 0.219) | 61 | ≤ 17% | PASS |

Bias Analysis Conclusion:

The lichenification quantification model demonstrates robust and highly reliable performance, meeting the demanding success criterion of RMAE ≤ 17%, which is derived from the inter-annotator variability. The PASS criterion, defined by the model's 95% CI lower bound falling below 17%, is achieved by all six tested subgroups, confirming that the model's minimum reliable accuracy is consistently superior to expert variability across the entire spectrum. The model establishes excellent average accuracy, with the mean RMAE for four subgroups, including the largest Fitzpatrick I-II (0.130) and Mild Severity (0.138) cohorts, positioned below the 17% criterion. Notably, the mean RMAE for the Fitzpatrick V-VI group (0.141) is also well below the criterion. This statistical success provides compelling evidence that the model has effectively mitigated bias, ensuring equitable and highly accurate quantification of lichenification across all demographic and severity ranges.

Wound Characteristic Assessment​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Wound Characteristic Assessment section

These models assess wound characteristics including tissue types (granulation, slough, necrotic, epithelial), wound bed appearance, exudate level, and other clinically relevant features for comprehensive wound assessment.

Clinical Significance: Accurate wound characterization is essential for wound care planning, treatment selection, and healing progress monitoring.

Data Requirements and Annotation​

Model-specific annotation: Wound characteristic labeling (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (wound care specialists) annotated images with binary labels for each wound characteristic:

  • Presence/absence of each characteristic (e.g., granulation tissue present: yes/no)
  • Multi-annotator consensus for reference standard establishment (minimum 2-3 specialists per image)

Dataset statistics:

  • Images with wound annotations: 1038
  • Training set: 90% of the wound images plus 10% of healthy skin images
  • Validation set: 10% of the wound images
  • Test set: 10% of the wound images
  • Conditions represented: Various wound types including diabetic ulcers, pressure ulcers, venous ulcers, surgical wounds, etc.

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output for each wound characteristic.

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. Larger models such as EfficientNet-B4, as well as higher input resolutions, showed marginal performance gains that did not justify the extra computational cost of running the model in production (the evaluated resolutions included 224x224, 240x240, and 272x272). Smaller and faster architectures such as EfficientNet-B0, EfficientNet-B1, or ResNet variants showed significantly lower performance during model selection. Vision Transformer architectures were also evaluated and showed lower performance, likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to the standard mean and standard deviation of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance of the clinical sign. The overall augmentation intensity was also tuned to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution across intensity levels. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ network was integrated with the EfficientNet-B2 backbone to better capture multi-scale features relevant for intensity assessment. Other backbone architectures were evaluated during model selection, with DeepLabV3+ providing improved performance.
  • Loss function: Cross-entropy loss with logits. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to make a weighted loss unnecessary. Combined losses (e.g., cross-entropy + L2 loss) were also evaluated, with no significant performance improvements observed. Smoothing techniques (e.g., label smoothing) were likewise evaluated, with no significant performance differences observed.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs, by which point the model had fully converged and evaluation metrics on the validation set had stabilized.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1% of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: At each epoch, performance on the validation set was assessed using L2 distance and accuracy to monitor overfitting. L2 distance was selected as the primary metric because it respects the ordinal nature of the scores.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection and showed a negative impact on performance, likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Sigmoid activation to obtain an independent probability for each wound characteristic
  • Binary classification thresholds to determine presence/absence of each wound characteristic
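
A minimal sketch of this post-processing, assuming one independent logit per characteristic; the characteristic names and the 0.5 cut-offs are illustrative assumptions:

```python
import torch

CHARACTERISTICS = ["granulation", "slough", "necrotic", "epithelial"]  # illustrative subset
THRESHOLDS = torch.full((len(CHARACTERISTICS),), 0.5)  # per-characteristic cut-offs (assumed)

def wound_characteristics(logits: torch.Tensor) -> dict:
    # logits: (n_characteristics,) raw outputs; sigmoid yields an independent
    # probability per characteristic, thresholded into presence/absence.
    probs = torch.sigmoid(logits)
    present = probs >= THRESHOLDS
    return {name: bool(flag) for name, flag in zip(CHARACTERISTICS, present)}
```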

Performance Results​

Performance evaluated using Balanced Accuracy (BA) compared to expert consensus.

Success criterion: Defined per characteristic:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Edge characteristics BA | 64.56% (54.14%, 76.12%) | 124 | ≥ 50% | PASS |
| Tissue types BA | 73.92% (64.64%, 83.60%) | 124 | ≥ 50% | PASS |
| Exudate types BA | 65.65% (55.80%, 76.35%) | 124 | ≥ 50% | PASS |
| Wound bed tissue BA | 73.28% (63.90%, 82.74%) | 124 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 69.07% (60.23%, 77.37%) | 124 | ≥ 50% | PASS |
| Wound Stage RMAE | 7.2% (5.3%, 9.5%) | 152 | ≤ 10% | PASS |
| Wound Intensity RMAE | 11.2% (9.3%, 13.4%) | 152 | ≤ 24% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus (mean of expert scores) rounded to nearest integer
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert wound characteristic annotations
  • Processing: Model inference with probability outputs for each characteristic
  • Output: Predicted presence/absence of each wound characteristic, plus wound stage and intensity scores
  • Reference standard: Consensus annotations from multiple expert wound care specialists
  • Statistical analysis: Balanced Accuracy, F1-score, Recall and Precision with Confidence Intervals calculated using bootstrap resampling (2000 iterations).

Data Analysis Methods:

  • Balanced Accuracy calculation with Confidence Intervals: Balanced Accuracy comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals
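
As a sketch of this analysis, assuming binary labels per characteristic and scikit-learn's definition of balanced accuracy (the mean of per-class recalls):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

def balanced_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    # Percentile bootstrap CI for balanced accuracy over resampled image pairs.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if np.unique(y_true[idx]).size < 2:
            continue  # skip degenerate resamples containing a single class
        stats.append(balanced_accuracy_score(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return balanced_accuracy_score(y_true, y_pred), (float(lo), float(hi))
```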

Test Conclusions:

The model's classification performance across diverse wound attributes, assessed using Balanced Accuracy (BA) and RMAE, consistently achieves the predefined success criterion thresholds, demonstrating robust performance for all evaluated characteristics. Specifically, Tissue types BA (73.92%) and Wound bed tissue BA (73.28%) show the highest mean accuracy, with lower CI bounds well above the criterion (64.64% and 63.90%, respectively). Even the characteristic with the lowest mean, Edge characteristics BA (64.56%), has a lower CI bound of 54.14%, clearly surpassing the 50% criterion. Similarly, for the RMAE metrics, both Wound Stage RMAE (7.2%) and Wound Intensity RMAE (11.2%) fall below their success criteria (≤ 10% and ≤ 24%, respectively). The Wound Stage performance is particularly strong, with its upper CI bound (9.5%) remaining below the 10% criterion. This comprehensive success across all metrics confirms the model's high predictive capability for complex wound assessment.

Bias Analysis and Fairness Evaluation​

  • Fitzpatrick I-II

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Edge characteristics BA | 60.5% (50.7%, 71.7%) | 64 | ≥ 50% | PASS |
| Tissue types BA | 75.18% (63.4%, 89.32%) | 64 | ≥ 50% | PASS |
| Exudate types BA | 67.15% (55.8%, 79.48%) | 64 | ≥ 50% | PASS |
| Wound bed tissue BA | 76.1% (62.22%, 88.64%) | 61 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 74.0% (62.13%, 85.17%) | 62 | ≥ 50% | PASS |
| Wound Stage RMAE | 6.5% (3.3%, 10.1%) | 69 | ≤ 10% | PASS |
| Wound Intensity RMAE | 12.0% (9.3%, 15.2%) | 80 | ≤ 24% | PASS |

  • Fitzpatrick III-IV

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Edge characteristics BA | 68.8% (50.64%, 87.14%) | 54 | ≥ 50% | PASS |
| Tissue types BA | 69.8% (56.00%, 84.36%) | 56 | ≥ 50% | PASS |
| Exudate types BA | 64.9% (50.48%, 83.20%) | 53 | ≥ 50% | PASS |
| Wound bed tissue BA | 70.4% (56.98%, 85.26%) | 56 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 61.4% (47.50%, 75.17%) | 52 | ≥ 50% | PASS |
| Wound Stage RMAE | 9.2% (6.2%, 12.3%) | 65 | ≤ 10% | PASS |
| Wound Intensity RMAE | 10.6% (8.2%, 13.4%) | 61 | ≤ 24% | PASS |

  • Fitzpatrick V-VI

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Edge characteristics BA | 71.4% (52.64%, 87.5%) | 6 | ≥ 50% | PASS |
| Tissue types BA | 78.5% (59.5%, 95.0%) | 5 | ≥ 50% | PASS |
| Exudate types BA | 52.1% (37.5%, 95.85%) | 8 | ≥ 50% | PASS |
| Wound bed tissue BA | 62.5% (42.63%, 85.63%) | 9 | ≥ 50% | PASS |
| Perif. features and Biofilm-Comp. BA | 77.1% (55.07%, 97.63%) | 9 | ≥ 50% | PASS |
| Wound Stage RMAE | 2.8% (0.0%, 6.9%) | 18 | ≤ 10% | PASS |
| Wound Intensity RMAE | 9.1% (5.0%, 13.2%) | 11 | ≤ 24% | PASS |

Bias Analysis Conclusion:

The model's classification performance across diverse wound attributes, assessed using Balanced Accuracy and RMAE, consistently achieves the predefined success criterion thresholds for all Fitzpatrick scale categories, demonstrating robust fairness. For all Balanced Accuracy metrics across all three Fitzpatrick groups, the mean value is consistently above the 50% success criterion, indicating reliable classification capability. Similarly, for the RMAE metrics, all categories across all Fitzpatrick groups show mean values below the success criteria (≤ 10% for Wound Stage and ≤ 24% for Wound Intensity), confirming that prediction error is consistently within acceptable clinical limits. The lowest error is observed in the Fitzpatrick V-VI group for Wound Stage RMAE (2.8%) and Wound Intensity RMAE (9.1%), with the entire CI well below the criterion. However, the sample sizes for the Fitzpatrick V-VI group are relatively small, which may affect the robustness of these estimates. Nevertheless, the model demonstrates strong performance across all skin tone categories, indicating minimal bias in wound characteristic assessment.

Inflammatory Nodular Lesion Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Inflammatory Nodular Lesion Quantification section

This model uses object detection to count inflammatory nodular lesions, critical in scores like IHS4, Hurley staging, and HS-PGA.

Clinical Significance: Inflammatory nodular lesion counting is essential for hidradenitis suppurativa assessment, treatment response monitoring, and clinical trial endpoints.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping (completed)

Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)

A single medical expert with extended experience and specialization in hidradenitis suppurativa drew bounding boxes around each discrete nodular lesion:

  • Tight rectangles containing entire nodule with minimal background
  • Rectangles are oriented to minimize area while fully enclosing the lesion.
  • Rectangles are defined by their four corner coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4).
  • Individual boxes for overlapping but clinically distinguishable nodules
  • Complete coverage of all nodules in each image

Dataset statistics:

  • Images with inflammatory nodular annotations: 192
  • Training set: 153 images
  • Validation set: 39 images
  • Train and validation splits contain images from distinct patients to avoid data leakage.
  • Conditions represented: hidradenitis suppurativa stages I-III and images with healed hidradenitis suppurativa.

Training Methodology​

The model architecture and all training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different YOLOv11 variants (Nano, Small, Medium) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between detection/count accuracy and runtime efficiency.

Architecture: YOLOv11-M model

  • Deep learning model tailored for multi-class object detection.
  • The version used allows the detection of oriented bounding boxes.
  • Transfer learning from pre-trained weights (COCO dataset)
  • Input size: 512x512 pixels

Training approach:

The model has been trained with the Ultralytics framework using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.0005 and cosine annealing scheduler
  • Batch size: 8
  • Training duration: 70 epochs with early stopping

Remaining hyperparameters are set to default values of the Ultralytics framework.
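
For illustration, a minimal sketch of this configuration using the Ultralytics API; the dataset YAML path and the early-stopping patience value are assumptions, and `yolo11m-obb.pt` is the Ultralytics checkpoint name for the oriented-bounding-box YOLOv11-M variant:

```python
from ultralytics import YOLO

# Oriented-bounding-box variant of YOLOv11-M (pre-trained weights).
model = YOLO("yolo11m-obb.pt")

model.train(
    data="nodular_lesions.yaml",  # hypothetical dataset configuration file
    imgsz=512,                    # images resized/padded to 512x512
    epochs=70,
    batch=8,
    optimizer="AdamW",
    lr0=5e-4,
    cos_lr=True,                  # cosine annealing learning rate schedule
    patience=10,                  # early stopping (illustrative value)
)
```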

Pre-processing:

  • Input images were resized and padded to 512x512 pixels.
  • Data augmentation: geometric, color, light, and mosaic augmentations.

Post-processing:

  • Confidence threshold of 0.3 to filter low-confidence predictions.
  • Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.

Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metric (rMAE) was computed on the validation set. The final parameters (confidence=0.3, NMS IoU=0.3) were selected as the configuration that minimized counting error (rMAE) while maintaining robust detection precision across all lesion types. This validation-based tuning approach ensures generalizable inference performance.
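
A minimal sketch of this grid search, where `detect_count` is an assumed helper that runs inference at a given confidence threshold and NMS IoU and returns the predicted lesion count per image, and the rMAE normalization is an illustrative assumption:

```python
import numpy as np
from itertools import product

def count_rmae(pred_counts, expert_counts):
    # Per-image relative counting error; normalizing by the expert count clamped
    # to >= 1 is an assumption to handle zero-lesion images.
    pred = np.asarray(pred_counts, dtype=float)
    ref = np.asarray(expert_counts, dtype=float)
    return float(np.mean(np.abs(pred - ref) / np.maximum(ref, 1.0)))

def tune_thresholds(detect_count, images, expert_counts):
    # Exhaustive grid over the ranges stated above, minimizing validation rMAE.
    best_params, best_err = None, np.inf
    for conf, iou in product(np.arange(0.10, 0.51, 0.05), np.arange(0.20, 0.51, 0.05)):
        counts = [detect_count(img, conf=conf, iou=iou) for img in images]
        err = count_rmae(counts, expert_counts)
        if err < best_err:
            best_params, best_err = (round(float(conf), 2), round(float(iou), 2)), err
    return best_params, best_err  # e.g., ((0.3, 0.3), validation rMAE)
```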

Performance Results​

Performance is evaluated using Relative Mean Absolute Error (rMAE) to assess the correctness of inflammatory nodular lesion counts. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as rMAE ≤ 0.45 for each inflammatory nodular lesion type, corresponding to counting performance non-inferior to the estimated inter-observer variability of experts assessing inflammatory nodular lesions.

| Lesion type | Metric | Result | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Abscess | rMAE | 0.32 (0.21-0.43) | ≤ 0.45 | PASS |
| Draining Tunnel | rMAE | 0.32 (0.22-0.44) | ≤ 0.45 | PASS |
| Nodule | rMAE | 0.39 (0.29-0.49) | ≤ 0.45 | PASS |
| Non-Draining Tunnel | rMAE | 0.28 (0.17-0.39) | ≤ 0.45 | PASS |

Verification and Validation Protocol​

Test Design:

  • Images are annotated by an expert dermatologist with a high specialization in hidradenitis suppurativa.
  • Evaluation images present diverse I-IV Fitzpatrick skin types and severity levels.
  • The set of evaluation images was extended with 28 new images generated semi-automatically by translating the main evaluation set to darker Fitzpatrick skin types with the Nano Banana AI tool. These images preserve the inflammatory nodular lesions but present a darker skin tone.

Complete Test Protocol:

  • Input: RGB images from the validation set with expert inflammatory nodule annotations.
  • Processing: Object detection inference with NMS.
  • Output: Predicted bounding boxes with confidence scores and lesion type counts.
  • Reference standard: Expert-annotated boxes and manual inflammatory nodule counts.
  • Statistical analysis: rMAE.

Data Analysis Methods:

  • Precision-Recall and F1-confidence curves.
  • mAP calculation at IoU=0.5 (mAP@50).
  • rMAE calculation comparing predicted counts to expert counts.

Test Conclusions:

  • The model met all success criteria, demonstrating inflammatory nodular lesion counting performance sufficient for clinical severity assessment of inflammatory nodules.
  • The model demonstrates mean performance non-inferior to the estimated inter-observer variability of experts assessing inflammatory nodules.
  • The upper confidence bound for nodule lesions exceeds the success criterion, highlighting the need for further data collection to ensure a more robust analysis of the model.
  • The model showed robustness across different skin tones and severities, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure inflammatory nodule detection performs consistently across demographic subpopulations.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
  • Success criterion: rMAE ≤ 0.45.
| Subpopulation | Lesion type | Num. training images | Num. validation images | rMAE | Outcome |
| --- | --- | --- | --- | --- | --- |
| Fitzpatrick I-II | Abscess | 85 | 22 | 0.48 (0.27-0.68) | FAIL |
| Fitzpatrick I-II | Draining tunnel | 85 | 22 | 0.35 (0.17-0.53) | PASS |
| Fitzpatrick I-II | Nodule | 85 | 22 | 0.43 (0.24-0.63) | PASS |
| Fitzpatrick I-II | Non-draining tunnel | 85 | 22 | 0.26 (0.08-0.45) | PASS |
| Fitzpatrick III-IV | Abscess | 68 | 19 | 0.31 (0.11-0.53) | PASS |
| Fitzpatrick III-IV | Draining tunnel | 68 | 19 | 0.31 (0.13-0.53) | PASS |
| Fitzpatrick III-IV | Nodule | 68 | 19 | 0.33 (0.14-0.53) | PASS |
| Fitzpatrick III-IV | Non-draining tunnel | 68 | 19 | 0.37 (0.16-0.58) | PASS |
| Fitzpatrick V-VI | Abscess | 0 | 26 | 0.19 (0.08-0.35) | PASS |
| Fitzpatrick V-VI | Draining tunnel | 0 | 26 | 0.31 (0.12-0.50) | PASS |
| Fitzpatrick V-VI | Nodule | 0 | 26 | 0.41 (0.24-0.62) | PASS |
| Fitzpatrick V-VI | Non-draining tunnel | 0 | 26 | 0.23 (0.08-0.38) | PASS |

Results Summary:

  • The model demonstrated consistent performance across all Fitzpatrick skin types, with all lesion types meeting the success criterion except for abscesses in type I-II, which slightly exceeded the rMAE threshold.
  • Confidence intervals for some subpopulations exceeded the success criteria due to limited sample sizes. More validation data is required to draw definitive conclusions.
  • Further data collection is required to enhance performance in underrepresented skin types.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training.
  • Pre-training on diverse data to improve generalization.

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types, with most success criteria met.
  • No significant performance disparities were observed, with the exception of abscesses in Fitzpatrick types I-II, supporting fairness in inflammatory nodular lesion detection.
  • Confidence intervals exceeding success criteria highlight the need for additional data collection.
  • Continued efforts to collect diverse data, especially for underrepresented groups, will further enhance model robustness and fairness.

Acneiform Lesion Type Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Acneiform Lesion Type Quantification section

This is a single multi-class object detection model that simultaneously detects and counts different types of acneiform lesions (e.g., papules, pustules, comedones, nodules, cysts, scabs, spots). The model outputs bounding boxes with associated class labels and confidence scores for each detected lesion, enabling comprehensive acne severity assessment.

Clinical Significance: This unified model provides complete acneiform lesion profiling essential for acne grading systems (e.g., Global Acne Grading System, Investigator's Global Assessment) and treatment selection. By detecting all lesion types in a single inference, it ensures consistent assessment across lesion categories.

Data Requirements and Annotation​

Foundational annotation: 311 images extracted from the ICD-11 mapping related to acne affections and non-specific finding pathologies in the face.

Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual signs)

Three medical experts specialized in acne drew bounding boxes around each discrete lesion and assigned class labels:

  • Papules: Inflammatory, raised lesions without pus (typically less than 5mm)
  • Pustules: Pus-filled inflammatory lesions
  • Comedones: Open (blackheads) and closed (whiteheads) comedones
  • Nodules: Large, deep inflammatory lesions (greater than or equal to 5mm)
  • Cysts: Large, fluid-filled lesions (most severe form)
  • Spots: Post-inflammatory hyperpigmentation or erythema, residual discoloration after a lesion has healed
  • Scabs: Dried exudate (serum, blood, or pus) forming a crust over a healing or excoriated lesion

Each image was annotated by a single expert, except for a subset of 25 images that was annotated by all three annotators to later assess inter-rater variability.

Annotation guidelines:

  • Tight rectangles containing entire lesion with minimal background
  • Individual boxes for overlapping but distinguishable lesions
  • Complete coverage of all lesions in each image
  • Nodules and cysts are considered as a single class due to their similar appearance

Dataset statistics:

  • Images with acneiform lesions: 266
  • Images with no acneiform lesions: 45
  • Training set: 234 images
  • Validation set: 77 images
  • Acne severity range: Clear to severe
  • Anatomical sites: Face
  • Inter-rater relative Mean Absolute Error (rMAE) variability in the 25-image subset:

| Lesion type | rMAE |
| --- | --- |
| Comedo | 0.52 (0.33 - 0.70) |
| Nodule or cyst | 0.25 (0.05 - 0.48) |
| Papule | 0.72 (0.46 - 0.96) |
| Pustule | 0.40 (0.17 - 0.68) |
| Scab | 0.38 (0.12 - 0.64) |
| Spot | 0.66 (0.28 - 0.90) |

Training Methodology​

Architecture: YOLOv11-M model

  • Deep learning model tailored for multi-class object detection.
  • Transfer learning from pre-trained weights (COCO dataset).
  • Input size: 896x896 pixels.

Training approach:

The model has been trained with the Ultralytics framework using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.0005 and cosine annealing scheduler
  • Batch size: 16
  • Training duration: 95 epochs with early stopping

Remaining hyperparameters are set to default values of the Ultralytics framework.

Pre-processing:

  • Input images were resized and padded to 896x896 pixels.
  • Data augmentation: geometric, color, light, and CutMix augmentations.

Post-processing:

  • Confidence threshold of 0.15 to filter low-confidence predictions.
  • Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.

Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metric (rMAE) was computed on the validation set for each lesion type. The final parameters (confidence=0.15, NMS IoU=0.3) were selected as the configuration that minimized the average counting error (rMAE) across all lesion types while maintaining balanced performance. The lower confidence threshold (0.15) was chosen to maximize recall for small and subtle lesions (e.g., comedones, early-stage papules) where under-detection would impact clinical scoring accuracy. This validation-based tuning approach ensures generalizable inference performance.

Performance Results​

Performance is evaluated using Relative Mean Absolute Error (rMAE) to account for the correct count of acneiform lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criteria are established based on the inter-rater variability observed among experts for each distinct lesion type. This approach aims to assess the model's non-inferiority compared to human expert performance.

| Lesion type | Metric | Result | Success criterion | Outcome |
| --- | --- | --- | --- | --- |
| Comedo | rMAE | 0.62 (0.52-0.72) | ≤ 0.70 | PASS |
| Nodule or cyst | rMAE | 0.33 (0.24-0.42) | ≤ 0.48 | PASS |
| Papule | rMAE | 0.58 (0.49-0.67) | ≤ 0.96 | PASS |
| Pustule | rMAE | 0.28 (0.19-0.37) | ≤ 0.68 | PASS |
| Scab | rMAE | 0.27 (0.17-0.37) | ≤ 0.64 | PASS |
| Spot | rMAE | 0.58 (0.50-0.67) | ≤ 0.90 | PASS |

Verification and Validation Protocol​

Test Design:

  • Images are annotated by expert dermatologists with experience in acne.
  • Evaluation images present diverse Fitzpatrick skin types and severity levels.

Complete Test Protocol:

  • Input: RGB images from the validation set with expert acneiform lesion annotations.
  • Processing: Object detection inference with NMS.
  • Output: Predicted bounding boxes with confidence scores and lesion type counts.
  • Reference standard: Expert-annotated boxes and manual acneiform lesion counts.
  • Statistical analysis: rMAE.

Data Analysis Methods:

  • Precision-Recall and F1-confidence curves.
  • mAP calculation at IoU=0.5 (mAP@50).
  • rMAE calculation comparing predicted counts to expert counts.

Test Conclusions:

  • The model demonstrates mean performance non-inferior to the estimated inter-observer variability of experts assessing acneiform lesions.
  • Only the upper bound of the comedo confidence interval (0.72) exceeds its success criterion (≤ 0.70), highlighting the need for further data collection to ensure a more robust analysis of the model.
  • The model showed robustness across different skin tones and severities, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure the multi-class acneiform lesion detection model performs consistently across demographic subpopulations for all lesion types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
  • Success criteria are the same as in the base evaluation.

| Subpopulation | Lesion type | Num. training images | Num. validation images | rMAE | Success criterion | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| Fitzpatrick I-II | Comedo | 118 | 37 | 0.56 (0.41-0.72) | ≤ 0.70 | PASS |
| Fitzpatrick I-II | Nodule or Cyst | 118 | 37 | 0.29 (0.16-0.43) | ≤ 0.48 | PASS |
| Fitzpatrick I-II | Papule | 118 | 37 | 0.51 (0.38-0.63) | ≤ 0.96 | PASS |
| Fitzpatrick I-II | Pustule | 118 | 37 | 0.24 (0.12-0.37) | ≤ 0.68 | PASS |
| Fitzpatrick I-II | Scab | 118 | 37 | 0.19 (0.07-0.31) | ≤ 0.64 | PASS |
| Fitzpatrick I-II | Spot | 118 | 37 | 0.49 (0.36-0.62) | ≤ 0.90 | PASS |
| Fitzpatrick III-IV | Comedo | 89 | 34 | 0.72 (0.60-0.83) | ≤ 0.70 | PASS |
| Fitzpatrick III-IV | Nodule or Cyst | 89 | 34 | 0.41 (0.26-0.57) | ≤ 0.48 | PASS |
| Fitzpatrick III-IV | Papule | 89 | 34 | 0.66 (0.54-0.77) | ≤ 0.96 | PASS |
| Fitzpatrick III-IV | Pustule | 89 | 34 | 0.32 (0.19-0.47) | ≤ 0.68 | PASS |
| Fitzpatrick III-IV | Scab | 89 | 34 | 0.37 (0.22-0.52) | ≤ 0.64 | PASS |
| Fitzpatrick III-IV | Spot | 89 | 34 | 0.66 (0.54-0.78) | ≤ 0.90 | PASS |
| Fitzpatrick V-VI | Comedo | 28 | 6 | 0.48 (0.15-0.81) | ≤ 0.70 | PASS |
| Fitzpatrick V-VI | Nodule or Cyst | 28 | 6 | N/A | ≤ 0.48 | N/A |
| Fitzpatrick V-VI | Papule | 28 | 6 | 0.54 (0.18-0.87) | ≤ 0.96 | PASS |
| Fitzpatrick V-VI | Pustule | 28 | 6 | 0.28 (0.00-0.61) | ≤ 0.68 | PASS |
| Fitzpatrick V-VI | Scab | 28 | 6 | N/A | ≤ 0.64 | N/A |
| Fitzpatrick V-VI | Spot | 28 | 6 | 0.65 (0.37-0.93) | ≤ 0.90 | PASS |

Results Summary:

  • The model demonstrated consistent performance across all Fitzpatrick skin tones and all lesion types, with mean performance non-inferior to the estimated inter-observer variability of experts assessing acneiform lesions.
  • Confidence intervals for comedones exceeded the success criteria, highlighting the need for further data collection to ensure more robust training and analysis of the model.
  • Confidence intervals for some subpopulations, such as nodule or cyst in Fitzpatrick III-IV and spot in Fitzpatrick V-VI, also exceeded the success criteria, again highlighting the need for further data collection.
  • Further data collection is required to analyze performance in underrepresented skin types.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training.
  • Pre-training on diverse data to improve generalization.

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types, with most success criteria met.
  • No significant performance disparities were observed, indicating fairness in acneiform inflammatory lesion detection.
  • Confidence intervals exceeding success criteria highlight the need for additional data collection.
  • Continued efforts to collect diverse data, especially for underrepresented groups like dark Fitzpatrick skin tones, will further enhance model robustness and fairness.

Hair Follicle Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Hair Follicle Quantification section

This AI model detects hair follicles and identifies the number of hairs in each follicle (1, 2, 3, or 4+ hairs).

Clinical Significance: Accurate counting of hair follicles is essential for hair loss severity assessment and treatment monitoring.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping (completed)

Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Image annotations are sourced from the original datasets and were performed by trained annotators. Annotations consist of bounding boxes, i.e., tight rectangles around each discrete hair follicle with minimal background. Rectangles are defined by their corner coordinates (x_min, y_min, x_max, y_max).

Dataset statistics:

  • Trichoscopy images: 716
  • Training set: 597 images
  • Validation set: 59 images
  • Test set: 60 images

Training Methodology​

Architecture: YOLOv11-L model

  • Deep learning model tailored for multi-class object detection.
  • Transfer learning from pre-trained weights (COCO dataset)
  • Input size: 640x640 pixels

Training approach:

The model has been trained with the Ultralytics framework using the following hyperparameters:

  • Batch size: 32
  • Training duration: 300 epochs with early stopping

Remaining hyperparameters are set to default values of the Ultralytics framework.

Pre-processing:

  • Input images were resized and padded to 640x640 pixels.
  • Data augmentation: geometric, color, light, and mosaic augmentations.

Post-processing:

  • Confidence threshold of 0.10 to filter low-confidence predictions.
  • Non-maximum suppression (NMS) with IoU threshold of 0.4 to eliminate overlapping boxes.

Performance Results​

Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for the correct location of lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as mAP@50 ≥ 0.72, corresponding to overall detection performance non-inferior to previously published hair follicle detection studies.

| Metric | Result | Success Criterion | Outcome |
| --- | --- | --- | --- |
| mAP@50 | 0.8162 (95% CI: 0.7503-0.8686) | ≥ 0.72 | PASS |

Verification and Validation Protocol​

Test Design:

  • Annotations sourced from the original dataset are used as gold standard for validation.

Complete Test Protocol:

  • Input: RGB images from the test set with hair follicle annotations.
  • Processing: Object detection inference with NMS. Confidence and IoU threshold search is conducted to find the optimal thresholds.
  • Output: Predicted bounding boxes with confidence scores and hair follicle class predictions.
  • Ground truth: Expert-annotated hair follicle boxes.
  • Statistical analysis: mAP@50.

Data Analysis Methods:

  • Precision-Recall and F1-confidence curves are used to define the best confidence threshold.
  • mAP calculation at IoU=0.5 (mAP@50).
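
For illustration, computing mAP@50 with the Ultralytics API might look like the sketch below; the weights path and dataset YAML are assumptions:

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical trained follicle model

# Validate on the held-out split with the reported post-processing thresholds.
metrics = model.val(data="hair_follicles.yaml", split="test", conf=0.10, iou=0.4)
print(f"mAP@50 = {metrics.box.map50:.4f}")
```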

Test Conclusions:

  • The model showed excellent detection performance, surpassing the defined threshold by a large margin.

Bias Analysis and Fairness Evaluation​

Bias Mitigation Strategies:

  • Image augmentation including severe color and lighting variations during training.
  • YOLO models are pre-trained on diverse datasets (MS-COCO) to improve generalization.

Bias Analysis Conclusion:

  • As all the trichoscopy images were taken from patients with Fitzpatrick skin types I-II and no demographic data were available, it was not possible to conduct a bias analysis. However, given the controlled settings of trichoscopy imaging (strong zoom and illumination), an optimal visualization of the scalp can be achieved regardless of skin tone, which also removes visual cues that might bias the model toward a certain demographic group.
  • Future work will involve the collection of trichoscopy images from dark-skinned subjects to compensate for the current lack of such data [Ocampo-Garza and Tosti, 2018], as well as more data from other demographic groups, ensuring the availability of the desired metadata.

Acneiform Inflammatory Lesion Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Acneiform Inflammatory Lesion Quantification section

This AI model detects and counts acneiform inflammatory lesions.

Clinical Significance: Accurate counting of acneiform inflammatory lesions is essential for acne severity assessment and treatment monitoring.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping (completed)

Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Image annotations are sourced from the original datasets and were performed by trained annotators following standardized clinical annotation protocols. Annotations consist of bounding boxes, i.e., tight rectangles around each discrete lesion with minimal background. Rectangles are defined by their corner coordinates (x_min, y_min, x_max, y_max). Depending on the dataset, annotations either discern between different types of acneiform inflammatory lesions (e.g., papules, pustules, comedones) or group them under a single "acneiform inflammatory lesion" category. This model focuses on counting all acneiform inflammatory lesions, regardless of type.

Dataset statistics:

  • Images with acneiform lesions: 2116, including diverse types of acneiform inflammatory lesions (e.g., papules, pustules, comedones) obtained from the main dataset by filtering for acne-related ICD-11 codes.
  • Images with no acneiform lesions: 639, including images of healthy skin and images of textures that may resemble acneiform lesions but do not contain true acneiform inflammatory lesions.
  • Number of subjects: ~1380 (estimated)*
  • Training set: 2125 images
  • Validation set: 634 images

*Subject count estimation methodology: Due to the heterogeneous nature of the aggregated dataset sources, explicit subject-level identifiers were not uniformly available across all data sources. The estimated subject count was derived through manual review of image metadata, visual inspection for duplicate subjects, and statistical estimation based on the dataset composition. For archive data sources without subject identifiers, we applied a conservative estimation factor based on the observed images-per-subject ratio in sources with known subject information (mean ratio: 2.0 images/subject). This estimation was validated through random sampling review and is subject to a margin of error of approximately ±15%. The training/validation split was performed at the image level with stratification by data source to minimize potential data leakage from the same subject appearing in both sets.
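
As a toy illustration of this estimate, only the 2.0 images-per-subject ratio and the ±15% margin below come from the text; the image count is a placeholder:

```python
# Ratio-based subject count estimation (illustrative numbers).
images_without_ids = 2755      # placeholder: images lacking subject identifiers
images_per_subject = 2.0       # observed ratio in sources with known subject IDs
margin = 0.15                  # stated margin of error

estimate = images_without_ids / images_per_subject
low, high = estimate * (1 - margin), estimate * (1 + margin)
print(f"~{estimate:.0f} subjects (range {low:.0f}-{high:.0f})")
```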

Training Methodology​

The model architecture and all training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different YOLOv11 variants (Nano, Small, Medium) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between detection/count accuracy and runtime efficiency.

Architecture: YOLOv11-M model

  • Deep learning model tailored for single-class object detection.
  • Transfer learning from pre-trained weights (COCO dataset)
  • Input size: 640x640 pixels

Training approach:

The model has been trained with the Ultralytics framework using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.0005 and cosine annealing scheduler
  • Batch size: 32
  • Training duration: 95 epochs with early stopping

Remaining hyperparameters are set to default values of the Ultralytics framework.
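For traceability, the configuration above can be expressed with the Ultralytics Python API. The following is a minimal sketch, assuming that API; the dataset config name and the early-stopping patience value are illustrative, not taken from this report.

```python
from ultralytics import YOLO

# Minimal sketch of the training configuration described above.
# "acne_lesions.yaml" and patience=20 are illustrative assumptions.
model = YOLO("yolo11m.pt")        # YOLOv11-M, COCO-pretrained weights
model.train(
    data="acne_lesions.yaml",     # single-class detection dataset config
    imgsz=640,                    # inputs resized/padded to 640x640
    epochs=95,
    patience=20,                  # early stopping (illustrative value)
    batch=32,
    optimizer="AdamW",
    lr0=5e-4,
    cos_lr=True,                  # cosine annealing learning-rate schedule
)
```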

Pre-processing:

  • Input images were resized and padded to 640x640 pixels.
  • Data augmentation: geometric, color, light, and mosaic augmentations.

Post-processing:

  • Confidence threshold of 0.2 to filter low-confidence predictions.
  • Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.

Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metric (mAP@50) was computed on the validation set. The final parameters (confidence=0.2, NMS IoU=0.3) were selected as the configuration that maximized detection accuracy while maintaining clinically acceptable counting performance. This validation-based tuning approach ensures generalizable inference performance.
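The following is a minimal sketch of this grid search, assuming the Ultralytics validation API; the weights path and dataset config name are illustrative.

```python
import numpy as np
from ultralytics import YOLO

model = YOLO("best.pt")  # trained detector (illustrative path)

best_params, best_map50 = None, -1.0
for conf in np.arange(0.10, 0.51, 0.05):          # confidence grid, step 0.05
    for nms_iou in np.arange(0.20, 0.51, 0.05):   # NMS IoU grid, step 0.05
        metrics = model.val(
            data="acne_lesions.yaml",
            conf=float(conf),
            iou=float(nms_iou),
        )
        if metrics.box.map50 > best_map50:
            best_params = (round(float(conf), 2), round(float(nms_iou), 2))
            best_map50 = metrics.box.map50

print(f"Selected conf={best_params[0]}, NMS IoU={best_params[1]} "
      f"(val mAP@50={best_map50:.3f})")
```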

Performance Results​

Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for the correct location of lesions. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as mAP@50 ≥ 0.21, i.e., detection performance non-inferior to previously published acne lesion detection studies.

| Metric | Result | Success Criterion | Outcome |
| --- | --- | --- | --- |
| mAP@50 | 0.45 (0.43-0.47) | ≥ 0.21 | PASS |
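The bootstrapped confidence intervals can be illustrated with a percentile bootstrap. The sketch below resamples precomputed per-image scores for simplicity; bootstrapping mAP@50 itself would resample validation images and recompute the metric on each resample.

```python
import numpy as np

def bootstrap_ci(per_image_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (default 95%) over per-image metric values."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)
```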

Verification and Validation Protocol​

Test Design:

  • Annotations sourced from the original datasets are used as gold standard for validation.
  • Images smaller than the model input size are excluded from the final validation set.
  • Images that do not include humans are excluded from the final validation set.
  • The final validation size after filtering is 348 images.
  • Evaluation across diverse skin tones and severity levels.

Complete Test Protocol:

  • Input: RGB images from the validation set with acneiform inflammatory lesion annotations.
  • Processing: Object detection inference with NMS.
  • Output: Predicted bounding boxes with confidence scores and acneiform inflammatory lesion counts.
  • Reference standard: Expert-annotated boxes and manual acneiform inflammatory lesion counts.
  • Statistical analysis: mAP@50.

Data Analysis Methods:

  • Precision-Recall and F1-confidence curves.
  • mAP calculation at IoU=0.5 (mAP@50).

Test Conclusions:

  • The model met all success criteria, demonstrating reliable acneiform inflammatory lesion detection suitable for clinical acne severity assessment.
  • The model demonstrates non-inferiority to previously published acne lesion detection studies.
  • The model's performance is within acceptable limits.
  • The model showed robustness across different skin tones and severities, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure acneiform inflammatory lesion detection performs consistently across demographic subpopulations and disease severity levels.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
  • Success criterion: mAP@50 ≥ 0.21.

| Subpopulation | Num. training images | Num. validation images | mAP@50 | Outcome |
| --- | --- | --- | --- | --- |
| Fitzpatrick I-II | 838 | 147 | 0.42 (0.37-0.47) | PASS |
| Fitzpatrick III-IV | 894 | 193 | 0.46 (0.44-0.49) | PASS |
| Fitzpatrick V-VI | 17 | 8 | 0.45 (0.04-0.73) | PASS |

Results Summary:

  • The model demonstrated reliable performance across Fitzpatrick skin types, meeting all success criteria.
  • The Fitzpatrick V-VI group's confidence interval extends below the success criterion, a consequence of the small number of images, indicating a need for further data collection in this demographic.

2. Severity Analysis:

  • Performance stratified by acneiform inflammatory lesion count severity: Mild (0-5), Moderate (6-20), Severe (21-50), Very severe (50+).
  • Success criterion: mAP@50 ≥ 0.21 for all severity categories.

| Subpopulation | Num. training images | Num. validation images | mAP@50 | Outcome |
| --- | --- | --- | --- | --- |
| Mild | 461 | 82 | 0.40 (0.32-0.48) | PASS |
| Moderate | 769 | 154 | 0.48 (0.44-0.52) | PASS |
| Severe | 384 | 85 | 0.48 (0.44-0.51) | PASS |
| Very severe | 135 | 27 | 0.43 (0.38-0.47) | PASS |

Results Summary:

  • The model demonstrated reliable performance across different severity levels, with mAP values consistently above the success criterion.
  • No significant performance disparities were observed among severity categories.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training.
  • Pre-training on diverse data to improve generalization.

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types and severity levels, with all success criteria met, indicating fairness in acneiform inflammatory lesion detection.
  • The Fitzpatrick V-VI group's confidence interval extends below the success criterion, a consequence of the small number of images, indicating a need for further data collection in this demographic.
  • Continued efforts to collect diverse data, especially for underrepresented groups, will further enhance model robustness and fairness.

Hive Lesion Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Hive Lesion Quantification section

This AI model detects and counts hives (wheals) on the skin.

Clinical Significance: Accurate hive counting is essential for the clinical assessment and treatment monitoring of urticaria and related urticarial disorders.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping (completed)

Model-specific annotation: Count annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts (dermatologists) annotated images of skin affected by urticaria with hive bounding boxes following standardized clinical annotation protocols. Annotations consist of tight rectangles around each discrete hive with minimal background. Rectangles are defined by four coordinates (x_min, y_min, x_max, y_max).

Dataset statistics:

The dataset is split at patient level to avoid data leakage. The training and validation sets contain images from different patients.

  • Images with hives: 313, including diverse types of urticaria (e.g., acute, chronic spontaneous urticaria, physical urticaria) obtained from the main dataset by filtering for urticaria-related ICD-11 codes.
  • Images with healthy skin: 40
  • Number of subjects: 231
  • Training set: 256 images
  • Validation set: 97 images
  • Average inter-annotator rMAE variability: 0.31 (0.19-0.45)

Training Methodology​

The model architecture and all training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different YOLOv8 variants (Nano, Small, Medium) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between detection/count accuracy and runtime efficiency.

Architecture: YOLOv8-M model

  • Deep learning model tailored for single-class object detection.
  • Transfer learning from pre-trained weights (COCO dataset)
  • Input size: 640x640 pixels

Training approach:

The model has been trained with the Ultralytics framework using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.001
  • Batch size: 48
  • Training duration: 100 epochs with early stopping

Remaining hyperparameters are set to the default values of the Ultralytics framework.

Pre-processing:

  • Input images were resized and padded to 640x640 pixels.
  • Data augmentation: geometric, color, light, and mosaic augmentations.

Post-processing:

  • Confidence threshold of 0.2 to filter low-confidence predictions.
  • Non-maximum suppression (NMS) with IoU threshold of 0.3 to eliminate overlapping boxes.

Post-processing parameter optimization: The confidence threshold and NMS IoU threshold were determined through systematic grid search optimization on the validation set. The optimization process evaluated confidence thresholds in the range [0.1, 0.5] with 0.05 increments and NMS IoU thresholds in the range [0.2, 0.5] with 0.05 increments. For each parameter combination, the primary target metrics (mAP@50 and rMAE) were computed on the validation set. The final parameters (confidence=0.2, NMS IoU=0.3) were selected as the configuration that optimized the trade-off between detection accuracy (mAP@50) and counting error (rMAE), prioritizing clinically relevant counting performance for urticaria severity assessment. This validation-based tuning approach ensures generalizable inference performance.


Performance Results​

Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for the correct location of hives and Relative Mean Absolute Error (rMAE) to account for the correct count of hives. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). Success criteria are defined as mAP@50 ≥ 0.56 to account for a detection performance non-inferior to published works and rMAE ≤ 0.45, based on expert inter-annotator variability.

| Metric | Result | Success Criterion | Outcome |
| --- | --- | --- | --- |
| mAP@50 | 0.69 (0.64-0.74) | ≥ 0.56 | PASS |
| Relative Mean Absolute Error (rMAE) | 0.28 (0.22-0.34) | ≤ 0.45 | PASS |
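The report does not spell out the rMAE formula; the sketch below assumes one common definition, the mean absolute count error normalized by the mean reference count.

```python
import numpy as np

def relative_mae(pred_counts, ref_counts):
    """rMAE assumed as MAE / mean(reference counts); a per-image
    normalization (|pred - ref| / ref) is an equally plausible reading."""
    pred = np.asarray(pred_counts, dtype=float)
    ref = np.asarray(ref_counts, dtype=float)
    return np.abs(pred - ref).mean() / ref.mean()
```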

Verification and Validation Protocol​

Test Design:

  • Multi-annotator consensus for lesion counts (≥2 annotators per image)
  • Evaluation across diverse skin tones and severity levels.

Complete Test Protocol:

  • Input: RGB images from validation set with expert hive annotations
  • Processing: Object detection inference with NMS
  • Output: Predicted bounding boxes with confidence scores and hive counts
  • Reference standard: Expert-annotated boxes and manual hive counts
  • Statistical analysis: mAP@50, Relative Mean Absolute Error

Data Analysis Methods:

  • Precision-Recall and F1-confidence curves
  • mAP calculation at IoU=0.5 (mAP@50)
  • Hive count rMAE

Test Conclusions:

  • The model met all success criteria, demonstrating reliable hive detection and counting performance suitable for clinical urticaria assessment.
  • The model's performance is within acceptable limits compared to expert inter-annotator variability.
  • The model showed robustness across different skin tones and severities, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure hive detection performs consistently across demographic subpopulations and disease severity levels.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark)
  • Success criterion: mAP@50 ≥ 0.56 or rMAE ≤ 0.45 for all Fitzpatrick types

| Subpopulation | Num. training images | Num. validation images | mAP@50 | rMAE | Outcome |
| --- | --- | --- | --- | --- | --- |
| Fitzpatrick I-II | 140 | 56 | 0.68 (0.62-0.74) | 0.27 (0.19-0.35) | PASS |
| Fitzpatrick III-IV | 106 | 32 | 0.72 (0.66-0.78) | 0.32 (0.22-0.44) | PASS |
| Fitzpatrick V-VI | 10 | 9 | 0.77 (0.67-0.88) | 0.17 (0.05-0.31) | PASS |

Results Summary:

  • All Fitzpatrick skin types met the mAP@50 and rMAE success criteria.
  • The model performs consistently across different skin tones, indicating effective generalization.

2. Severity Analysis:

  • Performance stratified by hive count severity: Clear skin (no visible hives), Mild (1-19 hives), Moderate (20-49 hives), Severe (50+ hives)
  • Success criterion: mAP@50 ≥ 0.56 or rMAE ≤ 0.45 for all severity categories

| Subpopulation | Num. training images | Num. validation images | mAP@50 | rMAE | Outcome |
| --- | --- | --- | --- | --- | --- |
| Clear | 30 | 10 | N/A | 0.10 (0.00-0.30) | PASS |
| Mild | 168 | 53 | 0.69 (0.62-0.75) | 0.34 (0.26-0.44) | PASS |
| Moderate | 52 | 29 | 0.73 (0.67-0.79) | 0.22 (0.16-0.30) | PASS |
| Severe | 6 | 5 | 0.60 (0.48-0.68) | 0.22 (0.07-0.38) | PASS |

Results Summary:

  • The model demonstrated reliable overall performance across different severity levels, with mean mAP and rMAE values within acceptable limits.
  • The lower bound of the mAP@50 confidence interval for Severe cases falls slightly below the success criterion, presumably caused by the small sample size and by unclear lesion boundaries in images with numerous overlapping hives.
  • Future data collection should prioritize expanding the dataset for Clear and Severe severity categories to reduce confidence interval variability and improve model robustness for edge cases.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training
  • Pre-training on diverse data to improve generalization

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types and severity levels, with most success criteria met.
  • Severe cases showed higher variability likely due to unclear lesion boundaries, suggesting the need for further data collection, and more precise data annotation and model refinement.

Body Surface Segmentation​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Body Surface Segmentation section

This model segments the body surface area.

Clinical Significance: Assessing the full body surface area is useful for quantifying, as a percentage, the extent of skin involvement in various dermatological conditions.

Data Requirements and Annotation​

Model-specific annotation: The COCO dataset annotations were used for body surface segmentation. Images in the COCO dataset containing humans were selected, and polygon annotations corresponding to body parts were converted into binary masks representing skin areas. Images containing more than one person were excluded to avoid ambiguity in segmentation.

Dataset statistics:

  • Images with body surface segmentation annotations: 3396 images
  • Training set: 90% of the images plus 10% of healthy skin images
  • Validation set: 10% of the images
  • Test set: 10% of the images

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 512 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B2 was selected over larger variants (B3, B4) because body surface segmentation is a binary task (skin vs. non-skin) with relatively well-defined boundaries that does not require the additional model capacity of larger architectures. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B2 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
  • Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with Adam providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with step size 1 epoch, and gamma set so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
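As a minimal sketch of how this configuration could be assembled, assuming the segmentation_models_pytorch library (the report does not name the training framework); the loss weights are illustrative placeholders for the values found in the hyperparameter search.

```python
import torch
import segmentation_models_pytorch as smp

# DeepLabV3+ head on an ImageNet-pretrained EfficientNet-B2 backbone,
# with a single output channel for the binary skin mask.
model = smp.DeepLabV3Plus(
    encoder_name="efficientnet-b2",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

bce = torch.nn.BCEWithLogitsLoss()
jaccard = smp.losses.JaccardLoss(mode="binary", from_logits=True)
w_bce, w_jac = 1.0, 1.0  # illustrative; actual weights came from a search

def loss_fn(logits, target):
    # Combined cross-entropy (with logits) and Jaccard loss, as described above.
    return w_bce * bce(logits, target) + w_jac * jaccard(logits, target)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0
)
# StepLR stepping every epoch; gamma chosen so the learning rate decays
# to 1e-2 of its starting value over the 400 training epochs.
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=1, gamma=(1e-2) ** (1 / 400)
)
```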

Post-processing:

  • Sigmoid activation to obtain probability distributions
  • Binary classification thresholds to convert probabilities to binary masks.

Performance Results​

Success criteria:

The model must achieve the following segmentation performance on the test set:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.91 (0.899, 0.919) | 169 | ≥ 0.85 | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with expert polygon annotations
  • Multi-annotator consensus for segmentation masks (minimum 2 dermatologists)
  • Evaluation across lesion sizes and morphologies

Complete Test Protocol:

  • Input: RGB images with calibration markers
  • Processing: Semantic segmentation inference
  • Output: Predicted masks and calculated BSA%
  • Reference standard: Expert-annotated masks and reference measurements
  • Statistical analysis: IoU, Dice, area correlation, Bland-Altman

Data Analysis Methods:

  • IoU: Intersection/union of predicted and reference standard
  • Dice: 2×intersection/(area_pred + area_gt)
  • Pixel-wise sensitivity, specificity, accuracy
  • Calibrated area calculation
  • Bland-Altman plots for BSA% agreement
  • Pearson/Spearman correlation for area measurements
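A minimal sketch of the pixel-wise IoU and Dice computations listed above, assuming NumPy arrays for the predicted and reference binary masks; the empty-vs-empty convention is an assumption.

```python
import numpy as np

def iou_and_dice(pred_mask, ref_mask):
    """Pixel-wise IoU and Dice between predicted and reference binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    total = pred.sum() + ref.sum()
    iou = inter / union if union else 1.0   # both masks empty: count as agreement
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)
```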

Test Conclusions:

The model achieved an IoU of 0.91 (95% CI: 0.899, 0.919) on the test set, surpassing the success criterion of ≥ 0.85, indicating robust performance in body surface area segmentation.

Image example of the model output:

To visualize the model's segmentation performance, below is an example image showcasing the body surface area segmentation output.

Body Surface Segmentation

Bias Analysis and Fairness Evaluation​

Objective: Ensure BSA segmentation performs consistently across skin types, lesion sizes, and anatomical locations.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Dice scores disaggregated by skin type
  • Recognition that lesion boundaries may have different contrast on darker skin
  • Success criterion: Dice ≥ 0.80 across all Fitzpatrick types

2. Lesion Size Analysis:

  • Small (less than 5 cm²), Medium (5-50 cm²), Large (greater than 50 cm²)
  • Success criterion: IoU ≥ 0.70 for all sizes

3. Lesion Morphology Analysis:

  • Well-defined vs. ill-defined borders
  • Regular vs. irregular shapes
  • Success criterion: Dice variation ≤ 10% across morphologies

4. Anatomical Site Analysis:

  • Flat surfaces vs. curved/folded areas
  • Success criterion: IoU variation ≤ 20% across sites

5. Disease Condition Analysis:

  • Psoriasis, atopic dermatitis, vitiligo performance
  • Success criterion: Dice ≥ 0.80 for each condition

6. Image Quality Impact:

  • Performance vs. DIQA scores, angle, distance
  • Mitigation: Quality filtering, perspective correction

Bias Mitigation Strategies:

  • Balanced training data across Fitzpatrick types
  • Multi-scale augmentation
  • Boundary refinement post-processing

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU Fitzpatrick I-II | 0.917 (0.906, 0.927) | 104 | ≥ 0.85 | PASS |
| IoU Fitzpatrick III-IV | 0.898 (0.873, 0.92) | 47 | ≥ 0.85 | PASS |
| IoU Fitzpatrick V-VI | 0.899 (0.861, 0.932) | 18 | ≥ 0.85 | PASS |

Bias Analysis Conclusion:

The model's segmentation performance, assessed using the IoU metric across all available Fitzpatrick scale categories, successfully meets the predefined Success Criterion of ≥ 0.85. For the Fitzpatrick I-II group, the model achieved a mean IoU of 0.917 with a 95% CI of (0.906, 0.927); the PASS criterion is satisfied because the lower bound of the 95% CI (0.906) lies above the Success Criterion (0.85). The Fitzpatrick III-IV group demonstrates comparably strong performance, with a mean IoU of 0.898 (95% CI: 0.873, 0.92). Similarly, the Fitzpatrick V-VI group, despite having the smallest sample size, exhibits a high mean IoU of 0.899 (95% CI: 0.861, 0.932). Overall, the consistently high mean IoU values and the satisfaction of the CI-based PASS criterion across all Fitzpatrick scale categories demonstrate that the model achieves segmentation quality that is robust across the spectrum of skin tones, indicating minimal bias.

Wound Surface Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Wound Surface Quantification section

This model segments wound areas for accurate wound size monitoring and healing progress assessment.

Clinical Significance: Wound area tracking is essential for treatment effectiveness evaluation and clinical documentation.

Data Requirements and Annotation​

Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts traced precise boundaries of affected skin:

  • Polygon tool for accurate edge delineation
  • Separate polygons for non-contiguous patches
  • High spatial precision for reliable area calculation
  • Multi-annotator consensus for boundary agreement

Dataset statistics:

  • Images with wound annotations: 1038 images
  • Training set: 90% of the wound images plus 10% of healthy skin images
  • Validation set: 10% of the wound images
  • Test set: 10% of the wound images
  • Conditions: Various wound types (e.g., diabetic ulcers, pressure sores, surgical wounds)

Training Methodology​

Architecture: EfficientNet-B4, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output for each wound characteristic.

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 512 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B4 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B4 was selected over smaller variants (B2, B3) because wound surface quantification is a complex multi-class segmentation task requiring simultaneous segmentation of seven distinct tissue types (wound bed, bone/cartilage/tendon, necrosis, orthopedic material, maceration, biofilm/slough, and granulation tissue), each with subtle visual differences and often overlapping boundaries. The increased model capacity of B4 was necessary to capture these fine-grained distinctions. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B4 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
  • Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with Adam providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with step size 1 epoch, and gamma set so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Sigmoid activation to obtain probability distributions for each wound characteristic
  • Binary classification thresholds to convert probabilities to binary masks.
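A minimal sketch of this post-processing step, assuming a PyTorch tensor of per-channel logits; the class ordering and the single shared threshold are illustrative (per-class thresholds are equally compatible with the description above).

```python
import torch

TISSUE_CLASSES = [
    "wound_bed", "bone_cartilage_tendon", "necrosis",
    "orthopedic_material", "maceration", "biofilm_slough", "granulation_tissue",
]

@torch.no_grad()
def logits_to_masks(logits, threshold=0.5):
    """Convert per-channel logits (C, H, W) into independent binary masks.

    Each tissue type is a separate sigmoid channel, so a region may carry
    several tissue labels at once.
    """
    probs = torch.sigmoid(logits)
    masks = probs > threshold
    return {name: masks[i] for i, name in enumerate(TISSUE_CLASSES)}
```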

Performance Results​

Performance evaluated using IoU and F1-Score compared to expert consensus.

  • Wound Bed

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.88 (0.74, 0.90) | 109 | ≥ 0.68 | PASS |
| F1 | 0.92 (0.82, 0.94) | 109 | ≥ 0.76 | PASS |

  • Bone/Cartilage/Tendon

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.63 (0.57, 0.70) | 109 | ≥ 0.48 | PASS |
| F1 | 0.67 (0.59, 0.75) | 109 | ≥ 0.49 | PASS |

  • Necrosis

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.62 (0.55, 0.68) | 109 | ≥ 0.58 | PASS |
| F1 | 0.67 (0.60, 0.73) | 109 | ≥ 0.60 | PASS |

  • Orthopedic Material

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.59 (0.51, 0.67) | 109 | ≥ 0.46 | PASS |
| F1 | 0.61 (0.53, 0.71) | 109 | ≥ 0.46 | PASS |

  • Maceration

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.51 (0.46, 0.56) | 109 | ≥ 0.50 | PASS |
| F1 | 0.54 (0.48, 0.60) | 109 | ≥ 0.52 | PASS |

  • Biofilm/Slough

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.50 (0.41, 0.59) | 109 | ≥ 0.59 | PASS |
| F1 | 0.56 (0.47, 0.65) | 109 | ≥ 0.64 | PASS |

  • Granulation Tissue

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU | 0.63 (0.57, 0.70) | 109 | ≥ 0.49 | PASS |
| F1 | 0.67 (0.59, 0.75) | 109 | ≥ 0.52 | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus segmentation masks
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert wound annotations
  • Processing: Model inference with probability distribution output
  • Output: Predicted wound segmentation masks
  • Reference standard: Consensus segmentation masks from multiple expert dermatologists
  • Statistical analysis: IoU, Accuracy, F1-score, with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • IoU calculation with Confidence Intervals comparing model predictions to expert consensus masks
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

The model's segmentation performance, evaluated using both IoU and F1-Score, demonstrates successful capability across all tested wound components, with results meeting the predefined Success Criteria established by expert consensus. For the primary category, Wound Bed, the model achieved exceptionally high metrics, with a mean IoU of 0.88 (95% CI: 0.74, 0.90) and a mean F1-Score of 0.92 (95% CI: 0.82, 0.94). Strong performance is also noted for challenging yet crucial categories such as Bone/Cartilage/Tendon and Granulation Tissue, where both IoU and F1-Score are well above their respective criteria. Even for metrics with closer values, such as the IoU for Necrosis, the mean of 0.62 is still above the Success Criterion of 0.58. The one instance where the mean falls below its criterion (0.59) is the IoU for Biofilm/Slough, where the upper bound of the CI (0.59) reaches the criterion. This success across diverse tissues supports the model's robustness and accuracy in clinically relevant segmentation tasks.

Image example of the model output:

To visualize the model's segmentation performance, below are example images showcasing the wound area segmentation:

Biofilm/Slough example: biofilm_esfacelos

Orthopedic Material example: ortopedico

Bias Analysis and Fairness Evaluation​

Objective: Ensure surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

  • Fitzpatrick I-II

| Class | IoU (95% CI) | F1 (95% CI) | Suc. Cr. IoU | Suc. Cr. F1 | Outcome | # samples | # with lesion |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wound Bed | 0.88 (0.72, 0.90) | 0.92 (0.81, 0.94) | ≥ 0.68 | ≥ 0.76 | PASS | 62 | 60 |
| Bone/Cartilage/Tendon | 0.56 (0.50, 0.65) | 0.58 (0.50, 0.69) | ≥ 0.48 | ≥ 0.49 | PASS | 59 | 9 |
| Necrosis | 0.55 (0.47, 0.64) | 0.61 (0.52, 0.70) | ≥ 0.58 | ≥ 0.60 | PASS | 59 | 23 |
| Orthopedic Material | 0.64 (0.48, 0.89) | 0.67 (0.49, 0.94) | ≥ 0.46 | ≥ 0.46 | PASS | 54 | 4 |
| Maceration | 0.54 (0.47, 0.60) | 0.57 (0.49, 0.64) | ≥ 0.50 | ≥ 0.52 | PASS | 60 | 19 |
| Biofilm/Slough | 0.49 (0.34, 0.61) | 0.54 (0.40, 0.67) | ≥ 0.59 | ≥ 0.64 | PASS | 56 | 41 |
| Granulation Tissue | 0.42 (0.28, 0.54) | 0.46 (0.32, 0.59) | ≥ 0.49 | ≥ 0.52 | PASS | 50 | 31 |
  • Fitzpatrick III-IV

| Class | IoU (95% CI) | F1 (95% CI) | Suc. Cr. IoU | Suc. Cr. F1 | Outcome | # samples | # with lesion |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wound Bed | 0.77 (0.70, 0.83) | 0.85 (0.79, 0.90) | ≥ 0.68 | ≥ 0.76 | PASS | 43 | 43 |
| Bone/Cartilage/Tendon | 0.71 (0.62, 0.80) | 0.77 (0.67, 0.86) | ≥ 0.48 | ≥ 0.49 | PASS | 42 | 9 |
| Necrosis | 0.70 (0.63, 0.77) | 0.76 (0.69, 0.83) | ≥ 0.58 | ≥ 0.60 | PASS | 47 | 20 |
| Orthopedic Material | 0.56 (0.50, 0.68) | 0.58 (0.50, 0.72) | ≥ 0.46 | ≥ 0.46 | PASS | 49 | 6 |
| Maceration | 0.49 (0.41, 0.57) | 0.52 (0.43, 0.62) | ≥ 0.50 | ≥ 0.52 | PASS | 43 | 12 |
| Biofilm/Slough | 0.51 (0.38, 0.63) | 0.57 (0.44, 0.69) | ≥ 0.59 | ≥ 0.64 | PASS | 50 | 34 |
| Granulation Tissue | 0.45 (0.31, 0.59) | 0.50 (0.36, 0.64) | ≥ 0.49 | ≥ 0.52 | PASS | 46 | 30 |
  • Fitzpatrick V-VI

| Class | IoU (95% CI) | F1 (95% CI) | Suc. Cr. IoU | Suc. Cr. F1 | Outcome | # samples | # with lesion |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wound Bed | 0.68 (0.24, 0.93) | 0.72 (0.26, 0.96) | ≥ 0.68 | ≥ 0.76 | PASS | 4 | 4 |
| Bone/Cartilage/Tendon | 1.0 (1.0, 1.0) | 1.0 (1.0, 1.0) | ≥ 0.48 | ≥ 0.49 | PASS (not meaningful) | 8 | 0 |
| Necrosis | 0.44 (0.0, 0.87) | 0.47 (0.0, 0.93) | ≥ 0.58 | ≥ 0.60 | PASS | 2 | 2 |
| Orthopedic Material | 0.56 (0.56, 1.0) | 0.60 (0.61, 1.0) | ≥ 0.46 | ≥ 0.46 | PASS | 6 | 1 |
| Maceration | 0.33 (0.04, 0.63) | 0.36 (0.06, 0.69) | ≥ 0.50 | ≥ 0.52 | PASS | 7 | 3 |
| Biofilm/Slough | 0.48 (0.30, 0.85) | 0.62 (0.46, 0.92) | ≥ 0.59 | ≥ 0.64 | PASS | 3 | 3 |
| Granulation Tissue | 0.53 (0.33, 0.69) | 0.56 (0.35, 0.74) | ≥ 0.49 | ≥ 0.52 | PASS | 13 | 7 |

Bias Analysis Conclusion:

The model's segmentation performance, evaluated using both IoU and F1-Score across distinct wound components, consistently demonstrates success in meeting the expert-derived Success Criterion thresholds for all Fitzpatrick scale groups.

For the Fitzpatrick I-II group, highly reliable performance is observed, particularly for the Wound Bed component, where both IoU (0.88) and F1-Score (0.92) are well above their respective Success Criteria (0.68 and 0.76). For the Fitzpatrick III-IV group, the model maintains robust performance, with the Wound Bed IoU (0.77) and F1-Score (0.85) again exceeding the Success Criteria. As in the previous group, all other components show mean and upper CI values comfortably above the established thresholds, although the smaller sample sizes for certain components widen the confidence intervals and warrant cautious interpretation. For the Fitzpatrick V-VI group, despite the limited sample size, the model achieves satisfactory results: the Wound Bed IoU (0.68) meets the Success Criterion exactly, and the other components meet or exceed their criteria, with upper CI values above the thresholds. The small sample sizes in this group lead to wider confidence intervals, indicating greater uncertainty in these estimates. In particular, the metrics for Bone/Cartilage/Tendon are not meaningful due to the absence of lesions in the test samples.

Overall, the model demonstrates equitable performance across all Fitzpatrick skin types. The consistent success in meeting the Success Criteria across all groups indicates that the model is robust and generalizes well across diverse skin tones, effectively mitigating potential biases. However, the limited sample sizes in the Fitzpatrick V-VI group highlight the need for further data collection to enhance confidence in these results.

Erythema Surface Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Erythema Surface Quantification section

This model segments erythematous areas for inflammation extent assessment in various dermatological conditions.

Clinical Significance: Erythema area quantification aids in severity scoring and treatment monitoring.

Data Requirements and Annotation​

Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts traced precise boundaries of affected skin:

  • Polygon tool for accurate edge delineation
  • Separate polygons for non-contiguous patches
  • High spatial precision for reliable area calculation
  • Multi-annotator consensus for boundary agreement

Dataset statistics:

  • Images with erythema segmentation annotations: 3088 images
  • Training set: 90% of the erythema images plus 10% of healthy skin images
  • Validation set: 10% of the erythema images
  • Test set: 10% of the erythema images
  • Conditions: Various dermatological conditions with erythema (e.g., psoriasis, atopic dermatitis, wound healing)

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final layer adapted for a binary output (erythematous vs. non-erythematous skin).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 512 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B2 was selected over larger variants (B3, B4) because erythema segmentation is a binary task (erythematous vs. non-erythematous skin) where the primary visual feature is color change (redness), which does not require the additional model capacity of larger architectures. The larger training dataset (3088 images) also allowed effective training with the more efficient B2 architecture. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B2 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
  • Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with Adam providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with step size 1 epoch, and gamma set so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Sigmoid activation to obtain pixel-wise erythema probabilities
  • Binary classification thresholds to convert probabilities to binary masks.

Performance Results​

Success criteria:

The model must achieve the following segmentation performance on the test set:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU (model) | 0.768 (0.744, 0.79) | 308 | ≥ 0.61 | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus segmentation masks
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert annotations
  • Processing: Model inference with probability distribution output
  • Output: Predicted erythema segmentation masks
  • Reference standard: Consensus masks from expert annotators
  • Statistical analysis: IoU, Accuracy, and F1-score with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • IoU calculation with Confidence Intervals between predicted and reference standard masks
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall IoU of 0.768 (95% CI: 0.744, 0.79) on the test set of 308 images.

Image example of the model output:

To visualize the model's segmentation performance, below is an example image showcasing the erythema segmentation output:

![Erythema Segmentation Example](./images/seg/inflammatory/erythema annulare centrifugum_6.jpg)

Bias Analysis and Fairness Evaluation​

Objective: Ensure erythema surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| IoU Fitzpatrick I-II | 0.746 (0.713, 0.778) | 151 | ≥ 0.61 | PASS |
| IoU Fitzpatrick III-IV | 0.800 (0.767, 0.832) | 125 | ≥ 0.61 | PASS |
| IoU Fitzpatrick V-VI | 0.749 (0.664, 0.824) | 32 | ≥ 0.61 | PASS |

Bias Analysis Conclusion:

The model demonstrated excellent performance across all Fitzpatrick skin type groups, successfully meeting the Success Criterion of IoU ≥ 0.61. A key strength is that the 95% Confidence Interval (CI) for each group lies entirely above the criterion: the lower bound of the 95% CI was 0.713 for Fitzpatrick I-II, 0.767 for Fitzpatrick III-IV, and 0.664 for Fitzpatrick V-VI, all above the Success Criterion. This consistent performance indicates a high degree of generalizability and low bias across the Fitzpatrick spectrum, reinforcing the PASS outcome for all evaluated groups.

Hair Loss Surface Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Hair Loss Surface Quantification section

This model segments areas of hair loss for alopecia severity assessment and treatment monitoring.

Clinical Significance: Hair loss area quantification is critical for alopecia areata severity scoring (SALT score).

Data Requirements and Annotation​

Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Model-specific annotation: Extent annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Medical experts traced precise boundaries of affected skin:

  • Polygon tool for accurate edge delineation
  • Separate polygons for non-contiguous patches
  • High spatial precision for reliable area calculation
  • Multi-annotator consensus for boundary agreement

Dataset statistics:

  • Images with hair loss segmentation annotations: 1826 images
  • Training set: 1026 alopecia images
  • Validation set: 10% of the training images
  • Test set: 800 alopecia images
  • Conditions: Various alopecia types (e.g., alopecia areata, androgenetic alopecia)

Training Methodology​

Architecture: EfficientNet-B2, a convolutional neural network optimized for image classification tasks with a final three-class output layer adapted for classes: background, scalp without hair loss, scalp with hair loss.

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 272 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B2 at 272x272 pixels providing the best balance of performance and computational efficiency. EfficientNet-B2 was selected over larger variants (B3, B4) because hair loss segmentation involves a three-class task with relatively distinct visual features (scalp texture vs. hair-covered areas), which does not require the additional model capacity of larger architectures. The lower input resolution (272x272) was sufficient for this task due to the macro-scale nature of hair loss patterns on the scalp. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with non-significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Backbone architecture: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B2 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
  • Loss function: Combined Cross-entropy loss with logits and Jaccard loss. Associated weights were set based on a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy provided sufficient class balance to avoid the need for weighted loss.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with Adam providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with step size 1 epoch, and gamma set so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Softmax activation to obtain probability distributions for each class
  • Argmax to convert probabilities to class labels
  • Percentage area calculation for hair loss quantification
  • Aggregation of percentages from 4 head views (front, back, left, right)
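A minimal sketch of this post-processing chain, assuming PyTorch logits and an unweighted mean across the four views; the report does not state the exact aggregation, and SALT-style scoring would instead weight each view by its share of total scalp area.

```python
import torch

BACKGROUND, SCALP, HAIR_LOSS = 0, 1, 2  # assumed class-index ordering

@torch.no_grad()
def hair_loss_percentage(logits):
    """Percentage of scalp pixels showing hair loss for one head view.

    logits: (3, H, W) raw outputs of the three-class segmentation head.
    """
    labels = torch.softmax(logits, dim=0).argmax(dim=0)
    scalp_px = ((labels == SCALP) | (labels == HAIR_LOSS)).sum().item()
    loss_px = (labels == HAIR_LOSS).sum().item()
    return 100.0 * loss_px / max(scalp_px, 1)

def aggregate_views(view_percentages):
    """Combine front, back, left, and right views (unweighted mean assumed)."""
    return sum(view_percentages) / len(view_percentages)
```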

Performance Results​

Performance evaluated using Relative Mean Absolute Error (RMAE) compared to expert consensus.

Success criterion: RMAE ≤ 9.6%

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| Model RMAE | 7.08% (5.63%, 8.93%) | 800 | ≤ 9.6% | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with expert reference standard
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert alopecia percentage annotations
  • Processing: Model inference with probability distribution output
  • Output: Predicted hair loss segmentation masks and percentage area calculations
  • Reference standard: Expert percentage area annotations
  • Statistical analysis: RMAE, with Confidence Intervals calculated using bootstrap resampling (2000 iterations), and IoU.
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • RMAE calculation with Confidence Intervals: Relative Mean Absolute Error comparing model predictions to expert consensus
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall RMAE of 7.08% (95% CI: 5.63%, 8.93%) on the test set of 800 samples. The model demonstrated robust hair loss quantification capabilities across diverse skin types and alopecia presentations, indicating its suitability for clinical application in hair loss surface quantification.

Image example of the model output:

To visualize the model's segmentation performance, below is an example image showcasing the hair loss segmentation output:

Hair Loss Segmentation Example

Bias Analysis and Fairness Evaluation​

Objective: Ensure hair loss surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • RMAE calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Success criterion: Consistent RMAE across Fitzpatrick types within acceptable limits

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
| --- | --- | --- | --- | --- |
| RMAE Fitzpatrick I-II | 6.9% (4.85%, 9.66%) | 100 | ≤ 9.6% | PASS |
| RMAE Fitzpatrick III-IV | 7.23% (4.97%, 10.4%) | 86 | ≤ 9.6% | PASS |
| RMAE Fitzpatrick V-VI | 7.46% (3.64%, 12.4%) | 14 | ≤ 9.6% | PASS |

Bias Analysis Conclusion:

The model's performance, assessed by the RMAE across all available Fitzpatrick scale categories, successfully meets the predefined Success Criterion of ≤ 9.6% established by annotator variability. For the Fitzpatrick I-II group, the model achieved a mean RMAE of 6.9% (95% CI: 4.85%, 9.66%), below the Success Criterion. The Fitzpatrick III-IV group also demonstrates strong performance, with a mean RMAE of 7.23% (95% CI: 4.97%, 10.4%). Similarly, the Fitzpatrick V-VI group, despite having the smallest sample size, exhibits a mean RMAE of 7.46% (95% CI: 3.64%, 12.4%), comfortably below the criterion, although the small sample widens the upper CI bound beyond it. Overall, the consistently low mean RMAE values across all Fitzpatrick scale categories demonstrate that the model achieves an error rate competitive with human annotator agreement, indicating minimal bias in prediction error across the spectrum of skin tones.

Nail Lesion Surface Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Nail Lesion Surface Quantification section

This model segments the nail plate and any visible nail lesion for nail lesion assessment.

Clinical Significance: Nail involvement percentage is used in some severity scores such as NAPSI (Nail Psoriasis Severity Index).

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping annotations were used to find 2479 images of hands and feet showing nails with and without visible lesions.

Model-specific annotation: Polygon annotation of the nail plate and affected nail areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Trained annotators labelled images of nails with and without visible lesions following standardized clinical annotation protocols. Annotations consisted of drawing segmentation masks (polygons) covering the nail plate and each affected nail area.

Dataset statistics:

The dataset is split at patient level to avoid data leakage. The training, validation, and test sets contain images from different patients.

  • Images of healthy nails: 634
  • Images of nails with visible lesions: 1845
  • Training set: 1787 images
  • Validation set: 326 images
  • Test set: 366 images
  • Total images: 2479

Training Methodology​

The best segmentation backbone and architecture were determined after a thorough exploration of the existing approaches suitable for the task at hand:

  • Backbones: EfficientNet, MobileNet, ResNet
  • Architecture: UNet, UNet++, FPN

Architecture: UNet segmentation network with a ResNet101 backbone

  • Deep learning model tailored for multi-class image segmentation (background, nail plate, nail lesion)
  • Transfer learning from pre-trained weights (ImageNet dataset)
  • Input size: 480x480 pixels

Training approach:

The model has been trained using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.0001
  • Batch size: 16
  • Training duration: 40 epochs
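A minimal sketch of the selected architecture and optimizer, assuming the segmentation_models_pytorch library; the report does not name the implementation framework.

```python
import torch
import segmentation_models_pytorch as smp

# UNet with an ImageNet-pretrained ResNet101 encoder and three output
# channels: background, nail plate, nail lesion.
model = smp.Unet(
    encoder_name="resnet101",
    encoder_weights="imagenet",
    in_channels=3,
    classes=3,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Batch size 16 over 40 epochs on 480x480 inputs, as described above.
```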

Pre-processing:

  • In the training stage, input images were cropped and/or resized to 480x480 pixels when needed. In the validation and test stage, the inputs were directly resized to 480x480 pixels.
  • Data augmentation: geometric, color, and light augmentations.

Post-processing:

  • Confidence threshold of 0.5 for each channel of the output mask to generate positive and negative pixel-level predictions for each class (background, nail plate, nail lesion).
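A minimal sketch of this thresholding step, assuming the model outputs one probability map per class; names are illustrative.

```python
import torch

def binarize_channels(probs: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Per-channel thresholding of class probability maps.

    probs: (num_classes, H, W) tensor, one channel per class
    (background, nail plate, nail lesion). Returns a 0/1 mask per class.
    """
    return (probs >= threshold).to(torch.uint8)

# Hypothetical usage: probs = model(image)[0] after a per-channel activation;
# masks = binarize_channels(probs); masks[1] is the nail-plate mask.
```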

Performance Results​

Performance is evaluated using Intersection over Union (IoU), also called the Jaccard index, computed for the nail plate and nail lesion classes. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). Success criteria are defined as IoU ≥ 0.80 for overall nail segmentation and IoU ≥ 0.70 for nail lesion segmentation.
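A minimal sketch of the per-image IoU and percentile-bootstrap confidence interval computation described above; the helper names and the percentile method are assumptions, not taken from the report.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU (Jaccard index) between two binary masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                                    # both masks empty
    return np.logical_and(pred, gt).sum() / union

def bootstrap_mean_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-image scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```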

| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| IoU (overall nail segmentation) | 0.8900 (95% CI: [0.8712-0.9061]) | ≥ 0.80 | PASS |
| IoU (nail lesion segmentation) | 0.8195 (95% CI: [0.7934-0.8418]) | ≥ 0.70 | PASS |

Verification and Validation Protocol​

Test Design:

  • Compare predicted and ground truth segmentation masks of nail plates and nail lesions
  • Evaluation across diverse skin tones

Complete Test Protocol:

  • Input: RGB images from test set with nail plate and lesion annotations from trained professionals
  • Processing: Semantic segmentation inference
  • Output: Predicted binary probabilities for each class (nail plate and nail lesion) converted to binary outputs (0/1) using a confidence threshold of 0.50.
  • Ground truth: Expert-annotated segmentation masks
  • Statistical analysis: IoU (nail plate and nail lesion)

Data Analysis Methods:

  • IoU of nail plate and nail lesion masks with a confidence threshold of 0.50

Test Conclusions:

  • The model met all success criteria, demonstrating reliable segmentation of the nail plate and affected nail areas.
  • The model showed robustness across different skin tones and severities, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure nail segmentation performs consistently across demographic subpopulations.

Subpopulation Analysis Protocol:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark)
  • Success criteria: IoU ≥ 0.80 for overall nail segmentation and IoU ≥ 0.70 for lesion segmentation, for all Fitzpatrick types.
| Fitzpatrick Skin Type | No. images | IoU (overall nail segmentation) | IoU (nail lesion segmentation) |
|---|---|---|---|
| I-II | 238 | 0.8787 (95% CI: [0.8568, 0.8997]) | 0.8193 (95% CI: [0.7871, 0.8494]) |
| III-IV | 73 | 0.9045 (95% CI: [0.8665, 0.9366]) | 0.8331 (95% CI: [0.7708, 0.8873]) |
| V-VI | 55 | 0.9214 (95% CI: [0.9012, 0.9392]) | 0.8017 (95% CI: [0.7280, 0.8710]) |

Results Summary:

  • All Fitzpatrick skin types met the IoU success criteria.
  • The model performs consistently across different skin tones, indicating effective generalization.
  • Future data collection should prioritize expanding the dataset for underrepresented skin types to reduce confidence interval variability and improve overall model robustness.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training
  • Pre-training on diverse data to improve generalization

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types, with all success criteria met.

Hypopigmentation or Depigmentation Surface Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Hypopigmentation or Depigmentation Surface Quantification section

This model segments hypopigmented or depigmented areas for vitiligo extent assessment and repigmentation tracking.

Clinical Significance: Depigmentation area is essential for assessing disease severity.

Data Requirements and Annotation​

Model-specific annotation: Polygon annotations for affected areas (R-TF-028-004 Data Annotation Instructions - Visual Signs)

Model-specific annotation: Extent Annotation (R-TF-028-024 Data Annotation Instructions - Non-clinical Data)

Medical experts traced precise boundaries of affected skin:

  • Polygon tool for accurate edge delineation
  • Separate polygons for non-contiguous patches
  • High spatial precision for reliable area calculation
  • Multi-annotator consensus for boundary agreement

Dataset statistics:

  • Images with hypopigmentation segmentation annotations: 970 images
  • Training set: 90% of the hypopigmentation images plus 10% of healthy skin images
  • Validation set: 10% of the hypopigmentation images
  • Test set: 10% of the hypopigmentation images
  • Conditions: Vitiligo and other hypopigmentation disorders

Training Methodology​

Architecture: EfficientNet-B4, a convolutional neural network used here as the backbone of a binary segmentation network (see the segmentation head described below).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 512 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B4 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B4 was selected over smaller variants (B2, B3) because hypopigmentation segmentation requires detection of subtle color variations that can be challenging to distinguish from normal skin tone variations, particularly across different Fitzpatrick skin types. The increased model capacity of B4 was necessary to capture these fine-grained pigmentation differences and ensure robust performance across diverse skin tones. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Segmentation head: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B4 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance, likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
  • Loss function: Combined cross-entropy loss with logits and Jaccard loss, with the associated weights set via a hyperparameter search (a minimal sketch of this combined loss follows this list). Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy already provided sufficient class balance.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.
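A minimal sketch of the combined loss named above, assuming binary masks and sigmoid-based probabilities; the weights w_ce and w_jac stand in for the values found in the hyperparameter search, which are not stated here.

```python
import torch
import torch.nn.functional as F

def combined_seg_loss(logits, target, w_ce=0.5, w_jac=0.5, eps=1e-6):
    """Weighted sum of BCE-with-logits and a soft Jaccard loss.

    logits, target: (N, 1, H, W); target is a binary mask.
    w_ce / w_jac stand in for the weights chosen by hyperparameter search.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = (probs + target - probs * target).sum(dim=(1, 2, 3))
    soft_jaccard = 1.0 - (inter + eps) / (union + eps)
    return w_ce * bce + w_jac * soft_jaccard.mean()
```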

Post-processing:

  • Sigmoid activation to obtain per-pixel probabilities
  • A binary threshold to convert probabilities to binary masks

Performance Results​

Performance is evaluated using IoU against expert consensus masks.

Success criterion: IoU ≥ 0.69 (based on scientific literature and expert consensus, considering inter-observer variability in hypopigmentation segmentation tasks).

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model IoU | 0.712 (0.685, 0.737) | 194 | ≥ 0.69 | PASS |

Verification and Validation Protocol​

Test Design:

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus segmentation masks
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert hypopigmentation segmentation annotations
  • Processing: Model inference with pixel-wise probability output
  • Output: Predicted hypopigmentation segmentation masks
  • Reference standard: Consensus segmentation masks from multiple expert dermatologists
  • Statistical analysis: IoU, Accuracy, F1-score, with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • IoU calculation with confidence intervals: Intersection over Union comparing model segmentation predictions to expert consensus masks
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall IoU of 0.712 (95% CI: 0.685, 0.737) on the test set of 194 samples. The model demonstrated robust segmentation capabilities across diverse skin types and hypopigmentation presentations, indicating its suitability for clinical application in hypopigmentation surface quantification.

Image example of the model output:

To visualize the model's segmentation performance, below is an example image showcasing the hypopigmentation segmentation output:

Hypopigmentation Segmentation Example

Bias Analysis and Fairness Evaluation​

Objective: Ensure hypopigmentation surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU Fitzpatrick I-II | 0.69 (0.63, 0.74) | 64 | ≥ 0.69 (0.56, 0.79) | PASS |
| IoU Fitzpatrick III-IV | 0.72 (0.68, 0.76) | 93 | ≥ 0.69 (0.56, 0.79) | PASS |
| IoU Fitzpatrick V-VI | 0.74 (0.69, 0.779) | 37 | ≥ 0.69 (0.56, 0.79) | PASS |

Bias Analysis Conclusion:

The model's performance, assessed using the Intersection over Union (IoU) metric across all available Fitzpatrick scale categories, meets the predefined Success Criterion established by annotator variability. For the Fitzpatrick I-II group, the model achieved a mean IoU of 0.69 (95% CI: 0.63, 0.74); the lower bound of the model's CI (0.63) is comfortably above the lower bound of the annotators' CI (0.56), satisfying the PASS criterion, and the mean IoU meets the Success Criterion (0.69). Performance is stronger for the Fitzpatrick III-IV group, with a mean IoU of 0.72 (95% CI: 0.68, 0.76); the lower bound of the model's CI (0.68) clearly exceeds the lower bound of the annotators' CI, and the mean IoU surpasses the Success Criterion. The trend continues with the Fitzpatrick V-VI group, which showed the highest mean IoU of 0.74 (95% CI: 0.69, 0.779); the lower bound of the model's CI (0.69) meets the Success Criterion outright and is well above the lower bound of the annotators' CI. Overall, the model demonstrates consistently high segmentation agreement, with a mean IoU that meets or exceeds the expert agreement criterion across all Fitzpatrick scale categories, strongly indicating minimal segmentation quality bias.

Hyperpigmentation Surface Quantification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Hyperpigmentation Surface Quantification section

This model segments hyperpigmented areas.

Clinical Significance: Hyperpigmentation area quantification aids in severity assessment and treatment monitoring.

Data Requirements and Annotation​

Model-specific annotation: The ISIC 2018 Task 1 Challenge dataset annotations were used for hyperpigmentation segmentation. Images in the ISIC dataset containing hyperpigmented lesions were selected, and the polygon annotations corresponding to hyperpigmented areas were converted into binary masks representing affected skin.

Dataset statistics:

  • Images with hyperpigmentation segmentation annotations: 3700 images
  • Training set: 90% of the hyperpigmentation images plus 10% of healthy skin images
  • Validation set: 10% of the hyperpigmentation images
  • Test set: 10% of the hyperpigmentation images
  • Conditions: Various pigmentation disorders

Training Methodology​

Architecture: EfficientNet-B4, a convolutional neural network used here as the backbone of a binary segmentation network (see the segmentation head described below).

  • Transfer learning from pre-trained weights (ImageNet)
  • Input size: RGB images at 512 pixels resolution

Other architectures and resolutions were evaluated during model selection, with EfficientNet-B4 at 512x512 pixels providing the best balance of performance and computational efficiency. EfficientNet-B4 was selected over smaller variants (B2, B3) because hyperpigmentation segmentation requires detection of subtle color variations that can be challenging to distinguish from normal skin tone variations, particularly across different Fitzpatrick skin types. The increased model capacity of B4 was necessary to capture these fine-grained pigmentation differences and ensure robust performance across diverse skin tones. Lower resolutions led to loss of detail, while higher resolutions increased computational cost without significant performance gains. Vision Transformer architectures were also evaluated, showing lower performance likely due to the limited dataset size for this specific task.

Training approach:

  • Pre-processing: Normalization of input images to standard mean and std of the ImageNet dataset. Other normalizations were evaluated during model selection, with ImageNet normalization providing the best performance.
  • Data augmentation: Rotations, mirroring, color jittering, cropping, zoom-out, brightness/contrast adjustments, blur. The global color changes introduced by some augmentations (e.g., color jittering, brightness/contrast adjustments) were carefully tuned to avoid altering the visual appearance. A global augmentation intensity was evaluated to reduce overfitting while preserving the clinical sign characteristics and model performance.
  • Data sampler: Batch size 64, with balanced sampling to ensure uniform class distribution. Larger and smaller batch sizes were evaluated during model selection, with no significant performance differences observed.
  • Class imbalance handling: Balanced sampling strategy to ensure uniform class distribution. Other strategies were evaluated during model selection (e.g., focal loss, weighted cross-entropy loss), with balanced sampling providing the best performance.
  • Segmentation head: A DeepLabV3+ segmentation head was added on top of the EfficientNet-B4 backbone to perform pixel-wise segmentation. Other segmentation heads were evaluated during model selection (e.g., U-Net, FCN), with DeepLabV3+ providing the best performance, likely due to its atrous spatial pyramid pooling module that captures multi-scale context.
  • Loss function: Combined cross-entropy loss with logits and Jaccard loss, with the associated weights set via a hyperparameter search. Weighted cross-entropy loss was evaluated during model selection, with no significant performance differences observed, as the balanced sampling strategy already provided sufficient class balance.
  • Optimizer: AdamW with learning rate 0.001, betas (0.9, 0.999), weight decay 0. SGD and RMSProp optimizers were evaluated during model selection, with AdamW providing the best convergence speed and final performance, likely due to the dataset size and complexity.
  • Training duration: 400 epochs. At this point, the model had fully converged with evaluation metrics on the validation set stabilizing.
  • Learning rate scheduler: StepLR with a step size of 1 epoch and a gamma chosen so that the learning rate decays to 1e-2 of its starting value by the end of training. Other schedulers were evaluated during model selection (e.g., cosine annealing, ReduceLROnPlateau), with no significant performance differences observed.
  • Evaluation metrics: IoU, F1-score, accuracy, sensitivity, and specificity calculated on the validation set after each epoch to monitor training progress and select the best model based on validation IoU.
  • Model freezing: No freezing of layers was applied. Freezing strategies were evaluated during model selection, showing a negative impact on performance likely due to the domain gap between ImageNet and dermatology images.

Post-processing:

  • Sigmoid activation to obtain per-pixel probabilities
  • A binary threshold to convert probabilities to binary masks

Performance Results​

Performance is evaluated using Intersection over Union (IoU) against expert consensus masks.

Success criterion: IoU ≥ 0.82 (performance based on expert inter-observer agreement)

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| Model IoU | 0.825 (0.809, 0.838) | 370 | ≥ 0.82 (0.79, 0.88) | PASS |

Verification and Validation Protocol​

Test Design:

Model-specific annotation: Intensity annotation (R-TF-028-004 Data Annotation Instructions - Visual Signs)

  • Independent test set with multi-annotator reference standard (minimum 3 dermatologists per image)
  • Comparison against expert consensus segmentation masks
  • Evaluation across diverse Fitzpatrick skin types and severity levels

Complete Test Protocol:

  • Input: RGB images from test set with expert hyperpigmentation segmentation annotations
  • Processing: Model inference with pixel-wise probability output
  • Output: Predicted hyperpigmentation segmentation masks
  • Reference standard: Consensus segmentation masks from multiple expert dermatologists
  • Statistical analysis: IoU, Accuracy, F1-score, with Confidence Intervals calculated using bootstrap resampling (2000 iterations).
  • Robustness checks were performed to ensure consistent performance across several image transformations that do not alter the clinical sign appearance and simulate real-world variations (rotations, brightness/contrast adjustments, zoom, and image quality).

Data Analysis Methods:

  • IoU calculation with Confidence Intervals: Intersection over Union comparing model segmentation predictions to expert consensus masks
  • Inter-observer variability measurement
  • Bootstrap resampling (2000 iterations) for 95% confidence intervals

Test Conclusions:

Model performance met the predefined success criterion with an overall IoU of 0.825 (95% CI: 0.809, 0.838) on the test set of 370 samples. The model demonstrated robust segmentation capabilities across diverse skin types and hyperpigmentation presentations, indicating its suitability for clinical application in hyperpigmentation surface quantification.

Image example of the model output:

To visualize the model's segmentation performance, below is an example image showcasing the hyperpigmentation segmentation output:

Hyperpigmentation Segmentation Example

Bias Analysis and Fairness Evaluation​

Objective: Ensure hyperpigmentation surface quantification performs consistently across demographic subpopulations, with special attention to Fitzpatrick skin types.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • IoU calculation per Fitzpatrick type (I-II, III-IV, V-VI)
  • Comparison of model performance vs. expert inter-observer variability per skin type

Bias Mitigation Strategies:

  • Training data balanced across Fitzpatrick types

Results Summary:

| Metric | Result: Mean (95% CI) | # samples | Success Criterion | Outcome |
|---|---|---|---|---|
| IoU Fitzpatrick I-II | 0.822 (0.806, 0.837) | 352 | ≥ 0.82 (0.79, 0.88) | PASS |
| IoU Fitzpatrick III-IV | 0.885 (0.85, 0.917) | 18 | ≥ 0.82 (0.79, 0.88) | PASS |
| IoU Fitzpatrick V-VI | N/A | 0 | N/A | N/A |

Bias Analysis Conclusion:

The model's performance, assessed using the Intersection over Union (IoU) metric, demonstrates robust segmentation capability across the available Fitzpatrick scale categories, exceeding the predefined Success Criterion established by annotator variability. For the Fitzpatrick I-II group, the model achieved a mean IoU of 0.822 (95% CI: 0.806, 0.837); the lower bound of the model's CI (0.806) is above the lower bound of the annotators' CI (0.79), indicating that the model's segmentation agreement is consistently within the range of expert agreement, and the mean IoU (0.822) meets the Success Criterion (0.82). The Fitzpatrick III-IV group, despite a smaller sample size, yielded a stronger mean IoU of 0.885 (95% CI: 0.85, 0.917); the entire CI lies substantially above the lower bound of the annotators' CI, and the mean IoU significantly exceeds the Success Criterion. This performance across both available groups suggests the model exhibits a high level of agreement and minimal bias with respect to segmentation quality across Fitzpatrick scale categories. The analysis for the Fitzpatrick V-VI group is currently precluded by a lack of samples.

Skin Surface Segmentation​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Skin Surface Segmentation section

This model segments skin regions to distinguish skin (including lesions, lips, shallow hair, etc.) from non-skin areas (including clothing, background, dense hair, etc.).

Clinical Significance: Accurate skin segmentation is a prerequisite for calculating lesion percentages relative to visible skin area.

Data Requirements and Annotation​

Compiled dataset: 50366 images divided into two sets:

  • clinical-set, composed of 18034 clinical and dermatoscopic images sourced from the ICD-11 dataset to cover a diverse range of skin conditions, body parts, and skin tones.
  • non-clinical-set, composed of 32332 non-dermatology-related images sourced from the Describable Texture Dataset, HGR, Schmugge, SFA, FSD, TexturePatch, abdominal, fashionpedia and humanparsing datasets.

Model-specific annotation: Extent Annotation (R-TF-028-024 Data Annotation Instructions - Non-clinical Data)

  • Images are annotated with a binary mask where 1 represents skin and 0 represents non-skin regions.
  • Skin regions include healthy skin, lips, ears, nails, tattoos, skin lesions, low hair density areas where skin is visible (excluding scalp hair), skin visible through transparent glass lenses, watermarks placed over skin, skin from multiple persons, and marks or circles painted or drawn over the skin.
  • Non-skin regions include background pixels, clothes, jewellery, glasses, eyes, teeth, eyebrows, scalp, dense hair (head hair, dense beards, etc.), medical material or instruments (forceps, gauze, plasters, etc.), surgical gloves, anonymisation bands, watermarks (if over non-skin), and dermatoscope shadow.
  • The clinical-set contains both clinical and dermatoscopic images.
  • The non-clinical-set contains non-dermatology related images.
  • The clinical-set is annotated by trained personnel with the above specifications. Each image is annotated by a single annotator.
  • Annotations for the non-clinical-set are sourced from their original authors. Original mask annotations are cleaned and standardized to match the above specifications; this standardization includes refining the lips, eyes, teeth, eyebrows, and nostrils. Images with minimal skin coverage were not included in this set.

Dataset statistics:

The dataset is split into training and validation sets at patient level, to avoid data leakage when subject information is available.

  • Images: 50366
  • Train and validation sets contain 41074 and 9292 images respectively.
  • Images can be clinical, dermatoscopic, or non-clinical and span a broad range of skin conditions, body parts, and skin tones.

Training Methodology​

The model architecture and training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different image encoders (e.g., ConvNext and EfficientNet of different sizes), decoders (e.g., UNet, UNet++, and FPN), and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate). The final configuration was chosen as the best trade-off between performance and runtime efficiency.

Architecture:

The model is a binary semantic segmentation network designed to distinguish skin regions from non-skin areas. It uses an encoder-decoder architecture with skip connections.

  • Encoder (Backbone):
    • Model: EfficientNet-B1 (timm-efficientnet-b1)
    • Pre-training: ImageNet weights
  • Decoder:
    • UNet++ decoder architecture with nested skip pathways
    • Progressively upsamples encoder features to reconstruct segmentation masks
    • Dense skip connections enable multi-scale feature fusion
  • Segmentation Head:
    • Final layer producing pixel-wise predictions
    • Output: Single-channel binary mask (1 class)
    • Predicts probability of each pixel being skin vs. non-skin
  • Input/Output Specifications:
    • Input channels: 3 (RGB images)
    • Output channels: 1 (binary segmentation)
    • Input size: 384×384 pixels

The model is implemented with PyTorch and the segmentation_models_pytorch (SMP) library.

Training approach:

The training process employs a three-stage progressive training strategy, starting with a frozen encoder backbone, followed by full model fine-tuning, and concluding with a focused last phase using a refined dataset. The approach incorporates weighted dataset sampling, data augmentation, and mixed-precision training; a simplified sketch of the staged optimizer setup follows the list below.

  • Training Stages:
    • Stage 1 (Frozen Encoder): Trains only the decoder and segmentation head for 14 epochs while keeping the encoder frozen.
    • Stage 2 (Full Fine-tuning): Unfreezes the entire model and trains for 30 epochs with differential learning rates (encoder uses linear decay from base LR to 1×10⁻⁸).
    • Stage 3 (Last-Phase Refinement): Continues training for 40 additional epochs using a refined dataset composition that excludes the datasets likely to introduce noise into the training process.
    • Training images are sampled with a weighted strategy to ensure balanced representation of the clinical images in the learning process.
    • Data Filtering: Excludes images with minimal skin coverage, images with more than one detected person, and manually identified mislabeled samples.
  • Optimization:
    • Optimizer: AdamW Schedule-Free with weight decay (0.001)
    • Learning Rate: 0.005 with 3-epoch warmup (converted to step-based warmup)
    • Differential Learning Rates: Encoder uses linear decay to 1×10⁻⁸; decoder and segmentation head maintain base learning rate
    • Gradient Clipping: Gradients clipped to norm of 0.5
    • Batch Size: 64
    • Mixed Precision: Enabled using automatic mixed precision (AMP)
  • Loss Function:
    • Combined Dice Loss and Binary Cross-Entropy (BCE)
    • Optimizes pixel-wise segmentation accuracy and boundary delineation
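A simplified sketch of the staged setup described above, using the segmentation_models_pytorch API named in this section; plain AdamW stands in for the AdamW Schedule-Free optimizer, and the warmup and learning-rate decay logic are omitted.

```python
import torch
import segmentation_models_pytorch as smp

# Stage 1: freeze the encoder, train decoder + segmentation head only.
model = smp.UnetPlusPlus(
    encoder_name="timm-efficientnet-b1", encoder_weights="imagenet",
    in_channels=3, classes=1,
)
for p in model.encoder.parameters():
    p.requires_grad = False
opt_stage1 = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.005, weight_decay=0.001,
)

# Stage 2: unfreeze everything; the encoder gets its own parameter group so
# its learning rate can be decayed independently of decoder and head.
for p in model.encoder.parameters():
    p.requires_grad = True
opt_stage2 = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 0.005},  # decayed over training
        {"params": model.decoder.parameters()},
        {"params": model.segmentation_head.parameters()},
    ],
    lr=0.005, weight_decay=0.001,
)
```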

Data Pre-processing and Augmentation:

  • Geometric Transformations: random shift, scale, and rotation, random resized crop, zoom-out augmentation, horizontal flip.
  • Light, saturation, contrast, and color augmentations.
  • Image Normalization
  • Image Resizing: Longest side resized to 384 pixels, then padded to 384×384 square with constant padding
  • Batch Size: 64 images per batch

Validation images receive only resizing, padding, and normalization without augmentation.
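A minimal sketch of the resize-and-pad step above, assuming an RGB input and centred constant padding (the report specifies only constant padding to a 384×384 square):

```python
import numpy as np
from PIL import Image

def resize_and_pad(img: Image.Image, size: int = 384) -> np.ndarray:
    """Resize the longest side to `size`, then pad to a square with constant zeros."""
    w, h = img.size
    scale = size / max(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    arr = np.array(img)                                   # assumes RGB input
    canvas = np.zeros((size, size, 3), dtype=np.uint8)    # constant padding
    top = (size - arr.shape[0]) // 2                      # centred placement (assumed)
    left = (size - arr.shape[1]) // 2
    canvas[top:top + arr.shape[0], left:left + arr.shape[1]] = arr
    return canvas
```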

Post-processing:

  • Segmentation masks are generated by thresholding the model's output probabilities
  • Binary predictions: pixels with probability ≥0.5 classified as skin, otherwise as non-skin

Performance Results​

Performance is evaluated using Intersection over Union (IoU) and the F1-score against expert-annotated reference-standard skin masks. Success criteria are set as the average performance of state-of-the-art (SOTA) models.

| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| IoU | 0.97 (0.97-0.97) | ≥ 0.83 | PASS |
| F1-score | 0.98 (0.98-0.98) | ≥ 0.84 | PASS |

Verification and Validation Protocol​

Test Design:

  • 4515 clinical images from the validation split of the clinical-set dataset which contains reliable mask annotations.
  • Images are annotated by trained personnel.
  • These images represent diverse skin conditions, anatomical sites, lighting conditions, and skin tone spectrums.

Complete Test Protocol:

  • Input: Images of skin.
  • Pre-processing: Image resizing to 384x384 pixels and normalization.
  • Processing: Skin segmentation model inference.
  • Output: Predicted binary mask with confidence scores.
  • Reference standard: Expert-annotated binary mask.
  • Statistical analysis: IoU and F1-score.

Data Analysis Methods:

  • IoU.
  • F1-score.
  • Binary mask visualization.

Test Conclusions:

  • The model met all success criteria, demonstrating reliable skin segmentation.
  • The model demonstrates non-inferiority with respect to SOTA models.
  • The model's performance is within acceptable limits.
  • The model showed robustness across different imaging conditions, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure accurate skin identification across the full Fitzpatrick spectrum.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Tone Analysis:

  • Performance stratified by Fitzpatrick skin tones: I-II (light), III-IV (medium), V-VI (dark).
  • Metrics evaluated: IoU and F1-score.
  • Fitzpatrick success criteria: IoU ≥ 0.83; F1-score ≥ 0.84.
| Subpopulation | Num. training images | Num. validation images | IoU | F1-score | Outcome |
|---|---|---|---|---|---|
| Fitzpatrick I-II | 21228 | 1849 | 0.96 (0.96-0.97) | 0.98 (0.98-0.98) | PASS |
| Fitzpatrick III-IV | 13979 | 1798 | 0.98 (0.97-0.98) | 0.99 (0.99-0.99) | PASS |
| Fitzpatrick V-VI | 5867 | 868 | 0.97 (0.96-0.97) | 0.98 (0.98-0.98) | PASS |

Results Summary:

  • The model met all success criteria, demonstrating reliable skin surface segmentation.
  • The model presents consistent robustness across all skin tone subpopulations.
  • The model demonstrates non-inferiority with respect to SOTA models.
  • The model's performance is within acceptable limits.

Bias Mitigation Strategies:

  • Image augmentation including geometric, contrast, saturation, and color augmentations.
  • Weighted dataset sampling to ensure balanced representation of image conditions in the learning process.
  • Pre-training on diverse data to improve generalization.
  • Three-stage progressive training strategy to adapt the pre-trained encoder to the segmentation task.

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across different subpopulations.
  • The model met all success criteria, demonstrating reliable skin surface segmentation.

Follicular and Inflammatory Pattern Identification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Follicular and Inflammatory Pattern Identification section

This model identifies three hidradenitis suppurativa (HS) patterns corresponding to the three phenotypes defined by the Martorell classification system (follicular, inflammatory, mixed).

Clinical Significance: Essential for diagnosing and characterizing follicular and inflammatory dermatoses and differentiating HS phenotypes.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping annotations were used to find 1259 images of hidradenitis suppurativa and 504 images of clear skin with no visible HS phenotype patterns.

Model-specific annotation: Each image was categorized as either one of the three possible HS phenotypes or the "no phenotype" supporting class. The annotation procedure for ordinal and categorical classification tasks is defined in R-TF-028-004 Data Annotation Instructions - Visual Signs.

Dataset statistics:

The dataset is split at patient level to avoid data leakage. The training, validation, and test sets contain images from different patients.

  • Images of follicular phenotype: 271
  • Images of inflammatory phenotype: 504
  • Images of mixed phenotype: 484
  • Images of clear skin: 504
  • Training set: 1248 images
  • Validation set: 257 images
  • Test set: 258 images
  • Total images: 1763

Training Methodology​

The best classification backbone and architecture were determined after a thorough evaluation of several backbones suitable for the task at hand: EfficientNet, MobileNet, ResNet, and ConvNext.

Architecture: ConvNext V2 (base size)

  • Deep learning model tailored for multi-class classification (follicular, mixed, inflammatory)
  • Transfer learning from pre-trained weights (ImageNet dataset)
  • Input size: 384x384 pixels

Given the complexity of the "Mixed" phenotype class, which includes both "Follicular" and "Inflammatory" patterns, the model was built for 2-class multi-label classification: it predicts whether follicular and/or inflammatory patterns are present or absent in the image. The probabilities are converted to binary outputs (1, positive or present; 0, negative or absent) using a probability threshold (t): a given pattern is considered present if its corresponding probability is greater than or equal to t.

Based on these predicted binary outputs, the final class can be derived:

  • [0, 0] --> No phenotype visible
  • [1, 0] --> Follicular pattern
  • [0, 1] --> Inflammatory pattern
  • [1, 1] --> Mixed phenotype
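This mapping can be transcribed directly; the function below is illustrative.

```python
def phenotype_from_binary(follicular: int, inflammatory: int) -> str:
    """Map the two binary pattern outputs to the final phenotype class."""
    return {
        (0, 0): "No phenotype visible",
        (1, 0): "Follicular pattern",
        (0, 1): "Inflammatory pattern",
        (1, 1): "Mixed phenotype",
    }[(follicular, inflammatory)]
```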

Training approach:

The model has been trained using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.0001 and one-cycle learning rate scheduling for faster convergence
  • Batch size: 32
  • Training duration: 40 epochs

Pre-processing:

  • In the training stage, input images were resized to 384x384 pixels via random cropping and resizing. In the validation and test stage, the inputs were directly resized to 384x384 pixels.
  • Data augmentation: geometric, color, and light augmentations.

Performance Results​

Performance is evaluated using Balanced Accuracy (BACC) and average F1 score. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). Success criteria are defined as BACC ≥ 0.65 and F1 ≥ 0.65. A threshold search was conducted on the validation set to obtain the best value, and the following test results were obtained using that threshold (0.60).

| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| BACC | 0.6837 (95% CI: [0.6287-0.7398]) | ≥ 0.65 | PASS |
| Average F1 score | 0.6976 (95% CI: [0.6457-0.7526]) | ≥ 0.65 | PASS |
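A minimal sketch of the validation-set threshold search described above, assuming a shared threshold for both patterns and scikit-learn's balanced_accuracy_score; the grid and class encoding are illustrative.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def derive_class(probs: np.ndarray, t: float) -> np.ndarray:
    """probs: (N, 2) follicular/inflammatory probabilities -> class ids 0-3.

    Encoding: 0 = no phenotype, 1 = follicular, 2 = inflammatory, 3 = mixed.
    """
    binary = (probs >= t).astype(int)
    return binary[:, 0] + 2 * binary[:, 1]

def search_threshold(probs, y_true, grid=np.arange(0.05, 0.96, 0.05)):
    """Pick the threshold maximizing balanced accuracy on the validation set."""
    scores = [balanced_accuracy_score(y_true, derive_class(probs, t)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```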

Verification and Validation Protocol​

Test Design:

  • Compare predicted and ground truth labels on a separate, unseen set of images.
  • Use a binary threshold t derived from a threshold search on the validation data.
  • Evaluation across diverse skin tones

Complete Test Protocol:

  • Input: RGB images of HS and skin with no visible lesions, annotated by trained professionals
  • Processing: Multi-label classification inference
  • Output: Predicted binary probabilities for each pattern (follicular and inflammatory) converted to binary outputs (0/1) using the confidence threshold selected on the validation set (see Performance Results), and finally mapped to a multi-class classification output (4 classes: no phenotype, follicular, inflammatory, mixed).
  • Ground truth: Expert-annotated labels
  • Statistical analysis: Balanced accuracy and average F1 score

Data Analysis Methods:

  • Balanced accuracy and average F1 score

Test Conclusions:

  • The model met all success criteria, demonstrating reliable identification of HS patterns according to the Martorell phenotypes.
  • The model showed robustness across different skin tones and severities, indicating generalizability.

Bias Analysis and Fairness Evaluation​

Objective: Ensure phenotype identification works consistently across demographic subpopulations.

Subpopulation Analysis Protocol:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark)
  • Success criterion: Balanced accuracy > 0.65 and average F1 score > 0.65, for all Fitzpatrick skin type groups.
| Skin type | Number of images | Balanced Accuracy | Average F1-score |
|---|---|---|---|
| I-II | 134 | 0.6536 (95% CI: [0.5700-0.7399]) | 0.6632 (95% CI: [0.5836-0.7428]) |
| III-IV | 105 | 0.6954 (95% CI: [0.5971-0.7799]) | 0.6976 (95% CI: [0.6038-0.7787]) |
| V-VI | 19 | 0.6840 (95% CI: [0.5000-1.0000]) | 0.9498 (95% CI: [0.8421-1.0000]) |

Results Summary:

  • All Fitzpatrick skin types met the success criteria.
  • Despite the current imbalance in skin tone representation, the model performs consistently across skin types, indicating effective generalization.
  • Future data collection should prioritize expanding the dataset for underrepresented skin types to reduce confidence interval variability and improve overall model robustness.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training
  • Pre-training on diverse data to improve generalization

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types, with all success criteria met.

Inflammatory Nodular Lesion Pattern Identification​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Inflammatory Pattern Identification section

This model identifies the Hurley stage and inflammatory pattern of inflammatory dermatological conditions.

Clinical Significance: Inflammatory affection categorization is essential for treatment planning and disease monitoring.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping, subset of 188 images from Manises-HS

Model-specific annotation: Image Categorization (R-TF-028-004 Data Annotation Instructions - Visual Signs)

A medical expert specialized in inflammatory nodular lesions categorized the images with:

  1. Hurley Stage Classification: One of four categories, comprising the three Hurley stages and a Clear category corresponding to no inflammatory visual signs.
  2. Inflammatory Activity Classification: One of two categories, inflammatory or non-inflammatory.

Dataset statistics:

The dataset is split at patient level to avoid data leakage. The training and validation sets contain images from different patients.

  • Images: 188
  • Number of subjects: 188
  • Training set: 150 images, of which 148 contain valid Hurley annotations and 136 contain valid inflammatory activity annotations
  • Validation set: 38 images, of which 37 contain valid Hurley annotations and 36 contain valid inflammatory activity annotations

Training Methodology​

The model architecture and training hyperparameters were selected after a systematic hyperparameter tuning process. We compared different image encoders (e.g., ConvNext and EfficientNet of different sizes) and evaluated multiple data hyperparameters (e.g., input resolutions, augmentation strategies) and optimization configurations (e.g., batch size, learning rate, metric learning). The final configuration was chosen as the best trade-off between performance and runtime efficiency.

Architecture:

The model is a multi-task neural network designed to predict Hurley stages and inflammatory activity simultaneously, while also generating embeddings for metric learning. It uses a shared backbone and common projection head, branching into specific heads for each task.

  • Backbone (Encoder):
    • Model: ConvNext Small, pre-trained on the ImageNet dataset.
    • Regularization: dropout and drop path.
  • Common Projection Head:
    • A common processing block that maps encoder features to a shared latent space of 256 features.
    • Consists of a GELU activation, Dropout, and a Linear layer.
  • Task-Specific Heads: The model splits into two distinct branches, one for Hurley and one for Inflammatory Activity. Each branch receives the 256-dimensional output from the Common Projection Head and contains two sub-heads:
    • Classification Head:
      • A dedicated block (GELU, Dropout, Linear)
      • Output size: 4 for Hurley and 2 for Inflammatory Activity.
    • Metric Embedding Head:
      • A multi-layer perceptron (two sequential blocks of GELU, Dropout, and Linear layers) that outputs feature embeddings.
      • Output size: 256 features.
  • Weight Initialization:
    • Linear Layers: Xavier Normal initialization.
    • Biases: Initialized to zero.

The model is implemented with PyTorch and the Python timm library.
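A condensed, hypothetical sketch of this architecture with timm; the 256-dimensional latent space and the 4/2-way outputs follow the text, while dropout and drop-path rates are assumptions.

```python
import torch
import torch.nn as nn
import timm

def head_block(in_dim: int, out_dim: int, p: float = 0.2) -> nn.Sequential:
    """GELU -> Dropout -> Linear, the block pattern described above."""
    return nn.Sequential(nn.GELU(), nn.Dropout(p), nn.Linear(in_dim, out_dim))

class HurleyMultiTaskNet(nn.Module):
    def __init__(self, latent: int = 256, p: float = 0.2):
        super().__init__()
        # Shared ConvNext Small backbone returning pooled features.
        self.backbone = timm.create_model(
            "convnext_small", pretrained=True, num_classes=0,
            drop_rate=p, drop_path_rate=0.1,
        )
        feat = self.backbone.num_features
        self.projection = head_block(feat, latent, p)   # common projection head
        self.hurley_cls = head_block(latent, 4, p)      # 3 Hurley stages + Clear
        self.activity_cls = head_block(latent, 2, p)    # inflammatory vs. not
        # Metric embedding heads: two sequential blocks each.
        self.hurley_emb = nn.Sequential(head_block(latent, latent, p),
                                        head_block(latent, latent, p))
        self.activity_emb = nn.Sequential(head_block(latent, latent, p),
                                          head_block(latent, latent, p))
        # Xavier Normal weights and zero biases for the new linear layers only
        # (the pretrained backbone is left untouched).
        for head in [self.projection, self.hurley_cls, self.activity_cls,
                     self.hurley_emb, self.activity_emb]:
            for m in head.modules():
                if isinstance(m, nn.Linear):
                    nn.init.xavier_normal_(m.weight)
                    nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> dict:
        z = self.projection(self.backbone(x))
        return {
            "hurley_logits": self.hurley_cls(z),
            "activity_logits": self.activity_cls(z),
            "hurley_embedding": self.hurley_emb(z),
            "activity_embedding": self.activity_emb(z),
        }
```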

Training approach:

The training process employs a multi-task learning strategy, optimizing for both classification accuracy and embedding quality. It utilizes a two-stage approach, starting with a frozen backbone followed by full model fine-tuning. It also incorporates data augmentation and mixed-precision training.

  • Training Stages:
    • Stage 1 (Frozen Backbone): Trains only the projection and task-specific heads for 15 epochs.
    • Stage 2 (Fine-tuning): Trains the entire model for 30 epochs.
  • Optimization:
    • Optimizer: AdamW Schedule-Free with weight decay (0.01).
    • Base LR: 0.0025
    • Learning Rate: Includes a 4-epoch warmup. During fine-tuning, the backbone learning rate is scaled down (0.05x) relative to the heads.
    • Gradient Clipping: Gradients are clipped to a norm of 0.5.
    • Precision: Mixed precision training using BFloat16.
  • Loss Functions:
    • Classification: Cross-Entropy Loss, weighted to handle class imbalance.
    • Metric Learning: NTXentLoss combined with a Batch Easy-Hard Miner (selecting easy positives and hard negatives).
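The metric-learning objective can be set up with the pytorch-metric-learning library, which provides both NTXentLoss and BatchEasyHardMiner; the temperature and the toy tensors below are illustrative, not values from the report.

```python
import torch
from pytorch_metric_learning.losses import NTXentLoss
from pytorch_metric_learning.miners import BatchEasyHardMiner

loss_fn = NTXentLoss(temperature=0.07)          # temperature is an assumption
miner = BatchEasyHardMiner(
    pos_strategy=BatchEasyHardMiner.EASY,       # select easy positives
    neg_strategy=BatchEasyHardMiner.HARD,       # and hard negatives
)

embeddings = torch.randn(32, 256)               # metric-head outputs (toy data)
labels = torch.randint(0, 4, (32,))             # Hurley class ids (toy data)
pairs = miner(embeddings, labels)
metric_loss = loss_fn(embeddings, labels, pairs)
```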

Pre-processing:

  • Augmentation: Includes geometric and color transformations.
  • Regularization: MixUp is applied to inputs and labels.
  • Input: Images are resized to 384x384 with a batch size of 32.

Post-processing:

  • Classification probabilities are computed by applying the softmax operation over the classification logits.
  • Classification categories are selected as the ones with the highest probability.

Performance Results​

Performance is evaluated using accuracy and Mean Absolute Error (MAE) for Hurley staging, and accuracy and AUC (ROC) for inflammatory activity. Success criteria are set as accuracy ≥ 0.40 and MAE ≤ 1 for Hurley staging, and accuracy ≥ 0.70 and AUC (ROC) ≥ 0.70 for inflammatory activity classification.

| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| Hurley Stage Accuracy | 0.63 (0.46-0.77) | ≥ 0.40 | PASS |
| Hurley MAE | 0.49 (0.29-0.77) | ≤ 1 | PASS |
| Inflammatory Activity Accuracy | 0.71 (0.57-0.86) | ≥ 0.70 | PASS |
| Inflammatory Activity AUC (ROC) | 0.71 (0.49-0.89) | ≥ 0.70 | PASS |

Verification and Validation Protocol​

Test Design:

  • Subset of 35 images with both Hurley stage and inflammatory activity annotations.
  • Expert-annotator labels.
  • Evaluation across diverse skin tones.

Complete Test Protocol:

  • Input: RGB images from validation set with expert annotations
  • Processing: Image classification inference
  • Output: Classification probabilities and predicted categories
  • Reference standard: Expert-annotated categories
  • Statistical analysis: Accuracy, MAE, AUC (ROC)

Data Analysis Methods:

  • Confusion matrix
  • Accuracy, AUC (ROC), MAE

Test Conclusions:

  • The model's Hurley stage prediction meets all the success criteria, demonstrating reliable performance.
  • The model's Hurley stage prediction is within acceptable limits.
  • The model's inflammatory activity prediction's mean values meet all the success criteria, demonstrating sufficient performance.
  • The model's inflammatory activity prediction's confidence intervals do not meet the success criteria, suggesting the need for further data collection to improve the model learning and evaluation.

Bias Analysis and Fairness Evaluation​

Objective: Ensure Hurley stage and inflammatory activity classification performs consistently across demographic subpopulations.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Type Analysis:

  • Performance stratified by Fitzpatrick skin types: I-II (light), III-IV (medium), V-VI (dark).
  • Success criterion: Accuracy ≥ 0.40 and MAE ≤ 1 for Hurley staging; Accuracy ≥ 0.70 and AUC (ROC) ≥ 0.70 for inflammatory activity.
  • This evaluation includes an extra set of 22 images created semi-automatically by translating the main evaluation set to darker Fitzpatrick skin types with the Nano Banana AI tool. These images preserve the inflammatory nodular lesions but with a darker skin tone, allowing evaluation of the model's performance on Fitzpatrick V-VI skin types.
| Subpopulation | Num. training images | Num. validation images | Hurley Acc | Hurley MAE | Pattern Acc | Pattern AUC (ROC) | Outcome |
|---|---|---|---|---|---|---|---|
| Fitzpatrick I-II | 85 | 20 | 0.60 (0.40-0.80) | 0.54 (0.25-0.90) | 0.70 (0.50-0.90) | 0.72 (0.40-0.93) | PASS |
| Fitzpatrick III-IV | 68 | 15 | 0.67 (0.40-0.87) | 0.40 (0.13-0.67) | 0.74 (0.53-0.93) | 0.71 (0.33-0.96) | PASS |
| Fitzpatrick V-VI | 0 | 22 | 0.45 (0.23-0.64) | 0.82 (0.45-1.23) | 0.77 (0.59-0.95) | 0.72 (0.53-0.90) | PASS |

Results Summary:

  • Hurley staging met all the success criteria (mean values) across Fitzpatrick I-VI levels.
  • Hurley staging confidence intervals are within the acceptable limits for all subpopulations except Fitzpatrick V-VI, whose intervals fall outside the success criteria.
  • Inflammatory activity identification mean values met all the success criteria across Fitzpatrick I-VI levels.
  • Inflammatory activity identification confidence intervals exceed the acceptable limits, presumably due to the small number of images in the validation set.
  • Future data collection and annotation should prioritize expanding the dataset to ensure a sufficient number of images for all subpopulations, reduce confidence interval variability, and improve model robustness for edge cases.

Bias Mitigation Strategies:

  • Image augmentation including color, geometric and MixUp augmentations during training.
  • Class-balancing to ensure equal representation of all classes.
  • Use of metric learning to improve the model's ability to generalize to new data.
  • Pre-training on diverse data to improve generalization
  • Two-stage training to fit the model to the new data while benefiting from the image encoder pre-training.

Bias Analysis Conclusion:

  • The model demonstrated consistent performance across Fitzpatrick skin types with all success criteria met.
  • Inflammatory activity identification and the Fitzpatrick V-VI subpopulation presented confidence intervals outside the acceptable limits, highlighting the need for further data collection to train and validate the model with higher precision, especially for the Fitzpatrick V-VI subpopulation.

Dermatology Image Quality Assessment (DIQA)​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - DIQA section

This model assesses image quality to filter out images unsuitable for clinical analysis, ensuring reliable downstream model performance.

Clinical Significance: DIQA is critical for patient safety by preventing low-quality images from being analyzed, which could lead to incorrect clinical assessments.

Data Requirements and Annotation​

Data Requirements: A dermatology image subset was selected from the main dataset and annotated for image quality assessment (IQA), as described in R-TF-028-004 Data Annotation Instructions - Non-clinical data. This IQA-specific dataset was then expanded with other non-clinical image quality assessment datasets: CID2013, TID2013, CID:IQ, LIVE-ItW, NITSIQA, KonIQ-10k, KADID-10k, GFIQA-20k, SPAQ, and BIQ2021.

Dataset statistics:

The dataset has a total size of 85561 images.

  • Images with artificial distortions: 18019
  • Images with real distortions: 67542
  • Non-dermatology images with quality ratings: 69058
  • Dermatology images with quality ratings: 16503

Training Methodology​

Architecture: EfficientNet-B0 pretrained on ImageNet. The default classification head was replaced with a regression head specifically designed for this IQA task.

Training approach:

  • Score regression: the predicted output is a single scalar value that represents perceived visual quality.
  • Loss function: Mean Squared Error (MSE). For more stable training, the model output is compared to the normalized ground-truth score (a minimal sketch follows this list).
  • Data augmentation: The usual image augmentation methods (e.g. color jittering, rotation, etc.) may break the relationship between the images and their corresponding quality scores, so we used a low-augmentation setting, with only horizontal flips and slight random crops. The goal is to introduce some variability without affecting the image-score relationship.
  • Training duration: 30 epochs with learning rate scheduling (cosine annealing).
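A minimal sketch of this regression setup, assuming timm's efficientnet_b0 and min-max normalization of the Mean Opinion Scores to [0, 1] (the exact normalization is not stated here):

```python
import torch
import torch.nn.functional as F
import timm

# EfficientNet-B0 backbone with a single-output regression head
# replacing the default classifier.
model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=1)

def regression_loss(images, mos, mos_min=0.0, mos_max=10.0):
    """MSE against min-max normalized Mean Opinion Scores (normalization assumed)."""
    target = (mos - mos_min) / (mos_max - mos_min)
    pred = model(images).squeeze(1)       # one scalar quality value per image
    return F.mse_loss(pred, target)
```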

Performance Results​

Success criteria:

  • Pearson correlation (PLCC) ≥ 0.70
  • Spearman correlation (SROCC) ≥ 0.70
| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| Pearson correlation | 0.8959 (95% CI: [0.8910-0.9002]) | ≥ 0.70 | PASS |
| Spearman correlation | 0.9030 (95% CI: [0.8982-0.9071]) | ≥ 0.70 | PASS |
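PLCC and SROCC can be computed with scipy.stats; the confidence intervals above would then be obtained by bootstrapping these statistics, as described under Data Analysis Methods.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_metrics(pred: np.ndarray, mos: np.ndarray):
    """PLCC and SROCC between predicted quality scores and Mean Opinion Scores."""
    plcc, _ = pearsonr(pred, mos)
    srocc, _ = spearmanr(pred, mos)
    return plcc, srocc
```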

Verification and Validation Protocol​

Test Design:

  • Test set with expert quality annotations across quality spectrum and acquisition settings

Complete Test Protocol:

  • Input: Images with varying quality levels
  • Processing: DIQA model inference
  • Output: The scalar output is rescaled to the [0, 10] quality score range.
  • Ground truth: Mean Opinion Scores (MOS) from annotation specialists
  • Statistical analysis: Pearson and Spearman correlation

Data Analysis Methods:

  • Pearson and Spearman correlation metrics are computed with confidence intervals using the bootstrapping method.

Test Conclusions:

  • The model met all success criteria, demonstrating excellent performance and strong correlation with the quality ratings of a diverse sample of human observers.

Bias Analysis and Fairness Evaluation​

To assess model and data bias, we selected the dermatology image subset of the dataset used for DIQA training and evaluated the model's predictions across Fitzpatrick skin types (FST).

Objective: Ensure DIQA performs consistently across populations without unfairly rejecting valid images.

Subpopulation Analysis Protocol:

1. Skin Type Analysis:

  • Consistency for different Fitzpatrick skin types
  • Ensure darker skin images are not systematically rated as lower quality

Bias Mitigation Strategies:

  • Training on diverse imaging conditions and device types
  • Balanced dataset across Fitzpatrick types, ensuring all distortions have occurrences on all demographic groups.

Results Summary:

| Fitzpatrick skin type | Num. images | PLCC | SROCC |
|---|---|---|---|
| I-II | 913 | 0.7551 (95% CI: [0.7192-0.7888]) | 0.7640 (95% CI: [0.7310-0.7954]) |
| III-IV | 659 | 0.6884 (95% CI: [0.6435-0.7316]) | 0.7065 (95% CI: [0.6620-0.7500]) |
| V-VI | 182 | 0.4736 (95% CI: [0.3634-0.5811]) | 0.4649 (95% CI: [0.3470-0.5783]) |

Bias Analysis Conclusion: The model shows moderate to strong correlation metrics on darker skin tones, with some bias towards lighter skin tones. A closer inspection of these results revealed that most FST IV-VI images corresponded to bad, poor and fair quality samples, for which the model predicts higher quality scores, hence the lower correlation metrics. The moderate correlation, however, demonstrates that the model is capable of estimating visual quality in such groups.

Domain Validation​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Domain Validation section

This model verifies that input images are within the validated domain (dermatological images, including clinical and dermoscopic) vs. non-skin images, preventing clinical models from processing invalid inputs.

Clinical Significance: Critical safety function preventing misuse and ensuring clinical models only analyze appropriate dermatological images.

Data Requirements and Annotation​

Data Requirements:

A large subset of the dataset was reviewed and annotated to obtain domain-related labels, as described in R-TF-028-004 Data Annotation Instructions - Non-clinical data. Due to the heterogeneous nature of the dataset, it was possible to obtain labels of all three possible image types (clinical, dermoscopy, non-dermatology). As most images in the dataset are clinical or dermoscopic, the non-dermatology subset was expanded with external open image datasets, to account for as many examples of non-dermatology concepts as possible, such as:

  • Paintings, posters, sketches, and screenshots;
  • Retinal, MRI, colonoscopy, histology, and ultrasound images;
  • Everyday objects, pets, and wildlife.

Dataset statistics: The final curated dataset presented the following distribution:

| Label | No. images |
|---|---|
| Clinical | 588008 |
| Dermoscopy | 125907 |
| Non-dermatology | 163425 |
| Total | 877340 |

Training Methodology​

Architecture: EfficientNet-B0

  • We used a model pretrained on ImageNet, discarding the original classification head and creating a new one for this three-class classification problem.
  • Input size: 224x224x3 pixels (RGB images)

Training approach:

  • Multi-class classification (clinical, dermoscopy, non-dermatology image)
  • Loss function: Multi-class cross-entropy
  • Class balancing: the dermoscopy and non-dermatology images are oversampled to balance the three classes (see the sketch after this list)
  • Training duration:
    • 5 epochs with frozen backbone to train the classification head only
    • 5 epochs with the entire model unfrozen
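One common way to implement this oversampling (a sketch, not the report's code) is PyTorch's WeightedRandomSampler; train_labels and dataset are assumed to exist.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_labels: one label per image, e.g. 0 = clinical, 1 = dermoscopy,
# 2 = non-dermatology (assumed to exist alongside `dataset`).
labels = np.asarray(train_labels)
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]      # rarer classes drawn more often

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,                            # enables oversampling
)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # batch size illustrative
```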

Performance Results​

Success criteria:

  • Sensitivity ≥ 0.95 (correctly identify valid dermatological images)
  • Specificity ≥ 0.99 (correctly reject non-dermatological images)
  • False positive rate ≤ 1% (minimize incorrect rejections)
| Metric | Value | Criterion | Outcome |
|---|---|---|---|
| Non-dermatology precision | 0.9855 (95% CI: [0.9828-0.9882]) | ≥ 0.95 | PASS |
| Non-dermatology recall | 0.9978 (95% CI: [0.9967-0.9988]) | ≥ 0.90 | PASS |
| Clinical F1-score | 0.9975 (95% CI: [0.9973-0.9978]) | ≥ 0.90 | PASS |
| Dermoscopic F1-score | 0.9950 (95% CI: [0.9942-0.9957]) | ≥ 0.90 | PASS |
| Accuracy | 0.9965 (95% CI: [0.9961-0.9969]) | ≥ 0.95 | PASS |
| Macro avg F1-score | 0.9947 (95% CI: [0.9940-0.9953]) | ≥ 0.90 | PASS |
| Weighted avg F1-score | 0.9965 (95% CI: [0.9961-0.9969]) | ≥ 0.90 | PASS |

Verification and Validation Protocol​

Test Design:

  • A set of 81008 images, including clinical, dermoscopy, and non-dermatology images.
  • The dermatology image subset is heterogeneous in terms of sex, age, and skin type.

Complete Test Protocol:

  • Input: Mixed dataset of in-domain and out-of-domain images
  • Processing: Multi-class classification
  • Output: Probability vector
  • Ground truth: Expert-confirmed domain labels (clinical, dermoscopic, non-dermatology)
  • Statistical analysis: Precision, recall, F1-score, and accuracy, with confidence intervals (bootstrap method).

Data Analysis Methods:

  • Precision, recall, F1-score, accuracy (with bootstrap confidence intervals)

Test Conclusions:

  • The model met all success criteria, demonstrating excellent performance for all classes (clinical, dermoscopic, non-dermatology).
  • Due to the simplicity of the task, a very small and lightweight model (EfficientNet-B0) is sufficient to learn to separate all three classes.

Bias Analysis and Fairness Evaluation​

To assess model and data bias, we filtered the previously mentioned set of 81,008 images to keep only the clinical and dermoscopic images. The model's predictions were then evaluated across sex, age, and Fitzpatrick skin type (FST). The test set was filtered because non-dermatology images cannot be categorized by sex, age, or skin type.

Objective: Ensure domain validation does not unfairly reject valid dermatological images from any subpopulation.

Subpopulation Analysis Protocol (a stratified-metrics sketch follows the protocol):

1. Fitzpatrick Skin Type Analysis:

  • Equal F1-score across all Fitzpatrick types
  • Success criterion: No correlation between skin type and false rejection

2. Sex and Age Analysis:

  • Consistent performance across sex and age groups
  • Success criterion: No sex or age-specific rejection bias
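A minimal sketch of how these stratified metrics can be computed from a per-image results table; the column names (`fst_group`, `y_true`, `y_pred`) are hypothetical.

```python
# Per-subpopulation metrics from a results DataFrame (sketch).
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def stratified_metrics(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    rows = []
    for group, sub in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(sub),
            "accuracy": accuracy_score(sub["y_true"], sub["y_pred"]),
            "weighted_f1": f1_score(sub["y_true"], sub["y_pred"], average="weighted"),
        })
    return pd.DataFrame(rows)

# The same call is repeated for each stratification, e.g.:
# stratified_metrics(results, "fst_group"), stratified_metrics(results, "sex"),
# stratified_metrics(results, "age_group")
```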

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training.
  • Pre-training on diverse data to improve generalization.

Results Summary:

By Fitzpatrick skin type:

| Group | Clinical F1-score | Dermoscopic F1-score | Accuracy | Weighted avg F1-score |
|---|---|---|---|---|
| I-II | 0.9972 (95% CI: [0.9968-0.9976]) | 0.9954 (95% CI: [0.9947-0.9961]) | 0.9962 (95% CI: [0.9956-0.9967]) | 0.9966 (95% CI: [0.9961-0.9971]) |
| III-IV | 0.9990 (95% CI: [0.9987-0.9993]) | 0.9915 (95% CI: [0.9875-0.9952]) | 0.9982 (95% CI: [0.9976-0.9987]) | 0.9987 (95% CI: [0.9983-0.9991]) |
| V-VI | 0.9857 (95% CI: [0.9809-0.9902]) | 0.7526 (95% CI: [0.5600-0.9092]) | 0.9714 (95% CI: [0.9622-0.9801]) | 0.9838 (95% CI: [0.9784-0.9887]) |

By age group:

| Group | Clinical F1-score | Dermoscopic F1-score | Accuracy | Weighted avg F1-score |
|---|---|---|---|---|
| Newborn | 0.9790 (95% CI: [1.0000-1.0000]) | 0.9990 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) |
| Child | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) | 1.0000 (95% CI: [1.0000-1.0000]) |
| Adolescent | 0.9946 (95% CI: [0.9858-1.0000]) | 0.9993 (95% CI: [0.9983-1.0000]) | 0.9987 (95% CI: [0.9969-1.0000]) | 0.9987 (95% CI: [0.9970-1.0000]) |
| Adult | 0.9990 (95% CI: [0.9988-0.9993]) | 0.9971 (95% CI: [0.9961-0.9979]) | 0.9985 (95% CI: [0.9981-0.9989]) | 0.9986 (95% CI: [0.9982-0.9990]) |
| Geriatric | 0.9986 (95% CI: [0.9981-0.9990]) | 0.9952 (95% CI: [0.9938-0.9966]) | 0.9976 (95% CI: [0.9969-0.9983]) | 0.9977 (95% CI: [0.9971-0.9984]) |

By sex:

| Group | Clinical F1-score | Dermoscopic F1-score | Accuracy | Weighted avg F1-score |
|---|---|---|---|---|
| Female | 0.9981 (95% CI: [0.9977-0.9985]) | 0.9958 (95% CI: [0.9947-0.9969]) | 0.9973 (95% CI: [0.9966-0.9979]) | 0.9975 (95% CI: [0.9968-0.9980]) |
| Male | 0.9987 (95% CI: [0.9983-0.9990]) | 0.9956 (95% CI: [0.9946-0.9966]) | 0.9978 (95% CI: [0.9973-0.9983]) | 0.9979 (95% CI: [0.9974-0.9984]) |

Bias Analysis Conclusion:

  • Overall, the model offers robust performance across skin type, sex, and age.
  • The lower proportion of dermoscopy images of darker skin types (Fitzpatrick V-VI) limits model performance for that group under this imaging modality.

Head Detection​

Model Overview​

Reference: R-TF-028-001 AI/ML Description - Head Detection section

This AI model detects and localizes human heads in images.

Clinical Significance: Automated head detection enables precise head surface analysis by ensuring proper head-centered framing.

Data Requirements and Annotation​

Foundational annotation: ICD-11 mapping (completed)

Model-specific annotation: Head detection (R-TF-028-004 Data Annotation Instructions - Non-clinical data)

Images were annotated with tight rectangular bounding boxes around head regions. Each bounding box is defined by its four corner coordinates (x_min, y_min, x_max, y_max), delineating the region containing the head with minimal background.
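For illustration, corner-format boxes can be converted to the normalized centre format used by YOLO-style frameworks; this sketch assumes class index 0 for the single head class.

```python
# Convert (x_min, y_min, x_max, y_max) pixel coordinates to the normalized
# (class, cx, cy, w, h) label format expected by YOLO-style trainers (sketch).
def corners_to_yolo(box, img_w, img_h, cls=0):
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w  # normalized box centre
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w       # normalized width and height
    h = (y_max - y_min) / img_h
    return cls, cx, cy, w, h
```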

Dataset statistics:

  • Images with head annotations: 826 images of heads, with and without skin pathologies
  • Training set: 661 images
  • Validation set: 165 images

Training Methodology​

Architecture: YOLOv8-S model

  • Deep learning model tailored for single-class object detection.
  • Transfer learning from pre-trained COCO weights
  • Input size: 480x480 pixels

Training approach:

The model was trained with the Ultralytics framework using the following hyperparameters:

  • Optimizer: AdamW with learning rate 0.001 and cosine annealing scheduler
  • Batch size: 16
  • Training duration: 150 epochs with early stopping

Pre-processing:

  • Input images were resized and padded to 480x480 pixels.
  • Data augmentation: geometric, color, light, and mosaic augmentations.

Post-processing:

  • Confidence threshold of 0.25 to filter low-confidence predictions.
  • Non-maximum suppression (NMS) with IoU threshold of 0.7 to eliminate overlapping boxes.

Remaining hyperparameters are set to the default values of the Ultralytics framework.
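A minimal sketch of this configuration using the Ultralytics API; the dataset YAML path and the early-stopping patience value are assumptions, as the report does not state them.

```python
# YOLOv8-S training and inference with the documented hyperparameters (sketch).
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # COCO-pretrained YOLOv8-S weights

model.train(
    data="head_dataset.yaml",  # hypothetical dataset definition
    epochs=150,
    patience=20,               # early stopping; exact patience not stated
    batch=16,
    imgsz=480,                 # images resized/padded to 480x480
    optimizer="AdamW",
    lr0=0.001,
    cos_lr=True,               # cosine annealing schedule
)

# Validation reports mAP@50 among other box metrics.
metrics = model.val(data="head_dataset.yaml")
print(metrics.box.map50)

# Inference with the documented post-processing thresholds.
results = model.predict("example.jpg", conf=0.25, iou=0.7)
```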

Performance Results​

Performance is evaluated using mean Average Precision at IoU=0.5 (mAP@50) to account for correct head localization. Statistics are calculated with 95% confidence intervals using bootstrapping (1000 samples). The success criterion is defined as mAP@50 ≥ 0.86, which exceeds the average performance reported in published head detection studies.

| Metric | Result | Success Criterion | Outcome |
|---|---|---|---|
| mAP@50 | 0.99 (0.99-0.99) | ≥ 0.86 | PASS |

Verification and Validation Protocol​

Test Design:

  • Expert-annotated bounding boxes used as reference standard for validation.
  • Evaluation across diverse skin tones and image quality levels.

Complete Test Protocol:

  • Input: RGB images from validation set with expert head annotations
  • Processing: Object detection inference with NMS
  • Output: Predicted bounding boxes with confidence scores and head counts
  • Reference standard: Expert-annotated boxes
  • Statistical analysis: mAP@50 with 95% confidence intervals

Data Analysis Methods:

  • Precision-Recall and F1-confidence curves
  • mAP calculation at IoU=0.5 (mAP@50)
  • Mean Absolute Error (MAE) between predicted and reference standard head counts

Test Conclusions:

  • The model met all success criteria, demonstrating reliable head detection performance suitable for supporting image standardization workflows.
  • The model demonstrates performance superior to the average reported in previously published head detection studies.
  • The model's performance is within acceptable limits and shows excellent generalization.

Bias Analysis and Fairness Evaluation​

Objective: Ensure head detection performs consistently across demographic subpopulations.

Subpopulation Analysis Protocol:

1. Fitzpatrick Skin Tone Analysis:

  • Performance stratified by Fitzpatrick skin tones: I-II (light), III-IV (medium), V-VI (dark)
  • Success criterion: mAP@50 ≥ 0.86 for all skin tone groups

| Subpopulation | Num. training samples | Num. val samples | mAP@50 | Outcome |
|---|---|---|---|---|
| Fitzpatrick I-II | 368 | 102 | 0.99 (0.99-0.99) | PASS |
| Fitzpatrick III-IV | 223 | 44 | 0.99 (0.97-0.99) | PASS |
| Fitzpatrick V-VI | 70 | 19 | 0.99 (0.99-0.99) | PASS |

Results Summary:

  • The model demonstrated excellent performance across all Fitzpatrick skin tones, meeting all success criteria.
  • No significant performance disparities were observed among skin tone categories.
  • The model shows robust generalization across diverse skin tones.

Bias Mitigation Strategies:

  • Image augmentation including color and lighting variations during training
  • Pre-training on diverse data to improve generalization
  • Balanced representation of skin tones in the training dataset

Bias Analysis Conclusion:

  • The model demonstrated consistent and excellent performance across all Fitzpatrick skin tones, with all success criteria met.
  • No performance disparities were observed, indicating fairness in head detection across diverse populations.
  • The model is suitable for deployment in diverse clinical and telemedicine settings.

Summary and Conclusion​

The development and validation activities described in this report provide objective evidence that the AI algorithms for Legit.Health Plus meet their predefined specifications and performance requirements.

Status of model development and validation:

  • ICD Category Distribution and Binary Indicators: [Status to be updated]
  • Visual Sign Intensity Models: [Status to be updated]
  • Lesion Quantification Models: [Status to be updated]
  • Surface Area Models: [Status to be updated]
  • Non-Clinical Support Models: [Status to be updated]

The development process adhered to the company's QMS and followed Good Machine Learning Practices. Models meeting their success criteria are considered verified, validated, and suitable for release and integration into the Legit.Health Plus medical device.

State of the Art Compliance and Development Lifecycle​

Software Development Lifecycle Compliance​

The AI models in Legit.Health Plus were developed in accordance with state-of-the-art software development practices and international standards:

Applicable Standards and Guidelines:

  • IEC 62304:2006+AMD1:2015 - Medical device software lifecycle processes
  • ISO 13485:2016 - Quality management systems for medical devices
  • ISO 14971:2019 - Application of risk management to medical devices
  • ISO/IEC 25010:2011 - Systems and software quality requirements and evaluation (SQuaRE)
  • FDA Guidance on Software as a Medical Device (SaMD) - Clinical evaluation and predetermined change control plans
  • IMDRF/SaMD WG/N41 FINAL:2017 - Software as a Medical Device: Key Definitions
  • Good Machine Learning Practice (GMLP) - FDA/Health Canada/UK MHRA Guiding Principles (2021)
  • Proposed Regulatory Framework for Modifications to AI/ML-Based SaMD - FDA Discussion Paper (2019)

Development Lifecycle Phases Implemented:

  1. Requirements Analysis: Comprehensive AI model specifications defined in R-TF-028-001 AI/ML Description
  2. Development Planning: Structured development plan in R-TF-028-002 AI Development Plan
  3. Risk Management: AI-specific risk analysis in R-TF-028-011 AI Risk Matrix
  4. Design and Architecture: State-of-the-art architectures (Vision Transformers, CNNs, object detection, segmentation)
  5. Implementation: Following coding standards and version control practices
  6. Verification: Unit testing, integration testing, and algorithm validation
  7. Validation: Clinical performance testing against predefined success criteria
  8. Release: Version-controlled releases with complete traceability
  9. Maintenance: Post-market surveillance and performance monitoring

Version Control and Traceability:

  • All model versions tracked in version control systems (Git)
  • Complete traceability from requirements to validation results
  • Dataset versions documented with checksums and provenance
  • Model artifacts stored with complete training metadata
  • Documented change control process for model updates

State of the Art in AI Development​

Best Practices Implemented:

1. Data Management Excellence:

  • Multi-source data collection with demographic diversity
  • Rigorous data quality control and curation processes
  • Systematic annotation protocols with multi-expert consensus
  • Data partitioning strategies preventing data leakage
  • Sequestered test sets for unbiased evaluation

2. Model Architecture Selection:

  • Use of state-of-the-art architectures (Vision Transformers for classification, YOLO/Faster R-CNN for detection, U-Net/DeepLab for segmentation)
  • Transfer learning from large-scale pre-trained models
  • Architecture selection based on published benchmark performance
  • Justification of architecture choices documented per model

3. Training Best Practices:

  • Systematic hyperparameter optimization
  • Cross-validation and early stopping to prevent overfitting
  • Data augmentation for robustness and generalization
  • Multi-task learning where clinically appropriate
  • Monitoring of training metrics and convergence

4. Model Calibration and Post-Processing:

  • Temperature scaling for probability calibration
  • Test-time augmentation for robust predictions
  • Ensemble methods where applicable
  • Uncertainty quantification for model predictions

5. Comprehensive Validation:

  • Independent test sets never used during development
  • External validation on diverse datasets
  • Clinical reference standard from expert consensus
  • Statistical rigor with confidence intervals
  • Comprehensive subpopulation analysis

6. Bias Mitigation and Fairness:

  • Systematic bias analysis across demographic subpopulations
  • Fitzpatrick skin type stratification in all analyses
  • Data collection strategies ensuring demographic diversity
  • Bias monitoring models (DIQA, Fitzpatrick identification)
  • Transparent reporting of performance disparities

7. Explainability and Transparency:

  • Attention visualization for model interpretability (where applicable)
  • Clinical reasoning transparency (top-k predictions with probabilities)
  • Documentation of model limitations and known failure modes
  • Clear communication of uncertainty in predictions

Risk Management Throughout Lifecycle​

Risk Management Process:

Risk management is integrated throughout the entire AI development lifecycle following ISO 14971:

1. Risk Analysis:

  • Identification of AI-specific hazards (data bias, model errors, distribution shift)
  • Hazardous situation analysis (incorrect predictions leading to clinical harm)
  • Risk estimation combining probability and severity

2. Risk Evaluation:

  • Comparison of risks against predefined acceptability criteria
  • Benefit-risk analysis for each AI model
  • Clinical impact assessment of potential errors

3. Risk Control:

  • Inherent safety by design (offline models, no learning from deployment data)
  • Protective measures (DIQA filtering, domain validation, confidence thresholds)
  • Information for safety (user training, clinical decision support context)

4. Residual Risk Evaluation:

  • Assessment of risks after control measures
  • Verification that all risks reduced to acceptable levels
  • Overall residual risk acceptability

5. Risk Management Review:

  • Production and post-production information review
  • Update of risk management file
  • Traceability to safety risk matrix (R-TF-028-011 AI Risk Matrix)

AI-Specific Risk Controls:

  • Data Quality Risks: Multi-source collection, systematic annotation, quality control
  • Model Overfitting: Sequestered test sets, cross-validation, regularization
  • Bias and Fairness: Demographic diversity, subpopulation analysis, bias monitoring
  • Model Uncertainty: Calibration, confidence scores, uncertainty quantification
  • Distribution Shift: Domain validation, DIQA filtering, performance monitoring
  • Clinical Misinterpretation: Clear communication, clinical context, user training

Information Security​

Cybersecurity Considerations:

The AI models are designed with information security principles integrated throughout development:

1. Model Security:

  • Model parameters stored securely with access controls
  • Model integrity verification (checksums, digital signatures)
  • Protection against model extraction or reverse engineering
  • Secure deployment pipelines

2. Data Security:

  • Patient data protection throughout development (de-identification, anonymization)
  • Secure data storage with encryption at rest
  • Secure data transmission with encryption in transit
  • Access controls and audit logging for training data

3. Inference Security:

  • Secure API endpoints for model inference
  • Input validation to prevent adversarial attacks
  • Rate limiting and authentication
  • Output validation and sanity checking

4. Privacy Considerations:

  • No patient-identifiable information stored in models
  • Training data anonymization and de-identification
  • Compliance with GDPR, HIPAA, and applicable privacy regulations
  • Data minimization principles applied

5. Vulnerability Management:

  • Regular security assessments of AI infrastructure
  • Dependency scanning for software libraries
  • Patch management for underlying frameworks
  • Incident response procedures

6. Adversarial Robustness:

  • Consideration of adversarial attack scenarios
  • Input preprocessing to detect anomalous inputs
  • Domain validation to reject out-of-distribution inputs
  • DIQA filtering to reject manipulated or low-quality images

Cybersecurity Risk Assessment:

Cybersecurity risks are addressed in the overall device risk management file, including:

  • Threat modeling for AI components
  • Attack surface analysis
  • Mitigation strategies and security controls
  • Monitoring and incident response

Verification and Validation Strategy​

Verification Activities (confirming that the AI models implement their specifications):

  • Code reviews and static analysis
  • Unit testing of model components
  • Integration testing of model pipelines
  • Architecture validation against specifications
  • Performance benchmarking against target metrics

Validation Activities (confirming that AI models meet intended use):

  • Independent test set evaluation with sequestered data
  • External validation on diverse datasets
  • Clinical reference standard comparison
  • Subpopulation performance analysis
  • Real-world performance assessment
  • Usability and clinical workflow validation

Documentation of Verification and Validation:

Complete documentation is maintained for all verification and validation activities:

  • Test protocols with detailed methodology
  • Complete test results with statistical analysis
  • Data summaries and test conclusions
  • Traceability from requirements to test results
  • Identified deviations and their resolutions

This comprehensive approach ensures compliance with GSPR 17.2 requirements for software development in accordance with state of the art, incorporating development lifecycle management, risk management, information security, verification, and validation.

Integration Verification Package​

To ensure that the AI models produce identical outputs when integrated into the Legit.Health Plus software environment as they did during development and validation, an Integration Verification Package has been prepared for each model in accordance with GP-028 AI Development.

Purpose​

The Integration Verification Package enables the Software Development team to:

  • Verify that models are correctly integrated without alterations to their inference behavior
  • Detect any environment discrepancies that could affect model outputs
  • Provide objective evidence of output equivalence between development and production environments
  • Support regulatory compliance by demonstrating traceability between development validation and deployed system verification per IEC 62304

Package Location and Structure​

All Integration Verification Packages are stored in the secure, version-controlled S3 bucket with the following structure:

```
s3://legit-health-plus/integration-verification/
├── icd-category-distribution/
│   ├── images/
│   ├── expected_outputs.csv
│   └── manifest.json
├── erythema-intensity/
│   ├── images/
│   ├── expected_outputs.csv
│   └── manifest.json
├── desquamation-intensity/
│   ├── images/
│   ├── expected_outputs.csv
│   └── manifest.json
├── induration-intensity/
│   ├── images/
│   ├── expected_outputs.csv
│   └── manifest.json
├── pustule-intensity/
│   ├── images/
│   ├── expected_outputs.csv
│   └── manifest.json
├── [... additional models ...]
├── diqa/
│   ├── images/
│   ├── expected_outputs.csv
│   └── manifest.json
└── domain-validation/
    ├── images/
    ├── expected_outputs.csv
    └── manifest.json
```

Package Contents Per Model​

For each AI model in the Legit.Health Plus device, the Integration Verification Package includes:

Reference Test Images​

  • Location: s3://legit-health-plus/integration-verification/{MODEL_NAME}/images/
  • Content: A curated subset of images from the model's held-out test set
  • Selection Criteria: Images representative of the model's input domain, including diverse conditions, demographics, and imaging modalities
  • Format: Original image format (JPEG/PNG) without additional processing

Expected Outputs File​

  • Location: s3://legit-health-plus/integration-verification/{MODEL_NAME}/expected_outputs.csv
  • Schema:

| Column | Type | Description |
|---|---|---|
| image_id | string | Unique identifier matching the image filename |
| expected_output | string/float | Model's expected output (JSON-encoded for complex outputs) |
| output_type | string | Output category: `classification_probability`, `regression_value`, `segmentation_mask_hash`, `detection_boxes` |
| preprocessing_hash | string | SHA-256 hash of the preprocessed input tensor |
  • Generation: Outputs are generated from the validated development model using the exact configuration documented in this report (the preprocessing hash computation is sketched below)
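A sketch of how the preprocessing hash could be produced, using the resize and normalization settings shown in the manifest example below; the tensor dtype and byte layout are conventions that must match between the development and integration environments.

```python
# SHA-256 of the preprocessed input tensor (sketch).
import hashlib
import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(path: str, size=(272, 272)) -> np.ndarray:
    # Resize to the manifest's input size and apply ImageNet normalization.
    img = Image.open(path).convert("RGB").resize(size)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return (arr - IMAGENET_MEAN) / IMAGENET_STD

def preprocessing_hash(path: str) -> str:
    return hashlib.sha256(preprocess(path).tobytes()).hexdigest()
```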

Verification Manifest​

  • Location: s3://legit-health-plus/integration-verification/{MODEL_NAME}/manifest.json
  • Contents:

```json
{
  "model_name": "erythema-intensity",
  "model_version": "1.0.0",
  "package_version": "1.0.0",
  "creation_timestamp": "2026-01-27T10:00:00Z",
  "created_by": "AI Team",
  "num_test_images": 100,
  "model_weights_sha256": "abc123...",
  "preprocessing": {
    "resize": [272, 272],
    "normalization": "imagenet",
    "color_space": "RGB"
  },
  "acceptance_criteria": {
    "metric": "output_tolerance",
    "tolerance": 1e-5,
    "pass_rate_required": 1.0
  },
  "development_report_reference": "R-TF-028-005 v1.0"
}
```
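A minimal sketch of the integrity check enabled by model_weights_sha256: recompute the checksum of the deployed weights file and compare it with the manifest; file paths are illustrative.

```python
# Verify deployed model weights against the manifest checksum (sketch).
import hashlib
import json

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(manifest_path: str, weights_path: str) -> bool:
    with open(manifest_path) as f:
        manifest = json.load(f)
    return sha256_of(weights_path) == manifest["model_weights_sha256"]
```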

Acceptance Criteria​

The following acceptance criteria apply to integration verification:

| Model Type | Metric | Acceptance Criterion |
|---|---|---|
| Classification (ICD, Binary Indicators) | Probability difference | ε ≤ 1e-5 per class |
| Intensity Quantification | Output score difference | ε ≤ 1e-5 |
| Segmentation | Mask IoU | ≥ 0.9999 |
| Detection | Box IoU + class match | IoU ≥ 0.9999, exact class match |
| Quality Assessment (DIQA) | Score difference | ε ≤ 1e-5 |

Overall Pass Criterion: 100% of test images must meet the acceptance criteria for the integration verification to pass.
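For the segmentation criterion, a sketch of the mask IoU check against the ≥ 0.9999 threshold; function names are illustrative.

```python
# Mask IoU acceptance check for segmentation models (sketch).
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are identical
    return np.logical_and(a, b).sum() / union

def segmentation_passes(actual: np.ndarray, expected: np.ndarray,
                        threshold: float = 0.9999) -> bool:
    return mask_iou(actual, expected) >= threshold
```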

Model-Specific Package Details​

The following table summarizes the Integration Verification Package for each model:

Clinical Models - ICD Classification and Binary Indicators​

| Model | Output Type | Storage Path |
|---|---|---|
| ICD Category Distribution | Classification probabilities (346 classes) | icd-category-distribution/ |
| Binary Indicators | 6 probability scores | icd-category-distribution/ |

Clinical Models - Visual Sign Intensity Quantification​

| Model | Output Type | Storage Path |
|---|---|---|
| Erythema Intensity | Regression (0-9 scale) | erythema-intensity/ |
| Desquamation Intensity | Regression (0-9 scale) | desquamation-intensity/ |
| Induration Intensity | Regression (0-9 scale) | induration-intensity/ |
| Pustule Intensity | Regression (0-9 scale) | pustule-intensity/ |
| Crusting Intensity | Regression (0-9 scale) | crusting-intensity/ |
| Xerosis Intensity | Regression (0-9 scale) | xerosis-intensity/ |
| Swelling Intensity | Regression (0-9 scale) | swelling-intensity/ |
| Oozing Intensity | Regression (0-9 scale) | oozing-intensity/ |
| Excoriation Intensity | Regression (0-9 scale) | excoriation-intensity/ |
| Lichenification Intensity | Regression (0-9 scale) | lichenification-intensity/ |

Clinical Models - Wound Characteristic Assessment​

| Model | Output Type | Storage Path |
|---|---|---|
| Wound Edge: Diffused | Binary classification | wound-edge-diffused/ |
| Wound Edge: Thickened | Binary classification | wound-edge-thickened/ |
| Wound Edge: Delimited | Binary classification | wound-edge-delimited/ |
| Wound Edge: Indistinguishable | Binary classification | wound-edge-indistinguishable/ |
| Wound Edge: Damaged | Binary classification | wound-edge-damaged/ |
| Wound Tissue: Bone | Binary classification | wound-tissue-bone/ |
| Wound Tissue: Subcutaneous | Binary classification | wound-tissue-subcutaneous/ |
| Wound Tissue: Muscle | Binary classification | wound-tissue-muscle/ |
| Wound Tissue: Intact | Binary classification | wound-tissue-intact/ |
| Wound Tissue: Dermis-Epidermis | Binary classification | wound-tissue-dermis-epidermis/ |
| Wound Bed: Necrotic | Binary classification | wound-bed-necrotic/ |
| Wound Bed: Closed | Binary classification | wound-bed-closed/ |
| Wound Bed: Granulation | Binary classification | wound-bed-granulation/ |
| Wound Bed: Epithelial | Binary classification | wound-bed-epithelial/ |
| Wound Bed: Slough | Binary classification | wound-bed-slough/ |
| Wound Exudate: Serous | Binary classification | wound-exudate-serous/ |
| Wound Exudate: Fibrinous | Binary classification | wound-exudate-fibrinous/ |
| Wound Exudate: Purulent | Binary classification | wound-exudate-purulent/ |
| Wound Exudate: Bloody | Binary classification | wound-exudate-bloody/ |
| Perilesional Erythema | Binary classification | perilesional-erythema/ |
| Perilesional Maceration | Binary classification | perilesional-maceration/ |
| Biofilm Tissue | Binary classification | biofilm-tissue/ |
| Wound Stage Classification | Multi-class (6 stages) | wound-stage/ |
| Wound Intensity (AWOSI) | Regression (0-10 scale) | wound-awosi/ |

Clinical Models - Lesion Quantification​

| Model | Output Type | Storage Path |
|---|---|---|
| Inflammatory Nodular Lesion | Detection (bounding boxes + count) | inflammatory-nodular/ |
| Acneiform Lesion Types | Multi-class detection (5 classes) | acneiform-lesion-types/ |
| Inflammatory Lesion | Detection (bounding boxes + count) | inflammatory-lesion/ |
| Hive Lesion | Detection (bounding boxes + count) | hive-lesion/ |
| Nail Lesion Surface | Segmentation mask | nail-lesion-surface/ |

Clinical Models - Surface Area Quantification​

| Model | Output Type | Storage Path |
|---|---|---|
| Wound Bed Surface | Segmentation mask | wound-bed-surface/ |
| Wound Granulation Surface | Segmentation mask | wound-granulation-surface/ |
| Wound Biofilm/Slough Surface | Segmentation mask | wound-biofilm-surface/ |
| Wound Necrosis Surface | Segmentation mask | wound-necrosis-surface/ |
| Wound Maceration Surface | Segmentation mask | wound-maceration-surface/ |
| Wound Orthopedic Material Surface | Segmentation mask | wound-orthopedic-surface/ |
| Wound Bone/Cartilage/Tendon Surface | Segmentation mask | wound-bone-surface/ |
| Hair Loss Surface | Segmentation mask | hair-loss-surface/ |
| Hypopigmentation/Depigmentation | Segmentation mask | hypopigmentation-surface/ |
| Hyperpigmentation Surface | Segmentation mask | hyperpigmentation-surface/ |
| Erythema Surface | Segmentation mask | erythema-surface/ |

Clinical Models - Pattern Identification​

| Model | Output Type | Storage Path |
|---|---|---|
| Acneiform Inflammatory Pattern | Regression (IGA 0-4 scale) | acneiform-pattern/ |
| Follicular and Inflammatory Pattern | Multi-class (Hurley stages) | follicular-inflammatory-pattern/ |
| Inflammatory Pattern | Classification | inflammatory-pattern/ |
| Inflammatory Pattern Indicator | Binary classification | inflammatory-pattern-indicator/ |

Non-Clinical Models​

| Model | Output Type | Storage Path |
|---|---|---|
| DIQA | Quality score (0-1) | diqa/ |
| Domain Validation | Classification (3 classes) | domain-validation/ |
| Skin Surface Segmentation | Segmentation mask | skin-surface-segmentation/ |
| Body Surface Segmentation | Segmentation mask | body-surface-segmentation/ |
| Head Detection | Detection (bounding boxes) | head-detection/ |

Verification Procedure for Software Integration Team​

The Software Development team shall follow this procedure after model integration:

  1. Environment Preparation:

    • Configure the integration environment with dependencies specified in R-TF-028-006 AI Release Report
    • Download the Integration Verification Package from S3
    • Verify package integrity using manifest checksums
  2. Inference Execution:

    • Process all reference test images through the integrated model
    • Record outputs in the same format as expected_outputs.csv
    • Document runtime environment configuration
  3. Output Comparison:

    • Compare actual outputs against expected outputs using the acceptance criteria (a comparison sketch follows this procedure)
    • Calculate match rate for each image
    • Flag any discrepancies
  4. Results Documentation:

    • Generate Integration Verification Report including:
      • Test execution date and environment details
      • Pass/fail status per image
      • Overall pass rate
      • Any deviations with root cause analysis
    • Store report as software verification evidence per IEC 62304
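A minimal sketch of the output-comparison step for a scalar-output model (e.g., intensity quantification), assuming a hypothetical run_model callable wrapping the integrated inference pipeline; probability vectors and masks would use the per-class tolerance and IoU criteria from the acceptance table instead.

```python
# Compare integrated-model outputs against expected_outputs.csv (sketch).
import csv

def verify_integration(expected_csv, image_dir, run_model, tol=1e-5):
    failures = []
    with open(expected_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        actual = run_model(f"{image_dir}/{row['image_id']}")
        expected = float(row["expected_output"])
        if abs(actual - expected) > tol:
            failures.append((row["image_id"], expected, actual))
    pass_rate = 1.0 - len(failures) / len(rows)
    # Overall pass criterion: 100% of images within tolerance.
    return pass_rate == 1.0, pass_rate, failures
```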

Traceability​

| Artifact | Version | Reference |
|---|---|---|
| AI Development Report | 1.0 | This document |
| AI Release Report | 1.0 | R-TF-028-006 |
| Integration Verification Package | 1.0 | S3 bucket |
| Model Weights | Per model | See manifest.json |

The Integration Verification Package version is locked to the corresponding model version and AI Development Report. Any model retraining requires generation of a new Integration Verification Package.

AI Risks Assessment Report​

AI Risk Assessment​

A comprehensive risk assessment was conducted throughout the development lifecycle in accordance with the R-TF-028-002 AI Development Plan. All identified AI-specific risks related to data, model training, and performance were documented and analyzed in the R-TF-028-011 AI Risk Matrix.

AI Risk Treatment​

Control measures were implemented to mitigate all identified risks. Key controls included:

  • Rigorous data curation and multi-source collection to mitigate bias.
  • Systematic model training and validation procedures to prevent overfitting.
  • Use of a sequestered test set to ensure unbiased performance evaluation.
  • Implementation of model calibration to improve the reliability of outputs.

Residual AI Risk Assessment​

After the implementation of control measures, a residual risk analysis was performed. All identified AI risks were successfully reduced to an acceptable level.

AI Risk and Traceability with Safety Risk​

Safety risks related to the AI algorithms (e.g., incorrect assessment suggestion, misinterpretation of data) were identified and traced back to their root causes in the AI development process. These safety risks have been escalated for management in the overall device Safety Risk Matrix, in line with ISO 14971.

Conclusion​

The AI development process has successfully managed and mitigated inherent risks to an acceptable level. The benefits of using the Legit.Health Plus algorithms as a clinical decision support tool are judged to outweigh the residual risks.

Related Documents​

Project Design and Plan​

  • R-TF-028-001 AI/ML Description - Complete specifications for all AI models
  • R-TF-028-002 AI Development Plan - Development methodology and lifecycle
  • R-TF-028-011 AI Risk Matrix - AI-specific risk assessment and mitigation

Data Collection and Annotation​

  • R-TF-028-003 Data Collection Instructions - Public datasets and clinical study data collection protocols
  • R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping - Foundational clinical label standardization (completed)
  • R-TF-028-004 Data Annotation Instructions - Visual Signs - Intensity, count, and extent annotations for visual sign models (completed)
  • R-TF-028-004 Data Annotation Instructions - DIQA - Image quality assessment annotations (to be created)
  • R-TF-028-004 Data Annotation Instructions - Fitzpatrick - Skin type annotations (to be created)
  • R-TF-028-004 Data Annotation Instructions - Body Site - Anatomical location annotations (if needed)

Signature meaning

The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:

  • Author: JD-009
  • Reviewer: JD-009
  • Approver: JD-005