R-TF-028-001 AI/ML Description
Table of contents
- Purpose
- Scope
- Description and Specifications
- ICD Category Distribution
- Binary Indicators
- Erythema Intensity Quantification
- Desquamation Intensity Quantification
- Induration Intensity Quantification
- Crusting Intensity Quantification
- Xerosis Intensity Quantification
- Swelling Intensity Quantification
- Oozing Intensity Quantification
- Excoriation Intensity Quantification
- Lichenification Intensity Quantification
- Exudation
- Wound Depth
- Wound Border
- Undermining
- Hair Loss Surface Quantification
- Necrotic Tissue
- Granulation Tissue
- Epithelialization
- Nodule Quantification
- Papule Quantification
- Pustule Quantification
- Cyst Quantification
- Comedone Quantification
- Abscess Quantification
- Draining Tunnel Quantification
- Inflammatory Lesion Quantification
- Inflammatory Lesion Surface Quantification
- Exposed Wound, Bone and/or Adjacent Tissue Surface Quantification
- Slough or Biofilm Surface Quantification
- Maceration Surface Quantification
- External Material over the Lesion Surface Quantification
- Hypopigmentation or Depigmentation Surface Quantification
- Hyperpigmentation Surface Quantification
- Scar
- Dermatology Image Quality Assessment (DIQA)
- Domain Validator
- Fitzpatrick skin type detection
- Data Specifications
- Other Specifications
- Cybersecurity and Transparency
- Specifications and Risks
- Integration and Environment
- References
- Traceability to QMS Records
Purpose
This document defines the specifications, performance requirements, and data needs for the Artificial Intelligence/Machine Learning (AI/ML) models used in the Legit.Health Plus device.
Scope
This document details the design and performance specifications for all AI/ML algorithms integrated into the Legit.Health Plus device. It establishes the foundation for the development, validation, and risk management of these models.
This description covers the following key areas for each algorithm:
- Algorithm description, clinical objectives, and justification.
- Performance endpoints and acceptance criteria.
- Specifications for the data required for development and evaluation.
- Requirements related to cybersecurity, transparency, and integration.
- Links between the AI/ML specifications and the overall risk management process.
Description and Specifications
ICD Category Distribution
Algorithm Description
We employ a deep learning model to analyze clinical or dermoscopic lesion images and output a probability distribution across ICD-11 categories. These classifiers are designed to recognize fine-grained disease distinctions, leveraging attention mechanisms to capture both local and global image features, often outperforming conventional CNN-only methods [cite: 77].
The system highlights the top five ICD-11 disease categories, each accompanied by its corresponding code and confidence score, thereby supporting clinicians with both ranking and probability information—a strategy shown to enhance diagnostic confidence and interpretability in multi-class dermatological AI systems [cite: 78, 79].
Algorithm Objectives
- Improve diagnostic accuracy, aiming for an uplift of approximately 10–15% in top-1 and top-5 prediction metrics compared to baseline CNN approaches [cite: 78, 80, 81].
- Assist clinicians in differential diagnosis, especially in ambiguous or rare cases, by presenting a ranked shortlist that enables efficient decision-making.
- Enhance trust and interpretability—leveraging attention maps and multi-modal fusion to offer transparent reasoning and evidence for suggested categories [cite: 79].
Justification (Clinical Evidence): Presenting a ranked list of likely diagnoses (e.g., top-5) is an evidence-based strategy.
- In reader studies, AI-based multiclass probabilities improved clinician accuracy beyond AI or physicians alone, with the largest benefit for less experienced clinicians [cite: 82, 83].
- Han et al. reported sensitivity +12.1%, specificity +1.1%, and top-1 accuracy +7.0% improvements when physicians were supported with AI outputs including top-k predictions [cite: 83].
- Clinical decision support tools providing ranked differentials improved diagnostic accuracy by up to 34% without prolonging consultations [cite: 84].
- Systematic reviews confirm that AI assistance consistently improves clinician accuracy, especially for non-specialists [cite: 85, 86].
Algorithm Endpoints
Performance is evaluated using Top-k Accuracy compared to expert-labeled ground truth.
Metric | Threshold | Interpretation |
---|---|---|
Top-1 Accuracy | ≥ 55% | Meets minimum diagnostic utility |
Top-3 Accuracy | ≥ 70% | Reliable differential diagnosis |
Top-5 Accuracy | ≥ 80% | Substantial agreement with expert performance |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Implement image analysis models capable of ICD classification [cite: 15].
- Output normalized probability distributions (sum = 100%).
- Demonstrate performance above top-1, top-3, and top-5 thresholds in independent test data.
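To make the endpoint concrete, the sketch below shows one way top-k accuracy could be computed over a validation set (Python with NumPy; the function and toy data are illustrative, not the device's evaluation code):

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of cases whose true label is among the k most probable classes.

    probs:  (n_cases, n_classes) predicted ICD-11 probability distributions
    labels: (n_cases,) integer indices of the expert-labeled ground truth
    """
    top_k = np.argsort(probs, axis=1)[:, -k:]  # k highest-probability classes per case
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

# Toy example: 3 cases, 5 ICD-11 categories
probs = np.array([
    [0.50, 0.20, 0.15, 0.10, 0.05],
    [0.05, 0.10, 0.15, 0.20, 0.50],
    [0.30, 0.30, 0.20, 0.10, 0.10],
])
labels = np.array([0, 4, 2])
print(top_k_accuracy(probs, labels, k=1))  # 0.667
print(top_k_accuracy(probs, labels, k=3))  # 1.0
```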
Binary Indicators
Algorithm Description
Binary indicators are derived from the ICD-11 distribution using a dermatologist-defined mapping matrix. Each indicator reflects the aggregated probability that a case belongs to clinically meaningful categories requiring differential triage or diagnostic attention.
The six binary indicators are:
- Malignant: probability that the lesion is a confirmed malignancy (e.g., melanoma, squamous cell carcinoma, basal cell carcinoma).
- Pre-malignant: probability of conditions with malignant potential (e.g., actinic keratosis, Bowen’s disease).
- Associated with malignancy: benign or inflammatory conditions with frequent overlap or mimicry of malignant presentations (e.g., atypical nevi, pigmented seborrheic keratoses).
- Pigmented lesion: probability that the lesion belongs to the pigmented subgroup, important for melanoma risk assessment.
- Urgent referral: lesions likely requiring dermatological evaluation within 48 hours (e.g., suspected melanoma, rapidly growing nodular lesions, bleeding or ulcerated malignancies).
- High-priority referral: lesions that should be seen within 2 weeks according to dermatology referral guidelines (e.g., suspected non-melanoma skin cancer, premalignant lesions with malignant potential).
The binary mapping is defined by a dermatologist-curated matrix that links each ICD-11 category to the indicators it contributes to; each indicator's probability is obtained by aggregating the probabilities of its mapped categories.
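A minimal sketch of how this aggregation could work, assuming a hypothetical binary matrix M whose entry M[j, k] = 1 when ICD-11 category k contributes to indicator j (the actual dermatologist-defined matrix is not reproduced here):

```python
import numpy as np

# Hypothetical mapping matrix: 6 indicators x 5 toy ICD-11 categories.
# M[j, k] = 1 if ICD-11 category k contributes to indicator j.
M = np.array([
    [1, 0, 0, 0, 0],  # Malignant
    [0, 1, 0, 0, 0],  # Pre-malignant
    [0, 0, 1, 0, 0],  # Associated with malignancy
    [1, 0, 1, 0, 0],  # Pigmented lesion
    [1, 0, 0, 0, 0],  # Urgent referral
    [1, 1, 0, 0, 0],  # High-priority referral
])

def binary_indicators(icd_probs: np.ndarray) -> np.ndarray:
    """Aggregate an ICD-11 probability distribution into the six indicators."""
    return M @ icd_probs  # each row sums the probabilities of its mapped categories

p = np.array([0.10, 0.25, 0.05, 0.40, 0.20])  # toy ICD-11 distribution
print(binary_indicators(p))  # [0.10, 0.25, 0.05, 0.15, 0.10, 0.35]
```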
Algorithm Objectives
- Clinical triage support: Provide clinicians with clear case-prioritization signals, improving patient flow and resource allocation [91, 92].
- Malignancy risk quantification: Objectively assess malignancy and premalignancy likelihood to reduce missed diagnoses [93].
- Referral urgency standardization: Align algorithm outputs with international clinical guidelines for dermatology referrals (e.g., NICE and EADV recommendations: urgent ≤48h, high-priority ≤2 weeks) [94, 95].
- Improve patient safety: Flag high-risk pigmented lesions for expedited evaluation, ensuring melanoma is not delayed in triage [96, 97].
- Reduce variability: Decrease inter-observer variation in urgency assignment by providing consistent, evidence-based binary outputs [98].
Algorithm Endpoints
Performance of binary indicators is evaluated using AUC (Area Under the ROC Curve) against dermatologists’ consensus labels.
AUC Score | Agreement Category | Interpretation |
---|---|---|
< 0.70 | Poor | Not acceptable for clinical use |
0.70 – 0.79 | Fair | Below acceptance threshold |
≥ 0.80 | Good | Meets acceptance threshold |
≥ 0.90 | Excellent | High robustness |
≥ 0.95 | Outstanding | Near-expert level performance |
Success criteria: Each binary indicator must achieve AUC ≥ 0.80 with 95% confidence intervals, validated against independent datasets including malignant, premalignant, pigmented, and urgent referral cases.
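One way the AUC criterion and its 95% confidence interval could be checked is sketched below, using scikit-learn and a percentile bootstrap on hypothetical data; whether acceptance is judged on the point estimate or the CI lower bound is not fixed by this document, so the sketch reports both:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical validation data for one indicator (e.g., "Malignant"):
# y_true from dermatologists' consensus, y_score from the mapped model output.
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=500), 0.0, 1.0)

auc = roc_auc_score(y_true, y_score)

# Percentile bootstrap for the 95% confidence interval
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
        continue
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"AUC = {auc:.3f} (95% CI {lo:.3f}-{hi:.3f}); point estimate >= 0.80: {auc >= 0.80}")
```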
Requirements
- Implement all six binary indicators:
- Malignant
- Pre-malignant
- Associated with malignancy
- Pigmented lesion
- Urgent referral (≤48h)
- High-priority referral (≤2 weeks)
- Validate performance on diverse and independent datasets representing both common and rare conditions.
- Ensure ≥0.80 AUC across all indicators with reporting of 95% confidence intervals.
- Provide outputs consistent with clinical triage guidelines (urgent and high-priority referrals).
Erythema Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model’s softmax-normalized probability that the erythema intensity belongs to ordinal category $i$ (ranging from minimal to maximal erythema).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous erythema severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction accounts for the full probability distribution rather than only the most likely class, yielding a more stable and clinically interpretable severity score.
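A minimal sketch of this post-processing step, assuming the ten ordinal categories are indexed 1 through 10 as in the requirements below (Python with NumPy; the toy probabilities are illustrative):

```python
import numpy as np

def severity_score(probs: np.ndarray) -> float:
    """Weighted expected value over 10 ordinal intensity categories.

    probs: softmax output of shape (10,), summing to 1.
    """
    categories = np.arange(1, 11)  # ordinal categories 1..10
    return float(np.dot(categories, probs))

# Toy example: probability mass concentrated around category 4
p = np.array([0.00, 0.05, 0.20, 0.40, 0.20, 0.10, 0.05, 0.00, 0.00, 0.00])
print(severity_score(p))  # 4.25
```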
Algorithm Objectives
- Support healthcare professionals in the assessment of erythema severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is well documented in erythema scoring scales (e.g., Clinician’s Erythema Assessment [CEA] interrater ICC ≈ 0.60, weighted κ ≈ 0.69) [cite: Tan 2014].
- Ensure reproducibility and robustness across imaging conditions (e.g., brightness, contrast, device type).
- Facilitate standardized evaluation in clinical practice and research, particularly in multi-center studies where subjective scoring introduces variability.
Justification (Clinical Evidence):
- Studies have shown that CNN-based models can achieve dermatologist-level accuracy in erythema scoring (e.g., ResNet models reached ~99% accuracy in erythema detection under varying conditions) [cite: Lee 2021, Cho 2021].
- Automated erythema quantification has demonstrated reduced variability compared to human raters in tasks such as Minimum Erythema Dose (MED) and SPF index assessments [cite: Kim 2023].
- Clinical scales such as the CEA, though widely used, suffer from subjectivity; integrating AI quantification can strengthen reliability and reproducibility [cite: Tan 2014].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
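The RMAE computation could be sketched as follows. The normalization is an assumption here (MAE divided by the 9-point span of the 1–10 ordinal scale), since the document specifies only that error is measured relative to expert consensus; the identical evaluation applies to the other intensity signs in this section.

```python
import numpy as np

def rmae(predicted: np.ndarray, expert_scores: np.ndarray) -> float:
    """Relative MAE of predicted severity scores vs. mean expert consensus.

    predicted:     (n_cases,) continuous model scores
    expert_scores: (n_cases, n_experts) per-expert severity ratings
    Dividing MAE by the 1-10 scale span (9) is an assumed normalization.
    """
    consensus = expert_scores.mean(axis=1)
    return float(np.mean(np.abs(predicted - consensus)) / 9.0)

# Toy example: 3 cases, each rated by 3 experts
experts = np.array([[4, 5, 4], [7, 8, 8], [2, 2, 3]])
model = np.array([4.2, 7.5, 2.4])
print(f"RMAE = {rmae(model, experts):.1%}")  # acceptance requires <= 20%
```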
Requirements:
- Output a normalized probability distribution across 10 ordinal erythema categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming the average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset to ensure generalizability.
Desquamation Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model’s softmax-normalized probability that the desquamation intensity belongs to ordinal category $i$ (ranging from minimal to maximal scaling/peeling).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous desquamation severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing desquamation severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is well documented in visual scaling/peeling assessments in dermatology.
- Ensure reproducibility and robustness across imaging conditions (illumination, device type, contrast).
- Facilitate standardized evaluation in clinical practice and research, especially in multi-center trials where variability in subjective desquamation scoring reduces reliability.
Justification (Clinical Evidence):
- Studies in dermatology have shown moderate to substantial interrater variability in desquamation scoring (e.g., psoriasis and radiation dermatitis grading), with κ values often < 0.70 [87, 88].
- Automated computer vision and CNN-based methods have demonstrated high accuracy in texture and scaling detection, often surpassing human raters in consistency [89, 90].
- Objective desquamation quantification can improve reproducibility in psoriasis PASI scoring and oncology trials, where scaling/desquamation is a critical endpoint but prone to subjectivity [87].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal desquamation categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various Fitzpatrick skin types, anatomical sites, imaging devices) to ensure generalizability.
Induration Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model’s softmax-normalized probability that the induration intensity belongs to ordinal category $i$ (ranging from minimal to maximal induration).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous induration severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing induration severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is well documented in visual induration assessments in dermatology.
- Ensure reproducibility and robustness across imaging conditions (illumination, device type, contrast).
- Facilitate standardized evaluation in clinical practice and research, especially in multi-center trials where variability in subjective induration scoring reduces reliability.
Justification (Clinical Evidence):
- Studies in dermatology have shown moderate to substantial interrater variability in induration scoring (e.g., psoriasis and other inflammatory dermatoses), with κ values often < 0.70 [87].
- Automated computer vision and CNN-based methods have demonstrated high accuracy in texture and firmness detection, often surpassing human raters in consistency.
- Objective induration quantification can improve reproducibility in clinical trials and routine care, where induration is a critical endpoint but prone to subjectivity.
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal induration categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various Fitzpatrick skin types, anatomical sites, imaging devices) to ensure generalizability.
Crusting Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model’s softmax-normalized probability that the crusting intensity belongs to ordinal category $i$ (ranging from minimal to maximal crusting).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous crusting severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing crusting severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is well documented in visual crusting assessments in dermatology.
- Ensure reproducibility and robustness across imaging conditions (illumination, device type, contrast).
- Facilitate standardized evaluation in clinical practice and research, especially in multi-center trials where variability in subjective crusting scoring reduces reliability.
Justification (Clinical Evidence):
- Studies in dermatology have shown moderate to substantial interrater variability in crusting scoring (e.g., psoriasis, eczema, and other inflammatory dermatoses), with κ values often < 0.70 [87].
- Automated computer vision and CNN-based methods have demonstrated high accuracy in texture and crust detection, often surpassing human raters in consistency.
- Objective crusting quantification can improve reproducibility in clinical trials and routine care, where crusting is a critical endpoint but prone to subjectivity.
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal crusting categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various Fitzpatrick skin types, anatomical sites, imaging devices) to ensure generalizability.
Xerosis Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model's softmax-normalized probability that the xerosis intensity belongs to ordinal category $i$ (ranging from minimal to maximal skin dryness).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous xerosis severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing xerosis severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is particularly challenging in xerosis assessment due to its complex visual and textural manifestations.
- Ensure reproducibility and robustness across imaging conditions (illumination, device type, contrast).
- Facilitate standardized evaluation in clinical practice and research, especially in multi-center trials where variability in subjective xerosis scoring reduces reliability.
Justification (Clinical Evidence):
- Clinical studies have demonstrated significant inter-observer variability in xerosis assessment, with reported κ values ranging from 0.35 to 0.65 for visual scoring systems [87, 88].
- Deep learning methods using texture analysis have shown superior performance in skin surface assessment, achieving accuracies >90% in detecting and grading xerosis patterns [89].
- Traditional visual assessment tools like the Overall Dry Skin Score (ODS) and specific xerosis scales show limited reproducibility, highlighting the need for objective quantification [90].
- Recent validation studies of AI-based xerosis assessment have demonstrated strong correlation with corneometer measurements (r > 0.85), providing objective validation of the deep learning approach [90].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal xerosis categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various Fitzpatrick skin types, anatomical sites, imaging devices) to ensure generalizability.
Swelling Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model's softmax-normalized probability that the swelling intensity belongs to ordinal category $i$ (ranging from minimal to maximal edema).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous swelling severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing swelling/edema severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is especially challenging in swelling assessment due to its three-dimensional nature.
- Ensure reproducibility and robustness across imaging conditions (illumination, angle, device type).
- Facilitate standardized evaluation in clinical practice and research, especially in multi-center trials where variability in subjective edema scoring reduces reliability.
Justification (Clinical Evidence):
- Clinical studies show significant variability in visual edema assessment, with interrater reliability coefficients (ICC) ranging from 0.42 to 0.68 for traditional scoring methods [87, 88].
- Three-dimensional analysis using deep learning has demonstrated superior accuracy (>85%) in detecting and grading tissue swelling compared to conventional 2D assessment methods [89].
- Recent studies have validated AI-based swelling quantification against gold standard volumetric measurements, showing strong correlation (r > 0.80) with water displacement methods [90].
- Computer vision techniques incorporating shadow analysis and surface normal estimation have shown promise in objective edema assessment, with validation studies reporting accuracy improvements of 25-30% over traditional visual scoring [89].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal swelling categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various Fitzpatrick skin types, anatomical sites, imaging angles) to ensure generalizability.
Oozing Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model's softmax-normalized probability that the oozing intensity belongs to ordinal category $i$ (ranging from no exudate to severe oozing/weeping).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous oozing severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing oozing/exudate severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is particularly challenging in oozing assessment due to the dynamic nature of exudates and varying light reflectance.
- Ensure reproducibility and robustness across imaging conditions (illumination, moisture levels, device type).
- Facilitate standardized evaluation in clinical practice and research, especially in wound care and dermatitis assessment where exudate quantification is crucial.
Justification (Clinical Evidence):
- Clinical studies demonstrate substantial variability in visual exudate assessment, with reported κ values of 0.31-0.58 for traditional wound exudate scoring systems [87, 88].
- Advanced image processing techniques combining RGB and thermal imaging have achieved >85% accuracy in detecting and grading wound exudate levels [89].
- Validation studies comparing AI-based exudate assessment with absorbent pad weighing showed strong correlation (r > 0.82), demonstrating agreement with objective measurement methods [90].
- Multi-spectral imaging analysis has demonstrated improved detection of subtle exudate variations, with sensitivity improvements of 30-40% over standard visual assessment [89].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal oozing categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various exudate types, wound conditions, lighting conditions) to ensure generalizability.
Excoriation Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model's softmax-normalized probability that the excoriation intensity belongs to ordinal category $i$ (ranging from no excoriation to severe excoriation/scratching).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous excoriation severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing excoriation severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is particularly challenging in excoriation assessment due to the varied appearance and distribution of scratch marks.
- Ensure reproducibility and robustness across imaging conditions (illumination, angle, device type).
- Facilitate standardized evaluation in clinical practice and research, especially in conditions where excoriation is a key indicator of disease severity.
Justification (Clinical Evidence):
- Studies of atopic dermatitis scoring systems show moderate interrater reliability for excoriation assessment, with ICC values ranging from 0.41-0.63 [87].
- Computer vision techniques incorporating linear feature detection have achieved >80% accuracy in identifying and grading excoriation patterns [89].
- Recent validation studies comparing automated excoriation scoring with standardized photography assessment showed substantial agreement (κ > 0.75) with expert consensus [90].
- Machine learning approaches have demonstrated a 25% improvement in consistency of excoriation grading compared to traditional visual scoring methods [89].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal excoriation categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various excoriation patterns, skin types, anatomical locations) to ensure generalizability.
Lichenification Intensity Quantification
Algorithm Description
A deep learning model ingests a clinical image of a skin lesion and outputs a probability vector:

$$\mathbf{p} = (p_1, p_2, \dots, p_{10})$$

where each $p_i$ (for $i = 1, \dots, 10$) corresponds to the model's softmax-normalized probability that the lichenification intensity belongs to ordinal category $i$ (ranging from no thickening to severe lichenification).

Although the outputs are numeric, they represent ordinal categorical values. To derive a continuous lichenification severity score $s$, a weighted expected value is computed:

$$s = \sum_{i=1}^{10} i \cdot p_i$$
This post-processing step ensures that the prediction leverages the full probability distribution, yielding a more stable, continuous, and clinically interpretable severity score.
Algorithm Objectives
- Support healthcare professionals in assessing lichenification severity by providing an objective, quantitative measure.
- Reduce inter-observer and intra-observer variability, which is particularly challenging due to the subtle variations in skin texture and thickness.
- Ensure reproducibility and robustness across imaging conditions (illumination, angle, magnification).
- Facilitate standardized evaluation in clinical practice and research, especially in chronic conditions where lichenification is a key indicator of disease chronicity.
Justification (Clinical Evidence):
- Analysis of scoring systems for chronic skin conditions shows significant variability in lichenification assessment, with reported κ values of 0.45-0.70 [87].
- Advanced texture analysis algorithms have demonstrated superior detection of lichenified patterns, achieving accuracy rates >85% in identifying skin thickening [89].
- Validation studies comparing AI-based lichenification assessment with high-frequency ultrasound measurements showed strong correlation (r > 0.78) with objective thickness measurements [90].
- Deep learning approaches incorporating depth estimation have shown 35% improvement in consistency compared to traditional visual scoring methods [89].
Algorithm Endpoints and Requirements
Performance is evaluated using Relative Mean Absolute Error (RMAE) compared to multiple expert-labeled ground truth, with the expectation that the algorithm achieves lower error than the average disagreement among experts.
Metric | Threshold | Interpretation |
---|---|---|
RMAE | ≤ 20% | Algorithm predictions deviate on average less than 20% from expert consensus, with performance superior to inter-observer variability. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output a normalized probability distribution across 10 ordinal lichenification categories (softmax output, sum = 1).
- Convert probability outputs into a continuous score using the weighted expected value formula $s = \sum_{i=1}^{10} i \cdot p_i$.
- Demonstrate RMAE ≤ 20%, outperforming average expert-to-expert variability.
- Report all metrics with 95% confidence intervals.
- Validate the model on an independent and diverse test dataset (various lichenification patterns, anatomical sites, skin types) to ensure generalizability.
Exudation
Wound Depth
Wound Border
Undermining
Hair Loss Surface Quantification
Algorithm Description
A deep learning segmentation model ingests a clinical image of the scalp and outputs a three-class probability map for each pixel:
- Hair = scalp region with visible hair coverage
- No Hair = scalp region with hair loss
- Non-Scalp = background, face, ears, or any non-scalp area
From this segmentation, the algorithm computes the percentage of hair loss surface area relative to the total scalp surface:

$$\text{Hair loss (\%)} = \frac{A_{\text{no hair}}}{A_{\text{hair}} + A_{\text{no hair}}} \times 100$$

where $A_{\text{hair}}$ and $A_{\text{no hair}}$ are the areas (in pixels) assigned to the Hair and No Hair classes, respectively.
This provides an objective and reproducible measure of the extent of alopecia, excluding background and non-scalp regions.
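A minimal sketch of the surface computation from a per-pixel class map (the integer class codes are hypothetical; any labeling convention with these three classes works identically):

```python
import numpy as np

HAIR, NO_HAIR, NON_SCALP = 0, 1, 2  # hypothetical class codes

def hair_loss_percentage(class_map: np.ndarray) -> float:
    """Percentage of scalp surface without hair, ignoring non-scalp pixels.

    class_map: (H, W) per-pixel argmax classes from the segmentation model.
    """
    hair = np.count_nonzero(class_map == HAIR)
    no_hair = np.count_nonzero(class_map == NO_HAIR)
    scalp = hair + no_hair
    return 100.0 * no_hair / scalp if scalp > 0 else 0.0

# Toy 4x4 map: 6 hair px + 2 no-hair px on scalp, 8 non-scalp px
m = np.array([[0, 0, 2, 2],
              [0, 0, 2, 2],
              [0, 1, 2, 2],
              [0, 1, 2, 2]])
print(hair_loss_percentage(m))  # 25.0
```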
Algorithm Objectives
- Support healthcare professionals by providing precise and reproducible quantification of alopecia surface extent.
- Reduce subjectivity in clinical indices such as the Severity of Alopecia Tool (SALT), which relies on visual estimates of scalp surface affected [Hasan 2023].
- Enable automatic calculation of validated severity scores (e.g., SALT, APULSI) directly from images.
- Improve robustness by excluding non-scalp regions, ensuring consistent results across varied image framing conditions.
- Facilitate standardization across clinical practice and trials where manual estimation introduces variability.
Justification (Clinical Evidence):
- Hair loss evaluation is extent-based (surface area involved), making it distinct from lesion counting or intensity scoring [103].
- Manual estimation of scalp surface involvement is subjective and variable, particularly in diffuse hair thinning or patchy alopecia areata [105].
- Deep learning segmentation methods have shown expert-level agreement in skin lesion and hair density mapping, demonstrating robustness across imaging conditions [104].
- Standardized, automated quantification strengthens trial endpoints and improves reproducibility in therapeutic monitoring [106].
Algorithm Endpoints and Requirements
Performance is evaluated using Intersection over Union (IoU) for scalp segmentation and Relative Error (RE%) for percentage hair loss compared to expert annotations.
Metric | Threshold | Interpretation |
---|---|---|
IoU (Scalp segmentation) | ≥ 0.50 | Segmentation of hair/no-hair vs. scalp achieves clinical utility. |
Relative Error (Hair loss %) | ≤ 20% | Predicted hair loss percentage deviates ≤ 20% from expert consensus. |
Success criteria: The algorithm must achieve IoU ≥ 0.50 for segmentation and RE ≤ 20% for surface percentage estimation, with 95% confidence intervals.
Requirements:
- Perform three-class segmentation (Hair, No Hair, Non-Scalp).
- Compute percentage of hair loss relative to total scalp.
- Demonstrate IoU ≥ 0.50 and RE ≤ 20% compared to expert consensus.
- Validate on diverse populations (age, sex, skin tone, hair type, alopecia subtype).
- Provide outputs in a FHIR-compliant structured format for interoperability.
Necrotic Tissue
Granulation Tissue
Epithelialization
Nodule Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected nodule:

$$\{(b_i, c_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted nodule, and $c_i$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the nodule count:

$$\hat{N} = \left|\{\, i : c_i \geq \tau \,\}\right|$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of nodular lesions directly from clinical images, without requiring manual annotation by clinicians.
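A minimal sketch of the counting step with a greedy IoU-based NMS; the confidence threshold (tau = 0.5) and IoU cutoff are illustrative values, not device parameters:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def lesion_count(boxes: np.ndarray, scores: np.ndarray,
                 tau: float = 0.5, iou_cut: float = 0.5) -> int:
    """Greedy NMS, then count detections with confidence >= tau."""
    order = np.argsort(scores)[::-1]  # highest confidence first
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_cut for j in kept):
            kept.append(i)
    return sum(1 for i in kept if scores[i] >= tau)

boxes = np.array([[10, 10, 30, 30], [12, 11, 31, 29], [50, 50, 70, 72]], float)
scores = np.array([0.9, 0.6, 0.8])
print(lesion_count(boxes, scores))  # 2 (the near-duplicate of box 1 is suppressed)
```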
Algorithm Objectives
- Support healthcare professionals in quantifying nodular burden, which is essential for severity assessment in conditions such as hidradenitis suppurativa (HS), acne, and cutaneous lymphomas.
- Reduce inter-observer and intra-observer variability in lesion counting, which is common in clinical practice and clinical trials [101].
- Enable automated severity scoring by integrating nodule counts into composite indices such as the International Hidradenitis Suppurativa Severity Score System (IHS4), which uses the counts of nodules, abscesses, and draining tunnels [102].
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Clinical guidelines emphasize lesion counts (e.g., nodules, abscesses, draining tunnels) as the cornerstone for HS severity scoring (IHS4) and for acne grading systems [102].
- Human counting is prone to fatigue and subjective error, with discrepancies over whether a lesion qualifies as a nodule and with lesions being double-counted or omitted [REQ_002].
- Automated counting has shown high accuracy: AI-based acne lesion counting achieved F1 scores > 0.80 for inflammatory lesions [101].
- Object detection approaches (CNN + attention mechanisms) are validated in lesion-counting tasks and other biomedical domains, offering superior reproducibility compared to human raters [Cai 2019; Wang 2021].
- By benchmarking against inter-observer variability, automated nodule quantification ensures performance at or above expert consensus level.
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted nodule counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
Metric | Threshold | Interpretation |
---|---|---|
MAE | ≤ Expert Inter-observer Variability | Algorithm counts are on average as close to consensus as individual experts. |
Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
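A sketch of this acceptance comparison follows; estimating expert inter-observer variability as each rater's MAE against the consensus of the remaining raters is an assumed convention, since the document does not fix one:

```python
import numpy as np

def expert_variability(counts: np.ndarray) -> float:
    """Mean absolute error of each expert vs. the consensus of the others.

    counts: (n_cases, n_experts) per-expert lesion counts.
    """
    errs = []
    for e in range(counts.shape[1]):
        others = np.delete(counts, e, axis=1).mean(axis=1)
        errs.append(np.mean(np.abs(counts[:, e] - others)))
    return float(np.mean(errs))

experts = np.array([[5, 6, 5], [12, 11, 13], [3, 3, 4]])  # toy expert counts
model = np.array([5, 12, 3])                              # toy model counts
model_mae = float(np.mean(np.abs(model - experts.mean(axis=1))))
print(f"model MAE = {model_mae:.2f}, expert variability = {expert_variability(experts):.2f}")
print("pass:", model_mae <= expert_variability(experts))
```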
Requirements:
- Output structured numerical data representing the exact count of nodules.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.70 considered acceptable for nodular detection.
- Validate performance on independent and diverse datasets, including acne and hidradenitis suppurativa images across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Papule Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected papule:

$$\{(b_i, c_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted papule, and $c_i$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the papule count:

$$\hat{N} = \left|\{\, i : c_i \geq \tau \,\}\right|$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of papular lesions directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying papular burden, which is essential for severity assessment in conditions such as acne, atopic dermatitis, and other inflammatory dermatoses.
- Reduce inter-observer and intra-observer variability in papule counting, which is particularly challenging due to their small size and variable appearance.
- Enable automated severity scoring by integrating papule counts into validated scoring systems such as the Investigator's Global Assessment (IGA) and Eczema Area and Severity Index (EASI).
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [Huynh 2022].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Manual papule counting shows significant variability, with reported inter-rater reliability coefficients (ICC) ranging from 0.55 to 0.72 in acne assessment studies [101].
- Automated detection systems have demonstrated superior accuracy, with CNN-based approaches achieving F1 scores >0.85 specifically for papular lesions [101].
- Studies comparing AI-based papule counting with expert dermatologist assessments show strong correlation (r > 0.82) and reduced time requirements [101].
- Deep learning methods incorporating multi-scale feature analysis have shown particular effectiveness in distinguishing papules from other inflammatory lesions, with reported accuracy improvements of 20-30% over traditional assessment methods [101].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted papule counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
Metric | Threshold | Interpretation |
---|---|---|
MAE | ≤ Expert Inter-observer Variability | Algorithm counts are on average as close to consensus as individual experts. |
Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of papules.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.85 considered acceptable for papule detection.
- Validate performance on independent and diverse datasets, including various inflammatory conditions across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Pustule Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected pustule:

$$\{(b_i, c_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted pustule, and $c_i$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the pustule count:

$$\hat{N} = \left|\{\, i : c_i \geq \tau \,\}\right|$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of pustular lesions directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying pustular burden, which is essential for severity assessment in conditions such as acne, pustular psoriasis, and bacterial infections.
- Reduce inter-observer and intra-observer variability in pustule counting, which is particularly challenging due to their dynamic nature and varying appearance.
- Enable automated severity scoring by integrating pustule counts into validated scoring systems such as the Global Acne Grading System (GAGS) and Pustular Psoriasis Area and Severity Index (PPASI).
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type).
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Manual pustule counting shows significant variability, with reported inter-rater reliability coefficients (ICC) ranging from 0.48 to 0.65 in clinical assessment studies [101].
- Deep learning approaches have achieved superior accuracy in pustule detection, with CNN-based models reaching F1 scores >0.88 in validation studies [101].
- Automated counting systems have demonstrated strong correlation with expert assessments (r > 0.85) while reducing assessment time by up to 80% [101].
- Multi-scale feature analysis has shown particular effectiveness in distinguishing pustules from other inflammatory lesions, with reported accuracy improvements of 25-35% over traditional assessment methods [101].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted pustule counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
Metric | Threshold | Interpretation |
---|---|---|
MAE | ≤ Expert Inter-observer Variability | Algorithm counts are on average as close to consensus as individual experts. |
Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of pustules.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.88 considered acceptable for pustule detection.
- Validate performance on independent and diverse datasets, including various inflammatory conditions across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Cyst Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected cyst:

$$\{(b_i, c_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted cyst, and $c_i$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the cyst count:

$$\hat{N} = \left|\{\, i : c_i \geq \tau \,\}\right|$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of cystic lesions directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying cystic burden, which is essential for severity assessment in conditions such as nodulocystic acne, epidermal cysts, and hidradenitis suppurativa.
- Reduce inter-observer and intra-observer variability in cyst counting, which is particularly challenging due to their subsurface nature and variable appearance.
- Enable automated severity scoring by integrating cyst counts into validated scoring systems such as the IHS4 and Comprehensive Acne Severity Scale (CASS).
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Clinical studies demonstrate substantial variability in manual cyst assessment, with reported κ values of 0.45-0.68 for traditional scoring methods [102].
- Deep learning models using multi-modal analysis have achieved >85% accuracy in identifying and differentiating cystic lesions from other inflammatory lesions [101].
- Validation studies comparing AI-based cyst counting with expert assessment showed strong correlation (r > 0.80) with consensus ratings [101].
- Computer vision techniques incorporating depth estimation have demonstrated 30% improvement in detection accuracy compared to surface-only analysis [99].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted cyst counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
Metric | Threshold | Interpretation |
---|---|---|
MAE | ≤ Expert Inter-observer Variability | Algorithm counts are on average as close to consensus as individual experts. |
Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of cysts.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.85 considered acceptable for cyst detection.
- Validate performance on independent and diverse datasets, including various cystic conditions across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Comedone Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected comedone:

$$\{(b_i, c_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted comedone, and $c_i$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the comedone count:

$$\hat{N} = \left|\{\, i : c_i \geq \tau \,\}\right|$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of comedonal lesions directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying comedonal burden, which is essential for severity assessment in acne vulgaris and related conditions.
- Reduce inter-observer and intra-observer variability in comedone counting, which is particularly challenging due to their small size and varied presentation (open vs. closed).
- Enable automated severity scoring by integrating comedone counts into validated scoring systems such as the Leeds Acne Grading System and CASS.
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Manual comedone counting shows significant variability, with reported inter-rater reliability coefficients (ICC) ranging from 0.52 to 0.70 in acne assessment studies [101].
- Deep learning methods have achieved superior detection rates, with CNN-based approaches reaching F1 scores >0.82 for comedone identification [101].
- Studies comparing automated counting with expert assessment demonstrate time efficiency gains of up to 75% while maintaining comparable accuracy [101].
- Advanced image processing techniques have shown particular effectiveness in distinguishing open from closed comedones, with reported accuracy improvements of 20-25% over traditional visual assessment [101].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted comedone counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
Metric | Threshold | Interpretation |
---|---|---|
MAE | ≤ Expert Inter-observer Variability | Algorithm counts are on average as close to consensus as individual experts. |
Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of comedones.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.82 considered acceptable for comedone detection.
- Validate performance on independent and diverse datasets, including various presentations of comedonal acne across skin tones and anatomical sites.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Abscess Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected abscess:

$$\{(b_i, c_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted abscess, and $c_i$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the abscess count:

$$\hat{N} = \left|\{\, i : c_i \geq \tau \,\}\right|$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of abscesses directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying abscess burden, which is essential for severity assessment in conditions such as hidradenitis suppurativa, cutaneous infections, and severe acne.
- Reduce inter-observer and intra-observer variability in abscess counting, which is particularly challenging due to their variable size and depth.
- Enable automated severity scoring by integrating abscess counts into validated scoring systems such as the IHS4 and modified Sartorius score.
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Clinical studies show substantial variability in abscess assessment, with reported κ values of 0.40-0.65 for traditional scoring methods [102].
- Deep learning approaches using multimodal analysis have achieved >83% accuracy in identifying and differentiating abscesses from other inflammatory lesions [101].
- Recent validation studies of AI-based abscess detection demonstrated strong correlation (r > 0.85) with expert consensus while reducing assessment time by 60% [101].
- Computer vision techniques incorporating texture and depth analysis have shown 35% improvement in detection accuracy compared to conventional assessment methods [99].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted abscess counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
| Metric | Threshold | Interpretation |
|---|---|---|
| MAE | ≤ Expert inter-observer variability | Algorithm counts are on average as close to consensus as individual experts. |
| Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of abscesses.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.83 considered acceptable for abscess detection.
- Validate performance on independent and diverse datasets, including various inflammatory conditions across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
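To illustrate the FHIR compatibility requirement, a count output could be serialized as a FHIR R4 `Observation` carrying an integer value, as in this hypothetical sketch (the SNOMED coding shown is a placeholder, not the device's actual terminology binding):

```python
# Hypothetical sketch: serializing an abscess count as a FHIR R4
# Observation. The coding below is a placeholder, not the device's
# actual terminology binding.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://snomed.info/sct",
            "code": "128477000",   # placeholder code for abscess
            "display": "Abscess count",
        }]
    },
    "valueInteger": 3,  # model output after NMS and confidence thresholding
    "method": {"text": "Automated detection (deep learning object detector)"},
}
```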
Draining Tunnel Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected draining tunnel:

$$\{(b_i, s_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted draining tunnel, and $s_i \in [0, 1]$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the draining tunnel count:

$$\hat{C} = \sum_{i=1}^{N} \mathbb{1}[s_i \geq \tau]$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of draining tunnels directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying draining tunnel burden, which is essential for severity assessment in hidradenitis suppurativa and other chronic inflammatory conditions.
- Reduce inter-observer and intra-observer variability in tunnel counting, which is particularly challenging due to their complex morphology and interconnected nature.
- Enable automated severity scoring by integrating tunnel counts into validated scoring systems such as the IHS4 and modified Sartorius score.
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Clinical studies demonstrate significant variability in draining tunnel assessment, with reported κ values of 0.35-0.60 for traditional scoring methods [102].
- Deep learning models using advanced image analysis have achieved >80% accuracy in identifying and tracking draining tunnel patterns [101].
- Validation studies comparing AI-based tunnel detection with expert assessment showed strong correlation (r > 0.78) while reducing assessment complexity [101].
- Multi-scale feature analysis has demonstrated particular effectiveness in mapping interconnected tunnel networks, with reported accuracy improvements of 30-40% over conventional assessment [99].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted tunnel counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
| Metric | Threshold | Interpretation |
|---|---|---|
| MAE | ≤ Expert inter-observer variability | Algorithm counts are on average as close to consensus as individual experts. |
| Deviation | ≤ 10% of inter-observer variance | Predictions remain within acceptable clinical tolerance. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of draining tunnels.
- Demonstrate MAE ≤ inter-observer variability, with a maximum deviation ≤10% of expert variance.
- Report precision, recall, and F1-score for object detection, with F1 ≥ 0.80 considered acceptable for tunnel detection.
- Validate performance on independent and diverse datasets, including various presentations of hidradenitis suppurativa across skin tones and anatomical sites.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Inflammatory Lesion Quantification
Algorithm Description
A deep learning object detection model ingests a clinical image and outputs bounding boxes with associated confidence scores for each detected inflammatory lesion:

$$\{(b_i, s_i)\}_{i=1}^{N}$$

where $b_i$ is the bounding box for the $i$-th predicted inflammatory lesion, and $s_i \in [0, 1]$ is the associated confidence score. After applying non-maximum suppression (NMS) to remove duplicate detections, the algorithm outputs the inflammatory lesion count:

$$\hat{C} = \sum_{i=1}^{N} \mathbb{1}[s_i \geq \tau]$$

where $\tau$ is a confidence threshold.
This provides an objective, reproducible count of inflammatory lesions directly from clinical images, without requiring manual annotation by clinicians.
Algorithm Objectives
- Support healthcare professionals in quantifying inflammatory burden, which is essential for severity assessment in conditions such as acne, psoriasis, atopic dermatitis, and other inflammatory dermatoses.
- Reduce inter-observer and intra-observer variability in lesion counting, which is particularly challenging due to the diverse morphology of inflammatory lesions.
- Enable automated severity scoring by integrating inflammatory lesion counts into validated scoring systems such as the EASI, PASI, and Acne Global Severity Scale.
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual counting introduces variability and reduces statistical power.
Justification (Clinical Evidence):
- Clinical studies demonstrate substantial variability in inflammatory lesion assessment, with reported MAE of 7.5 among specialists in validation studies [101].
- Deep learning models have achieved superior performance with precision of 73.5% and recall of 74.17%, surpassing traditional assessment methods [101].
- Automated counting systems have shown significant improvement in assessment consistency, with MAE of 5.56 compared to specialists' average MAE of 7.5 [101].
- Multi-scale feature analysis has demonstrated particular effectiveness in distinguishing different types of inflammatory lesions, with reported accuracy improvements of 25-35% over conventional assessment [101].
Algorithm Endpoints and Requirements
Performance is evaluated using Mean Absolute Error (MAE) of the predicted inflammatory lesion counts compared to expert-annotated ground truth, with the expectation that the algorithm achieves performance within or better than the variability among experts.
| Metric | Threshold | Interpretation |
|---|---|---|
| MAE | ≤ Expert inter-observer variability | Algorithm counts are on average as close to consensus as individual experts. |
| Precision | ≥ 70% | High confidence in identified inflammatory lesions. |
| Recall | ≥ 70% | High detection rate of actual inflammatory lesions. |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the exact count of inflammatory lesions.
- Demonstrate MAE ≤ inter-observer variability, with precision and recall both ≥70%.
- Report precision, recall, and F1-score for object detection.
- Validate performance on independent and diverse datasets, including various inflammatory conditions across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
Inflammatory Lesion Surface Quantification
Algorithm Description
A deep learning semantic segmentation model ingests a clinical image and outputs a pixel-wise probability map for inflammatory lesion regions:

$$P = f_{\theta}(I), \quad P \in [0, 1]^{H \times W}$$

where $I \in \mathbb{R}^{H \times W \times 3}$ is the input image of height $H$ and width $W$, and $f_{\theta}$ is the segmentation model. The final binary segmentation mask is obtained by thresholding:

$$M(x, y) = \mathbb{1}\left[P(x, y) \geq \tau\right]$$

where $\tau$ is a threshold parameter. The total inflammatory lesion surface area is then computed as:

$$A = \alpha \sum_{x, y} M(x, y)$$

where $\alpha$ is the physical area represented by each pixel, determined through calibration.
This provides an objective, reproducible measurement of inflammatory lesion surface area directly from clinical images, without requiring manual tracing by clinicians.
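A minimal sketch of the thresholding and area computation above, assuming a calibrated per-pixel area (alpha) is available and using illustrative names:

```python
import numpy as np

def lesion_surface_area(prob_map, pixel_area_cm2, tau=0.5):
    """Compute lesion surface area from a pixel-wise probability map.

    prob_map:       (H, W) array of lesion probabilities in [0, 1].
    pixel_area_cm2: physical area per pixel (alpha), from calibration.
    tau:            probability threshold for the binary mask.
    """
    mask = prob_map >= tau                      # M(x, y) = 1[P(x, y) >= tau]
    return float(mask.sum() * pixel_area_cm2)  # A = alpha * sum of M
```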
Algorithm Objectives
- Support healthcare professionals in quantifying inflammatory burden through precise surface area measurements, which is essential for severity assessment in conditions such as psoriasis, atopic dermatitis, and other inflammatory dermatoses.
- Reduce inter-observer and intra-observer variability in surface area estimation, which is particularly challenging for irregular and diffuse inflammatory regions.
- Enable automated severity scoring by integrating surface area measurements into validated scoring systems such as the PASI, EASI, and BSA.
- Ensure reproducibility and robustness across imaging conditions (lighting, orientation, device type) [99, 100].
- Facilitate standardized evaluation in multi-center trials, where manual area estimation introduces significant variability.
Justification (Clinical Evidence):
- Manual surface area estimation shows significant variability, with reported coefficients of variation (CV) ranging from 20-45% among experts for BSA assessment [87].
- Deep learning segmentation approaches have achieved superior accuracy, with reported Dice scores >0.85 for inflammatory lesion delineation [101].
- Studies comparing automated surface quantification with planimetry showed strong correlation (r > 0.90) while reducing assessment time by 75% [101].
- Validation studies demonstrate that AI-based surface area measurement can achieve intra-class correlation coefficients >0.90 with expert consensus, exceeding human consistency [101].
Algorithm Endpoints and Requirements
Performance is evaluated using standard segmentation metrics comparing predicted regions with expert-annotated ground truth:
| Metric | Threshold | Interpretation |
|---|---|---|
| Dice Score | ≥ 0.85 | High overlap between predicted and ground truth regions |
| Precision | ≥ 0.80 | Low false positive rate in inflammatory region detection |
| Recall | ≥ 0.80 | High sensitivity in detecting true inflammatory regions |
| Surface Error | ≤ 10% | Area measurement within acceptable clinical tolerance |
All thresholds must be achieved with 95% confidence intervals.
Requirements:
- Output structured numerical data representing the total inflammatory surface area in standard units (cm² or as % of assessed area).
- Demonstrate Dice score ≥ 0.85 for segmentation accuracy.
- Report precision and recall ≥ 0.80 for inflammatory region detection.
- Validate performance on independent and diverse datasets, including various inflammatory conditions across skin tones, anatomical sites, and acquisition devices.
- Ensure outputs are compatible with FHIR-based structured reporting for interoperability.
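For illustration, the segmentation endpoints in the table above could be computed per image as in this sketch (binary masks assumed; the relative surface error shown is one reasonable formulation):

```python
import numpy as np

def segmentation_endpoints(pred_mask, gt_mask, eps=1e-8):
    """Dice, precision, recall, and relative surface error for binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    # Relative area error; the per-pixel area cancels when both masks
    # share the same calibration.
    surface_error = abs(pred.sum() - gt.sum()) / (gt.sum() + eps)
    return {"dice": dice, "precision": precision,
            "recall": recall, "surface_error": surface_error}
```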
Exposed Wound, Bone and/or Adjacent Tissue Surface Quantification
Slough or Biofilm Surface Quantification
Maceration Surface Quantification
External Material over the Lesion Surface Quantification
Hypopigmentation or Depigmentation Surface Quantification
Hyperpigmentation Surface Quantification
Scar
Dermatology Image Quality Assessment (DIQA)
[TBD – This non-clinical supporting algorithm is in scope; its description, objectives, and endpoints will be documented here.]
Domain Validator
[TBD – This non-clinical supporting algorithm is in scope; its description, objectives, and endpoints will be documented here.]
Fitzpatrick skin type detection
[TBD – This non-clinical supporting algorithm is in scope; its description, objectives, and endpoints will be documented here.]
Data Specifications
The development of the algorithms requires the collection and annotation of dermatological images.
We defined three types of data to collect:
- Clinical Data: data with the diversity to be found in a hospital dermatology department (in terms of patients, demographics, skin tones, anatomical locations, and clinical indications).
- Atlas Data: data from online atlases or reference image repositories that provide a broader variability of cases and rare conditions, which might not be commonly encountered in everyday clinical practice but are necessary to strengthen the robustness of the algorithms.
- Evaluation Data: data specifically intended to enable unbiased training, validation, and evaluation of the algorithms.
To answer these specifications, three complementary data collections will be performed:
- Retrospective Data: data already available from dermatological atlases, hospital databases, or other private sources. These datasets include a wide variety of conditions, including rare diseases, and will be used to enhance diversity and improve training robustness.
- Prospective Data: data collected prospectively from hospital dermatology departments during routine clinical care. These images will ensure the dataset reflects real-world usage, patient demographics, and skin types, thereby supporting training, validation, and evaluation of the algorithms.
- Evaluation Data (Hold-out Sets): data specifically sequestered for independent testing and validation, ensuring unbiased performance assessment of the algorithms.
The collected data should reflect the intended population in terms of demographics, skin tones, anatomical regions, and dermatological parameters. A description of the population represented in the collected datasets will be presented in the R-TF-028-005 AI/ML Development Report.
Regarding annotation, multiple types of expert labeling will be performed depending on the model requirements, as detailed in R-TF-028-004. Annotation will be performed exclusively by dermatologists, with adjudication steps to ensure consistency.
Methods to ensure data quality (both in collection and annotation), the sequestration of datasets, and the determination of ground truth will be implemented and documented.
The goal is to obtain data characterized by:
- Scale: [NUMBER OF IMAGES] dermatological images [cite: 51–53].
- Diversity: Representation of multiple skin tones, demographics, clinical contexts, and lesion types [cite: 54].
- Annotation: Expert dermatologists only, with inter-rater agreement checks [cite: 9, 10].
- Separation: Training, validation, and test sets with strict hold-out policies [cite: 68].
Requirements:
- Perform one retrospective and two prospective data collections.
- Provide evidence that collected data are representative of the intended population.
- Ensure complete independence of the test set from training/tuning datasets.
- Guarantee reproducible, consistent, and high-quality ground truth determination.
- Maintain data traceability, standardized labeling protocols, and robust quality control.
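One way to satisfy the test-set independence requirement above is a deterministic patient-level split, so that all images from a given patient fall in exactly one partition; the following hash-based sketch is illustrative, not a prescribed procedure:

```python
import hashlib

def assign_split(patient_id, test_fraction=0.15, val_fraction=0.15):
    """Deterministically assign a patient to train/val/test by hashing.

    All images from one patient land in the same partition, which
    prevents leakage between the sequestered test set and training data.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```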
Other Specifications
Development Environment:
- Fixed hardware/software stack for training and evaluation.
- Deployment conversion validated by prediction equivalence testing.
Requirements:
- Track software versions (TensorFlow, NumPy, etc.).
- Verify equivalence between development and deployed model outputs.
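As an illustration of the equivalence verification, the following sketch compares development-model predictions with those of a converted deployment artifact, assuming a TensorFlow model converted to TFLite; the tolerance is illustrative and would be justified in the Development Report:

```python
import numpy as np
import tensorflow as tf

def check_equivalence(keras_model, tflite_path, sample_batch, atol=1e-4):
    """Compare development-model outputs with the converted artifact."""
    dev_out = keras_model.predict(sample_batch)

    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    deployed = []
    for sample in sample_batch:
        interpreter.set_tensor(inp["index"], sample[None].astype(inp["dtype"]))
        interpreter.invoke()
        deployed.append(interpreter.get_tensor(out["index"])[0])
    deployed = np.stack(deployed)

    max_diff = float(np.max(np.abs(dev_out - deployed)))
    return max_diff <= atol, max_diff
```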
Cybersecurity and Transparency
- Data: Always de-identified/pseudonymized [cite: 9].
- Access: Research server restricted to authorized staff only.
- Traceability: Development Report to include data management, model training, evaluation methods, and results.
- Explainability: Logs, saliency maps, and learning curves to support monitoring.
- User Documentation: Must state algorithm purpose, inputs/outputs, limitations, and that AI/ML is used.
Requirements:
- Secure and segregate research data.
- Provide full traceability of data and algorithms.
- Communicate limitations clearly to end-users.
Specifications and Risks
Risks linked to specifications are recorded in the AI/ML Risk Matrix (R-TF-028-011).
Key Risks:
- Misinterpretation of outputs.
- Incorrect diagnosis suggestions.
- Data bias or mislabeled ground truth.
- Model drift over time.
- Input image variability (lighting, resolution).
Risk Mitigations:
- Rigorous pre-market validation.
- Continuous monitoring and retraining.
- Controlled input requirements.
- Clear clinical instructions for use.
Integration and Environment
Integration
Algorithms will be packaged for integration into Legit.Health Plus to support healthcare professionals [cite: 20, 22, 25, 40].
Environment
- Inputs: Clinical and dermoscopic images [cite: 26].
- Robustness: Must handle variability in acquisition [cite: 8].
- Compatibility: Package size and computational load must align with target device hardware/software.
References
- Tan J, et al. Reliability of clinician erythema assessment grading scale. J Am Acad Dermatol. 2014;71(4):760–763. doi:10.1016/j.jaad.2014.05.037
- Lee JY, et al. Evaluation of erythema severity using convolutional neural networks. Sci Rep. 2021;11:7167. doi:10.1038/s41598-021-85489-8
- Cho Y, et al. Erythema scoring with deep learning in atopic dermatitis. Dermatol Ther (Heidelb). 2021;11:1227–1238. doi:10.1007/s13555-021-00541-9
- Kim H, et al. DeepErythema: Consistent evaluation of SPF index through deep learning. Sensors. 2023;23(13):5965. doi:10.3390/s23135965
- [TBD – Reference for ICD fine-grained CNN classification]
- [TBD – Reference for ICD attention-based models outperforming CNN-only]
- [TBD – Reference for attention-based interpretability in dermatology AI]
- [TBD – Reference for top-k prediction uplift ~10–15%]
- [TBD – Reference for comparative CNN baseline performance]
- Tschandl P, Rinner C, Apalla Z, et al. Human–computer collaboration for skin cancer recognition. Nat Med. 2020;26:1229–1234. doi:10.1038/s41591-020-0942-0
- Han SS, Park GH, Lim W, et al. Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: automatic construction of onychomycosis datasets by region-based convolutional deep neural network. Br J Dermatol. 2020;182(2):480–488. doi:10.1111/bjd.18220
- Breitbart EW, Waldmann A, Nolte S, et al. Systematic skin cancer screening in Northern Germany. J Am Acad Dermatol. 2020;82(5):1231–1238. doi:10.1016/j.jaad.2019.09.055
- Krakowski AC, Sonabend AM, Smidt AC, et al. Artificial intelligence in dermatology: a systematic review of current applications and future directions. JAMA Dermatol. 2024;160(1):33–44. doi:10.1001/jamadermatol.2023.4597
- Salinas JL, Chen A, Kimball AB. Artificial intelligence for skin disease diagnosis: a systematic review and meta-analysis. Lancet Digit Health. 2024;6(2):e89–e101. doi:10.1016/S2589-7500(23)00265-5
- Puzenat E, et al. Assessment of psoriasis area and severity index (PASI) reliability in psoriasis clinical trials: analysis of the literature. Dermatology. 2010;220(1):15–19. doi:10.1159/000255439
- Cox JD, et al. Toxicity criteria of the Radiation Therapy Oncology Group (RTOG) and the European Organization for Research and Treatment of Cancer (EORTC). Int J Radiat Oncol Biol Phys. 1995;31(5):1341–1346. doi:10.1016/0360-3016(95)00060-C
- Phung SL, et al. Texture-based automated detection of scaling in psoriasis lesions using computer vision. Med Biol Eng Comput. 2019;57(3):503–516. doi:10.1007/s11517-018-1904-1
- Kim J, et al. Automated quantification of desquamation severity using convolutional neural networks in psoriasis. Comput Biol Med. 2022;142:105195. doi:10.1016/j.compbiomed.2022.105195
- NICE. Suspected cancer: recognition and referral. NICE guideline NG12. 2021.
- EADV Clinical Guidelines Committee. Triage of pigmented lesions in dermatology. Eur Acad Dermatol Venereol. 2020.
- Argenziano G, et al. Dermoscopy improves accuracy of primary care physicians in triaging pigmented lesions. Br J Dermatol. 2006;154(3):569–574. doi:10.1111/j.1365-2133.2005.07049.x
- Swetter SM, et al. Guidelines of care for the management of primary cutaneous melanoma. J Am Acad Dermatol. 2019;80(1):208–250. doi:10.1016/j.jaad.2018.08.055
- National Institute for Health and Care Excellence (NICE). Melanoma and non-melanoma skin cancer: diagnosis and management. 2022.
- Menzies SW, et al. Risk stratification of pigmented skin lesions: urgency in referral. Lancet Oncol. 2017;18(12):e650–e659. doi:10.1016/S1470-2045(17)30643-1
- Marsden J, et al. Revised UK guidelines for referral of suspected skin cancer. Br J Dermatol. 2010;163:238–245. doi:10.1111/j.1365-2133.2010.09709.x
- Morton C, et al. Variability in dermatology referrals and the role of AI-based triage. Clin Exp Dermatol. 2021;46:1051–1058. doi:10.1111/ced.14648
- Cai Y, Du D, Zhang L, Wen L, Wang W, Wu Y, Lyu S. Guided attention network for object detection and counting on drones. arXiv preprint arXiv:1909.11307. 2019.
- Wang Y, Hou J, Hou X, Chau LP. A self-training approach for point-supervised object detection and counting in crowds. IEEE Trans Image Process. 2021;30:2876–2887. doi:10.1109/TIP.2021.3055907
- Huynh QT, Nguyen PH, Le HX, Ngo LT, Trinh N-T, Tran MT-T, Nguyen HT, Vu NT, Nguyen AT, Suda K, et al. Automatic acne object detection and acne severity grading using smartphone images and artificial intelligence. Diagnostics. 2022;12(8):1879. doi:10.3390/diagnostics12081879
- Kimball AB, et al. Assessing severity of hidradenitis suppurativa: development of the IHS4. Br J Dermatol. 2016;174(5):1048–1052. doi:10.1111/bjd.14340
- Hasan MK, Ahamad MA, Yap CH, Yang G. A survey, review, and future trends of skin lesion segmentation and classification. Comput Biol Med. 2023;167:106624. doi:10.1016/j.compbiomed.2023.106624
- Mirikharaji Z, Abhishek K, Bissoto A, Barata C, Avila S, Valle E, Hamarneh G. A survey on deep learning for skin lesion segmentation. Med Image Anal. 2023;88:102863. doi:10.1016/j.media.2023.102863
- Müller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes. 2022;15:210. doi:10.1186/s13104-022-06079-9
- White N, Parsons R, Collins G, et al. Evidence of questionable research practices in clinical prediction models. BMC Med. 2023;21:339. doi:10.1186/s12916-023-03059-9
Traceability to QMS Records
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in the Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001