REQ_001 The user receives quantifiable data on the intensity of clinical signs
Category
Major
Source
- Tania Menendez, Digital Manager at Ribera Salud
- Dr. Marta Ruano, dermatologist at Hospital de Torrejón
- Dr. Ramon Grimalt, dermatologist at Grimalt Dermatología
Activities generated
- MDS-99
- MDS-173
- MDS-408
- MDS-169
Causes failure modes
- The AI models misinterpret the clinical signs in the images or miscalculate the intensity of clinical signs, leading to inaccurate data being presented.
- Poor quality or improperly taken images might lead to incorrect analysis and quantification.
- Issues with integrating scoring systems like SCORAD, PASI, or SALT could lead to incorrect severity quantification.
- Delays or timeouts in processing and delivering the clinical data could affect timely access to information.
Related risks
- Misrepresentation of magnitude returned by the device
- Misinterpretation of data returned by the device
- Incorrect clinical information: the care provider receives into their system data that is erroneous
- Incorrect diagnosis or follow up: the medical device outputs a wrong result to the HCP
- Incorrect results shown to the patient
- Sensitivity to image variability: analysis of the same lesion with images taken with deviations in lighting or orientation generates significantly different results
- Inaccurate training data: image datasets used in the development of the device are not properly labelled
- Biased or Incomplete Training Data: image datasets used in the development of the device are not properly selected
- Lack of efficacy or clinical utility
- Stagnation of model performance: the AI/ML models of the device do not benefit from the potential improvement in performance that comes from re-training
- Degradation of model performance: automatic re-training of models decreases the performance of the device
User Requirement, Software Requirement Specification and Design Requirement
- User Requirement 1.1: Users shall be provided with accurate and quantifiable data that quantify the intensity of clinical signs.
- Software Requirement Specification 1.2: The device shall process clinical sign data and generate an output which quantifies the intensity of clinical signs.
- Design Requirement 1.3: The device responses containing data on the intensity of clinical signs shall be structured with key-value pairs, utilising standard medical terminology and coding as key names.
Description
Skin conditions are distinguished by a spectrum of observable manifestations and associated symptoms. These visual indicators, including but not limited to erythema, are pervasive across a diverse range of dermatological pathologies; however, their presence is not uniformly evident in all instances of a given pathology. These visual manifestations often coincide with accompanying symptoms such as fever, pruritus, and more.
The quantification of the severity of these visual indicators constitutes a crucial endeavour in comprehending the extent of a particular pathology. Typically, this quantification process employs a rating scale, often ranging from 0 to 3 or 0 to 4, for ease of human evaluation. While scales extending from 0 to 10 are infrequently employed for visual indicators, they may find utility in assessing symptoms such as pruritus.
It is imperative to acknowledge that gauging the intensity of these visual indicators is inherently subjective, contingent upon the observer's ability to discern how the visual signs manifest on the skin. Consequently, greater expertise on the part of the observer typically yields more accurate results, albeit retaining a subjective element. This subjectivity gives rise to inter-observer variability, leading to variances of approximately 10-20% in the quantification of the intensity of visual signs like erythema, induration, swelling, and others when assessed by different specialists. This inter-observer variability introduces a degree of unreliability in the severity assessment process, necessitating the observation of a notable improvement or deterioration to ensure the validity of the quantification.
To address this, a suite of algorithms is needed to automate the quantification of the following visual indicators:
- Erythema
- Induration
- Desquamation
- Edema
- Oozing
- Excoriation
- Lichenification
- Dryness
- Pustulation
- Exudation
- Edges
- Affected tissues
- Facial palsy
The visual signs mentioned above are among the most frequently encountered indicators used in the assessment of dermatosis and facial nerve injury.
Automating the quantification of visual signs entails a structured approach, typically divided into two principal phases, mirroring the common workflow of data science projects: data annotation and algorithm development.
Data annotation
The first phase, data annotation, is essential to understanding the inter-observer variability and serves as the foundation for training the algorithm. During this step, medical professionals carefully review individual images and assign intensity values to the specific visual signs we are interested in. In other words, the professionals annotate the images to generate classification labels (either categorical or ordinal) for the later training of the deep learning models.
The key here is selecting the right medical experts and determining an appropriate group size. We typically engage a minimum of three experienced physicians for this task, and for more complex assignments, a larger panel is preferred. We have enlisted the expertise of three doctors who specialise in assessing the severity of pathologies commonly associated with the visual signs we are studying, specifically atopic dermatitis and psoriasis.
By pooling the assessments of these three experts, we establish a ground truth dataset. This dataset serves a dual purpose: it becomes the foundation for training our algorithms and also allows us to gauge inter-observer variability, a critical measure of the accuracy and consistency of our measurements.
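The pooling step can be sketched as follows. This is a minimal illustration, not the procedure actually used: the median is assumed as the pooling rule, the helper names `consensus_labels` and `annotator_rmae` are hypothetical, and the ratings are made-up values on a 0 to 3 scale.

```python
from statistics import mean, median

def consensus_labels(annotations):
    """Median rating per image across annotators (assumed pooling rule)."""
    return [median(ratings) for ratings in annotations]

def annotator_rmae(annotations, annotator, max_intensity):
    """Relative MAE (%) of one annotator against the consensus of all annotators."""
    consensus = consensus_labels(annotations)
    errors = [abs(ratings[annotator] - c) for ratings, c in zip(annotations, consensus)]
    return 100 * mean(errors) / max_intensity

# Hypothetical erythema ratings: one row per image, one column per annotator.
ratings = [
    [2, 2, 3],
    [0, 1, 0],
    [3, 3, 3],
    [1, 2, 2],
]

print(consensus_labels(ratings))
for annotator in range(3):
    print(annotator, annotator_rmae(ratings, annotator, max_intensity=3))
```

Because each annotator also contributes to the consensus, this simple version slightly underestimates their disagreement; a leave-one-out consensus would avoid that.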
Algorithm development
The next phase involves the development of the algorithms, which will rely on the ground truth data collected during the previous stage. The outcomes generated by the algorithms will then be juxtaposed with the measured variability. This step is particularly crucial since tasks of this complexity, prone to inherent variability, necessitate comparison with the prevailing baseline or state-of-the-art standards. This comparative analysis is essential for validating the algorithm's performance accurately.
It's worth emphasizing that the convolutional neural networks we are training will assimilate knowledge from the collective expertise of specialists. It's important to acknowledge that a significant subjective element is inherent in this process, given the nuanced nature of the task.
When possible, the dataset used to develop each algorithm is split into training, validation, and test sets. However, when the sample size is limited, the data is split into training and validation only to ensure each set contains enough data.
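The splitting rule described above could look like the sketch below. The fractions, the `min_test_size` cut-off and the function name `split_dataset` are assumptions chosen for illustration; the original text does not state the actual values.

```python
import random

def split_dataset(items, val_fraction=0.15, test_fraction=0.15, min_test_size=50, seed=0):
    """Return (train, val, test); test is empty when the dataset is too small."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_val = max(1, round(len(items) * val_fraction))
    n_test = round(len(items) * test_fraction)
    if n_test < min_test_size:
        n_test = 0  # limited sample size: training and validation only
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

In practice the items being shuffled would usually be patients or cases rather than individual images, so that images of the same lesion do not end up in different sets.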
Success metrics
To accurately gauge the performance of our models, it's essential to recognise that we are dealing with an ordinal classification problem in all instances. Although the labels are numeric categories, the output of a deep learning classification model can take two forms: an integer, obtained by applying the argmax function to the predicted probabilities, or a continuous value, obtained by multiplying each category by its probability and summing the results. Given the nature of this problem, we adopt the evaluation metrics used in state-of-the-art ordinal classification solutions [1] [2], namely the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), which are the most appropriate for evaluating most types of visual signs.
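The two output forms and the MAE can be illustrated with a short sketch. The probability vector is a hypothetical model output over intensities 0 to 3, and the function names are chosen here for illustration only.

```python
def argmax_output(probabilities):
    """Integer output: index of the most probable intensity category."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])

def expected_output(probabilities):
    """Continuous output: each category value times its probability, summed."""
    return sum(k * p for k, p in enumerate(probabilities))

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical class probabilities for one image (intensities 0, 1, 2, 3).
probs = [0.05, 0.15, 0.5, 0.3]

print(argmax_output(probs))    # integer prediction
print(expected_output(probs))  # continuous prediction
print(mean_absolute_error([2, 3, 1], [2, 2, 1]))
```

The continuous form keeps information about how the probability mass is spread over neighbouring categories, which the argmax discards.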
However, we must consider that the range for each visual sign can vary, as they are not all quantified on the same scale. This scaling is grounded in clinical evidence. Consequently, interpreting the metric in a percentage format, like the Relative MAE (RMAE), is beneficial for comparative performance analysis. This rationale underpins our decision to use RMAE for most visual sign intensity quantifications.
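As a point of reference, a common way to define the relative error is to divide the MAE by the maximum value of the scale. This definition is an assumption made here for illustration; the symbols $N$, $y_i$, $\hat{y}_i$ and $y_{\max}$ do not appear in the original text.

```latex
\mathrm{RMAE} = \frac{100}{y_{\max}} \cdot \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \; [\%]
```

Under this definition, an MAE of 0.5 corresponds to an RMAE of about 16.7% on a 0 to 3 scale and 12.5% on a 0 to 4 scale, which is why the relative form is easier to compare across visual signs.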
Besides that, the selection of thresholds should also align with specialist annotator performance. It is important to note that these thresholds reflect the necessity for the medical device to achieve performance comparable to that of specialist annotators, without requiring it to surpass them significantly. The intention behind selecting the thresholds is to ensure the device's results are on par with human expert performance, thereby meeting clinical utility expectations. Setting the thresholds below the observed specialist annotators' performance (e.g., the average RMAE of specialist annotators across the three datasets mentioned in [3]) would create an unrealistic expectation that the device consistently outperforms human experts. This rationale underscores the balance between maintaining clinical relevance and accounting for practical device limitations.
Furthermore, relying solely on mean metrics from a particular dataset is insufficient to establish an appropriate threshold. Statistical variability, as represented by the standard deviation, must also be considered (e.g., as observed in the "Legit.Health-AD" dataset in [3]) when defining specific thresholds.
To better understand the specific value of 20%, let's consider the most prevalent categories, ranging from 0 to 4. In this scenario, a random assessment by multiple annotators would yield an average error rate of about 40% under a uniformly distributed intensity scenario. However, if the distribution peaks at intensity values of 2-3, which is typical in real-world data, the average error rate naturally falls to approximately 30-35%. Drawing on this observation, and considering the variability analysis of annotators, who typically achieve error rates between 12% and 20% [3], we opted to set our threshold at 20%. This threshold is chosen with the objective of achieving results that are on par with, or surpass, those of specialists.
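The random-assessment baseline can be checked numerically. The sketch below takes one possible reading of it, namely the expected absolute difference between two independent random ratings divided by the maximum intensity; the peaked distribution is a hypothetical example, not measured data.

```python
from itertools import product

def expected_relative_disagreement(distribution):
    """Mean |a - b| / max_intensity (%) for two independent ratings drawn from `distribution`."""
    max_intensity = len(distribution) - 1
    expected_abs_diff = sum(
        pa * pb * abs(a - b)
        for (a, pa), (b, pb) in product(enumerate(distribution), repeat=2)
    )
    return 100 * expected_abs_diff / max_intensity

uniform = [0.2] * 5                    # intensities 0..4, all equally likely
peaked = [0.1, 0.15, 0.3, 0.3, 0.15]   # hypothetical distribution peaking at 2-3

print(expected_relative_disagreement(uniform))  # about 40
print(expected_relative_disagreement(peaked))   # between 30 and 35
```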
When intensity is quantified in different terms, such as ordinal categories without a defined distance between them, we choose the most common (in classification problems) accuracy metric due to its suitability. In particular, we calculate the Balanced Accuracy. This metric accounts for performance across all categories, rather than relying solely on standard accuracy, which can be skewed by an imbalanced dataset—a frequent occurrence in clinical datasets.
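A small example shows why Balanced Accuracy is preferred on imbalanced data. Balanced Accuracy is computed here as the mean of per-category recall; the labels are hypothetical and the category names are only placeholders.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-category recall, so every category weighs the same."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical imbalanced dataset: a model that always predicts the majority category.
y_true = ["low"] * 8 + ["moderate", "high"]
y_pred = ["low"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.8, looks acceptable
print(balanced_accuracy(y_true, y_pred))  # one third, exposes the failure
```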
Upon reviewing the variability analysis, we determined that a 60% threshold is fitting as it matches the evaluations of the majority of experts. Four wound care specialists assessed factors such as edge intensity, tissue type, and exudate levels, yielding accuracy rates between 50% and 90%. Considering the performance of the most experienced annotator, who approached the consensus level of around 75%, we deemed a 60% threshold appropriate for the model.
Goal | Metric | Target |
---|---|---|
The automatic intensity quantification of erythema is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of induration is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of desquamation is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of edema is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of oozing is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of excoriation is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of lichenification is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of dryness is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of pustulation is at expert consensus level | RMAE | < 20% |
The automatic intensity quantification of exudation is at expert consensus level | BAC | > 60% |
The automatic intensity quantification of edges is at expert consensus level | BAC | > 60% |
The automatic intensity quantification of affected tissues is at expert consensus level | BAC | > 60% |
The automatic intensity quantification of facial palsy is at expert consensus level | RMAE | < 20% |
Model bias
The primary causes of bias in the model include:
- Skin tone bias: The model may exhibit reduced accuracy for individuals with certain skin tones, potentially due to insufficient representation of diverse skin types in the training dataset.
- Image condition bias: The model may perform inconsistently for images captured with different cameras or under varying lighting conditions, as these factors can influence image quality and appearance.
To address potential bias related to skin tone, we prioritize incorporating images from dermatology atlases that span all Fitzpatrick skin types whenever possible. These atlases provide a comprehensive range of skin tones, from Fitzpatrick types I to VI, enabling us to assess and mitigate model bias across diverse skin tones. For instance, in the retrospective study ([3]), the model was evaluated using datasets with distinct Fitzpatrick ranges: one dataset included skin tones from types I to III, while another focused on types IV to VI. This evaluation, detailed in the article, includes quantitative results and statistical analyses that identify and document any biases affecting the model's performance ([3], Results, Tables 5-9).
Furthermore, we construct datasets from public dermatology atlases containing images captured in real clinical settings. These images offer a wide diversity in perspectives, lighting conditions, and camera quality. This variability not only aids the model in learning to generalize effectively across different imaging scenarios but also ensures that performance evaluations reflect real-world conditions. By leveraging such diverse datasets, we minimize biases introduced by specific imaging technologies or settings and provide robust assessments of the model's performance in varied clinical contexts.
Model robustness
We apply rigorous cross-validation methods, including k-fold cross-validation, to evaluate the model's performance across multiple datasets (example in [3] Experimental Setup). This process allows us to assess the model's generalizability and its robustness to variations in data distribution. Additionally, it provides quantitative measures of robustness, such as confidence intervals for key performance metrics (example in [3], Results, Tables 5-9). These confidence intervals help quantify the reliability of the model under different conditions.
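A minimal sketch of k-fold cross-validation with a confidence interval is shown below. The fold scores are hypothetical RMAE values, the function names are illustrative, and the interval uses a Student's t approximation that treats the folds as independent, which is a simplification.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def confidence_interval(fold_scores, t_critical=2.776):
    """Two-sided 95% interval for the mean; 2.776 is Student's t for 5 folds (4 d.o.f.)."""
    centre = mean(fold_scores)
    half_width = t_critical * stdev(fold_scores) / len(fold_scores) ** 0.5
    return centre - half_width, centre + half_width

for train_idx, val_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(val_idx))

# Hypothetical per-fold RMAE values (%), one per validation fold.
fold_rmae = [14.0, 15.5, 13.2, 16.1, 14.7]
print(confidence_interval(fold_rmae))
```

With a different number of folds, `t_critical` has to be replaced by the value for the corresponding degrees of freedom.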
During training, we apply a variety of data augmentation techniques to simulate real-world imaging conditions. These include random rotations, flips, zooms, lighting adjustments, and other transformations applied to the input images. This approach ensures the model learns to generalize effectively across diverse imaging scenarios, reducing dependence on specific conditions in the training dataset. By introducing these variations, we enhance both the model's robustness and its generalization capabilities.
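The kind of transformations listed above can be sketched on a toy greyscale image stored as nested lists. This is a simplified illustration with hypothetical helper names; a real training pipeline would typically rely on an image augmentation library and also cover zooms and arbitrary rotation angles.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]

def adjust_brightness(image, factor):
    """Scale pixel values and clip them to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def augment(image, rng):
    """Apply a random flip, rotation and lighting change to one image."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    for _ in range(rng.randrange(4)):  # 0, 90, 180 or 270 degrees
        image = rotate_90(image)
    return adjust_brightness(image, rng.uniform(0.8, 1.2))

image = [[10, 20], [30, 40]]  # hypothetical 2x2 greyscale image
rng = random.Random(0)
print(augment(image, rng))
```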
References
[1] Cao, W., Mirjalili, V., & Raschka, S. (2020). Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140, 325-331.
[2] SHI, Xintong; CAO, Wenzhi; RASCHKA, Sebastian. Deep neural networks for rank-consistent ordinal regression based on conditional probabilities. Pattern Analysis and Applications, 2023, vol. 26, no 3, p. 941-955.
[3] Medela A, Mac Carthy T, Aguilar Robles SA, Chiesa-Estomba CM, Grimalt R. Automatic SCOring of Atopic Dermatitis Using Deep Learning: A Pilot Study. JID Innov. 2022 Feb 11;2(3):100107. doi: 10.1016/j.xjidi.2022.100107. PMID: 35990535; PMCID: PMC9382656.
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: JD-004, JD-005, JD-009, JD-017
- Approver: JD-003