REQ_003 The user receives quantifiable data on the extent of clinical signs
Category​
Major
Source​
- Dr. Gastón Roustan, dermatologist at Hospital Puerta de Hierro
- Dr. Ramon Grimalt, dermatologist at Grimalt Dermatología
- Dr. Sergio Vaño, dermatologist at Hospital Ramón y Cajal
USER SYSTEM INPUTS AND OUTPUTS · DATABASE AND DATA DEFINITION · ARCHITECTURE
Activities generated​
- MDS-100
- MDS-99
- MDS-173
- MDS-393
Causes failure modes​
- The AI models might misinterpret or miscalculate the extent of the clinical signs in the images, leading to an incorrect assessment of their extent.
- Poor quality or improperly taken images might result in incorrect analysis and quantification of the extent of clinical signs.
- Delays or timeouts in processing and delivering the extent data could affect timely access to accurate information.
- Incorrect units of measurement for the extent of clinical signs might be used or displayed.
- The AI might incorrectly identify the boundaries of the clinical signs, leading to an inaccurate assessment of their extent.
- Variations in lighting, angle, or distance in the images might affect the AI's ability to accurately determine the surface area of clinical signs.
Related risks​
- Misrepresentation of magnitude returned by the device
- Misinterpretation of data returned by the device
- Incorrect clinical information: the care provider receives into their system data that is erroneous
- Incorrect results shown to patient
- Sensitivity to image variability: analysis of the same lesion with images taken under deviations in lighting or orientation generates significantly different results
- Inaccurate training data: image datasets used in the development of the device are not properly labeled
- Biased or incomplete training data: image datasets used in the development of the device are not properly selected
- Lack of efficacy or clinical utility
- Stagnation of model performance: the AI/ML models of the device do not benefit from the potential improvement in performance that comes from re-training
- Degradation of model performance: automatic re-training of models decreases the performance of the device
User Requirement, Software Requirement Specification and Design Requirement​
- User Requirement 3.1: Users shall receive data quantifying the spatial extent of identified clinical signs.
- Software Requirement Specification 3.2: Algorithms shall analyze and quantify the spatial distribution and size of clinical signs.
- Design Requirement 3.3: Data on the extent of clinical signs shall be returned in a FHIR-compliant format, ensuring consistent communication and interoperability.
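As an illustration of the design requirement above, the snippet below sketches how an extent value could be packaged as a FHIR R4 Observation. The resource profile, the code system and the codes shown here are illustrative placeholders, not the device's actual output contract.

```python
import json

# Minimal sketch of an extent result packaged as a FHIR R4 Observation.
# The profile, code system and codes below are illustrative placeholders,
# not the device's actual output contract.
def extent_to_fhir_observation(patient_id: str, sign: str, extent_pct: float) -> dict:
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            # Hypothetical coding; a real implementation would use the agreed
            # code system (e.g. LOINC or SNOMED CT) for each clinical sign.
            "coding": [{"system": "http://example.org/clinical-signs", "code": sign}],
            "text": f"Extent of {sign}",
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {
            "value": round(extent_pct, 1),
            "unit": "%",  # percentage of body surface area, UCUM-coded
            "system": "http://unitsofmeasure.org",
            "code": "%",
        },
    }

print(json.dumps(extent_to_fhir_observation("123", "erythema", 12.4), indent=2))
```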
Description​
Quantifying lesions and assessing their intensity play a pivotal role in evaluating the severity of skin conditions, but these metrics may not always suffice. Some visual signs manifest as plaques whose extent across the body varies and directly impacts the overall severity. For instance, a dry patch covering 10% of the body surface area (BSA) may expand to affect 20% of BSA, covering twice as much skin.
Accurately quantifying the surface area of these visual signs is a complex task, even for seasoned dermatologists. Yet, it is indispensable for assessing the severity of numerous skin pathologies, including atopic dermatitis, psoriasis, and alopecia. Determining the surface area in relation to the entire body is subjective when relying solely on the human eye.
To tackle this challenge, there is a pressing need to develop a suite of algorithms that automate the quantification of the surface area of various visual signs, such as the following (a sketch of how such a surface fraction could be computed appears after the list):
- Erythema
- Induration
- Desquamation
- Edema
- Oozing
- Excoriation
- Lichenification
- Dryness
- Hair density
- Wounds
- Maceration
- Necrotic tissue
- Bone tissue
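As a minimal sketch of what such an algorithm ultimately computes, the snippet below derives the fraction of visible skin covered by a sign from two binary masks. It assumes that upstream models already segment the skin and the sign of interest; the function name and shapes are illustrative.

```python
import numpy as np

def sign_extent_fraction(sign_mask: np.ndarray, skin_mask: np.ndarray) -> float:
    """Fraction of visible skin covered by a clinical sign in one 2D image.

    Both inputs are boolean masks of the same shape: skin_mask marks pixels
    classified as skin, sign_mask marks pixels classified as the sign
    (e.g. erythema). The extent is the ratio of sign pixels to skin pixels.
    """
    sign_on_skin = np.logical_and(sign_mask, skin_mask)
    skin_pixels = skin_mask.sum()
    if skin_pixels == 0:
        return 0.0
    return float(sign_on_skin.sum() / skin_pixels)

# Toy example: a 4x4 image in which half of the skin shows the sign.
skin = np.ones((4, 4), dtype=bool)
sign = np.zeros((4, 4), dtype=bool)
sign[:2, :] = True
print(f"Extent: {sign_extent_fraction(sign, skin):.0%}")  # -> Extent: 50%
```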
The automation of surface area quantification follows a systematic approach, typically divided into two primary phases, mirroring the standard workflow of data science projects: data annotation and algorithm development.
Data annotation​
The initial phase, data annotation, is a critical pillar in understanding inter-observer variability and lays the groundwork for algorithm training. In this stage, medical professionals meticulously evaluate individual images and delineate the surface area with the aid of a polygon tool. All professionals receive training before annotation to ensure they understand the labeling task.
A pivotal consideration in this phase involves the selection of the right medical experts and the determination of an optimal team size. Typically, we engage a minimum of three seasoned physicians for this task. To ensure the precision required for diverse pathologies, we have assembled a trio of specialists for atopic dermatitis and psoriasis, another four for wounds, and an additional trio for alopecia, as each pathology necessitates distinct expertise.
By pooling the assessments of these experts, we establish a ground truth dataset. This dataset serves a dual purpose: it becomes the foundation for training our algorithms and also allows us to gauge inter-observer variability, a critical measure of the performance and consistency of our measurements.
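The snippet below sketches one way the pooled annotations could be turned into a ground truth and how inter-observer variability could be measured. It assumes each expert's polygons have already been rasterized into binary masks, and uses a pixel-wise majority vote, which is only one possible consensus rule.

```python
import itertools
import numpy as np

def consensus_mask(expert_masks: list[np.ndarray]) -> np.ndarray:
    """Pixel-wise majority vote over binary masks drawn by different experts."""
    stacked = np.stack(expert_masks).astype(float)
    return stacked.mean(axis=0) >= 0.5

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

def inter_observer_iou(expert_masks: list[np.ndarray]) -> float:
    """Mean pairwise IoU across experts: the baseline the model is compared to."""
    scores = [iou(a, b) for a, b in itertools.combinations(expert_masks, 2)]
    return float(np.mean(scores))

# Toy example with three annotators of a 64x64 image.
masks = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
ground_truth = consensus_mask(masks)
print(f"Inter-observer IoU baseline: {inter_observer_iou(masks):.2f}")
```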
Algorithm development​
The next phase involves the development of the algorithms, which will rely on the ground truth data collected during the previous stage. The outcomes generated by the algorithms will then be juxtaposed with the measured variability. This step is particularly crucial since tasks of this complexity, prone to inherent variability, necessitate comparison with the prevailing baseline or state-of-the-art standards. This comparative analysis is essential for validating the algorithm's performance accurately.
It is worth emphasizing that the convolutional neural networks we train assimilate knowledge from the collective expertise of the specialists, and that a significant subjective element is inherent in this process given the nuanced nature of the task.
When possible, the dataset used to develop each algorithm is split into training, validation, and test sets. However, when the sample size is limited, the data is split into training and validation only to ensure each set contains enough data.
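A minimal sketch of that splitting logic is shown below; the threshold used to decide when a dataset is too small for a separate test set, as well as the split ratios, are illustrative values rather than the ones used for each algorithm.

```python
from sklearn.model_selection import train_test_split

def split_dataset(image_ids, small_dataset_threshold=500, seed=42):
    """Split annotated images into train/val/test, or train/val when data is scarce.

    small_dataset_threshold and the split ratios are illustrative values,
    not the ones used for any particular algorithm of the device.
    """
    if len(image_ids) < small_dataset_threshold:
        # Limited data: keep only a training and a validation split so that
        # each set still contains enough images.
        train, val = train_test_split(image_ids, test_size=0.2, random_state=seed)
        return {"train": train, "val": val}
    train, rest = train_test_split(image_ids, test_size=0.3, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return {"train": train, "val": val, "test": test}
```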
Success metrics​
The efficacy of automatic quantification of clinical signs, which relies solely on visual data from 2D images, is intrinsically linked to the capability of accurately capturing these signs on camera. By carefully reviewing the evaluation methodologies in state-of-the-art medical solutions [1, 2], and taking into account that some clinical signs are more visually prominent than others, we tailored the choice of metrics and success thresholds for each task to align with its inherent level of difficulty.
We use the Intersection over Union (IoU) and Area Under the Curve (AUC) metrics because they are widely accepted in medical imaging as quantitative measures of model performance.
We have determined that an Intersection over Union (IoU) value of 0.5 serves as the minimum benchmark for achieving an expert consensus level of performance in quantifying hair density, wounds, maceration, necrotic tissue, and bone extent. This threshold reflects the complexity of these tasks and is informed by research conducted with expert dermatologists and nurses. The IoU metric measures the overlap between the predicted region and the ground-truth annotation, with scores ranging from 0 to 1.
For tasks that are comparatively less complex, we have adopted a more stringent criterion, setting an Area Under the Curve (AUC) threshold of 0.8. The AUC ranges from 0 to 1, where a value of 1 represents a model that perfectly distinguishes between the classes and a value of 0.5 indicates no predictive power (equivalent to random classification). Typically, AUC values above 0.8 are considered good, while values below 0.7 are generally unsatisfactory [3]. This threshold ensures a higher level of accuracy and reliability in our model's predictions.
| Goal | Metric |
| --- | --- |
| The automated quantification of erythema, induration and desquamation surface area achieves an expert consensus level of performance | Area Under the Curve (AUC) greater than 0.8 |
| The automated quantification of erythema, edema, oozing, excoriation, lichenification and dryness surface area achieves an expert consensus level of performance | Area Under the Curve (AUC) greater than 0.8 |
| The automated quantification of hair density achieves an expert consensus level of performance | Intersection over Union (IoU) greater than 0.5 |
| The automated quantification of wounds achieves an expert consensus level of performance | Intersection over Union (IoU) greater than 0.5 |
| The automated quantification of maceration achieves an expert consensus level of performance | Intersection over Union (IoU) greater than 0.5 |
| The automated quantification of necrotic tissue achieves an expert consensus level of performance | Intersection over Union (IoU) greater than 0.5 |
| The automated quantification of bone achieves an expert consensus level of performance | Intersection over Union (IoU) greater than 0.5 |
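The sketch below shows how the thresholds from the table above could be checked. It assumes binary ground-truth masks for the IoU-based tasks and binary labels with predicted probabilities for the AUC-based tasks; the level at which AUC is aggregated (pixel, region or image) is an assumption of this example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union between a predicted and a ground-truth binary mask."""
    union = np.logical_or(pred, target).sum()
    return float(np.logical_and(pred, target).sum() / union) if union else 1.0

def passes_iou_criterion(pred_masks, gt_masks, threshold=0.5) -> bool:
    """Mean IoU over the evaluation set must exceed 0.5 (hair density, wounds, ...)."""
    mean_iou = np.mean([iou(p, t) for p, t in zip(pred_masks, gt_masks)])
    return bool(mean_iou > threshold)

def passes_auc_criterion(scores, labels, threshold=0.8) -> bool:
    """AUC must exceed 0.8 for the less complex signs (erythema, dryness, ...).

    labels are binary ground-truth values and scores the predicted
    probabilities; how they are aggregated (pixel, region or image level)
    is an assumption of this sketch.
    """
    return bool(roc_auc_score(labels, scores) > threshold)
```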
[1] Hasan, M. K., Ahamad, M. A., Yap, C. H., & Yang, G. (2023). A survey, review, and future trends of skin lesion segmentation and classification. Computers in Biology and Medicine, 106624.
[2] Mirikharaji, Z., Abhishek, K., Bissoto, A., Barata, C., Avila, S., Valle, E., ... & Hamarneh, G. (2023). A survey on deep learning for skin lesion segmentation. Medical Image Analysis, 102863.
[3] Müller, D., Soto-Rey, I. & Kramer, F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes 15, 210 (2022). https://doi.org/10.1186/s13104-022-06096-y
[4] White, N., Parsons, R., Collins, G. et al. Evidence of questionable research practices in clinical prediction models. BMC Med 21, 339 (2023). https://doi.org/10.1186/s12916-023-03048-6
Previous related requirements​
- REQ_001
- REQ_002
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Tester: JD-017, JD-009, JD-005, JD-004
- Approver: JD-003