TEST_002 The user receives quantifiable data on the count of clinical signs
Test type
System
Linked activities
- MDS-98
- MDS-102
- MDS-101
Result
- Passed
Description
Tests were carried out on the automatic visual sign counting algorithms to verify that their performance is comparable to that of an expert dermatologist.
Objective
The goal is to prove that the quantifiable data on the count of clinical signs received by the user is extracted with expert-dermatologist-level performance.
Acceptance criteria
The nodule, abscess, and draining tunnel detection algorithm's Mean Absolute Error (MAE) must be smaller than the annotators' MAE, or its variance must be within 10% of that of the individual annotators. The hive detection algorithm's precision and recall must exceed 50%. The inflammatory lesion detection algorithm's precision and recall must exceed 70%.
Materials & methods
Ground truth generation
A total of 2,012 images were utilized, with 221 pertaining to hidradenitis suppurativa, 1,457 to acne, and 334 to urticaria. Each image set was evaluated by different dermatology experts specializing in their respective conditions, with 6 experts for hidradenitis and 5 for urticaria. No additional annotators were required for acne, as the dataset (ACNE04) was already labeled.
To determine whether the dataset is adequate to meet our objectives, we first evaluate the complexity of the task at hand. This evaluation involves a thorough analysis of medical evidence, supplemented by a variability study conducted in collaboration with experienced doctors (expanded in the papers provided as evidence). These doctors possess an in-depth understanding of the nuances of the problem, enabling us to establish a robust baseline for our analysis.
When applying these criteria to ACNE04, a dataset that includes pre-existing labels, there is already substantial evidence provided by the authors. They assert, "This indicates that the label distribution of the lesion number possesses the inherent potential to represent the continuous features of acne images effectively." Furthermore, they note, "The observed improvements also validate the feasibility and capability of using computer vision to discriminate between varying degrees of acne severity by counting lesions." Additionally, it has been established that this method not only achieves performance on par with dermatologists but also surpasses that of two dermatologists to a certain extent.
The remaining subsets (hidradenitis suppurativa, urticaria) did not have any associated labels. To establish the ground truth, each case was evaluated by a board of dermatologists, and we devised several methods that combined the annotations from these specialists.
- For the hidradenitis suppurativa dataset, we developed a four-stage aggregation algorithm, with a small tweak in favor of the most experienced and best-performing specialists. This method, which we called knowledge unification, made it possible to fuse multi-label annotations from several specialists into a final set of labels that constitute the consensus of the board.
- For the urticaria dataset, we developed a different method, as we only had to deal with single-class labels (hives). In broad terms, our method considers every box individually and, based on the level of similarity between annotators, combines them to obtain a final set of boxes that reflects the highest consensus. In other words, each bounding box in a given image is transformed into a Gaussian distribution, and all boxes are then combined in a weighted sum based on the compared performance of their annotators.
These label fusion methods are described in detail in the papers attached in the evidence section. The result of these methods is a set of accurate bounding boxes that can be used for training object detection models. A simplified illustration of the weighted-fusion idea is sketched below.
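For illustration, the following is a minimal sketch of the weighted box-fusion idea described above, not the exact published algorithm: boxes from different annotators are greedily clustered by overlap (IoU) and each cluster is averaged using per-annotator reliability weights. The IoU threshold and the weights are illustrative placeholders.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_annotations(boxes, weights, iou_thr=0.5):
    """Greedily cluster boxes from several annotators by IoU and average each
    cluster, weighting the coordinates by annotator reliability."""
    clusters = []  # each cluster is a list of (box, weight) pairs
    for box, weight in zip(boxes, weights):
        box = np.asarray(box, dtype=float)
        for cluster in clusters:
            if iou(box, cluster[0][0]) >= iou_thr:
                cluster.append((box, weight))
                break
        else:
            clusters.append([(box, weight)])
    fused = []
    for cluster in clusters:
        coords = np.stack([b for b, _ in cluster])
        ws = np.array([w for _, w in cluster])
        fused.append((coords * ws[:, None]).sum(axis=0) / ws.sum())
    return fused

# Two annotators (reliability weights 0.6 and 0.4) marking roughly the same hive.
print(fuse_annotations([(10, 10, 50, 50), (12, 12, 54, 52)], [0.6, 0.4]))
```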
Data splitting
After generating the ground truth labels, the datasets were split into training and validation sets. Due to the limited size of the datasets, we preferred a train/validation split over a train/validation/test setup to ensure that the validation set contained enough images to provide reliable metrics. To make the most of this train/validation approach, we conducted K-fold cross-validation for each case; K-fold cross-validation is the recommended approach for estimating the skill of a machine learning model on unseen data when data is limited. The results provided are the average metrics obtained from cross-validation.
We manually reviewed the hidradenitis suppurativa and urticaria datasets to ensure the correct stratification of the images patient-wise to prevent data leakage. For the ACNE04 dataset, we used the train/validation splits provided by the authors, which are also stratified patient-wise.
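As a minimal sketch of this patient-wise splitting, assuming image metadata with a patient identifier column (the file names and patient IDs below are placeholders), scikit-learn's GroupKFold keeps all images of a given patient in the same fold:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical metadata table: one row per image, with the patient identifier
# used as the grouping key so that all images of a patient stay in one fold.
df = pd.DataFrame({
    "image": ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"],
    "patient_id": ["P01", "P01", "P02", "P03"],
})

cv = GroupKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(cv.split(df, groups=df["patient_id"])):
    train_images = df.iloc[train_idx]["image"].tolist()
    val_images = df.iloc[val_idx]["image"].tolist()
    print(f"fold {fold}: {len(train_images)} train / {len(val_images)} val images")
```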
Model training
Regarding the model, object detection involves identifying specific object instances within images. State-of-the-art methods can be broadly categorized into two types: one-stage and two-stage methods. One-stage methods prioritize inference speed and encompass models like YOLO, SSD, and RetinaNet. In contrast, two-stage methods prioritize detection accuracy and include models such as Faster R-CNN, Mask R-CNN, and Cascade R-CNN. For our study, we selected the YOLO architecture. Specifically, we used YOLOv5, an open-source implementation widely adopted by the machine learning community. YOLOv5 offers several architectures with an increasing number of parameters: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x.
We fine-tuned several YOLOv5 models for the detection of nodules, abscesses, draining tunnels, hives, and inflammatory lesions. To do so, we defined three main tasks and trained and validated their corresponding models: one YOLOv5 model for nodule, abscess, and draining tunnel detection; one YOLOv5 model for hive detection; and one YOLOv5 model for inflammatory lesion detection. We followed a transfer learning strategy, which means that we used model weights obtained from pre-training the models on large-scale image datasets (ImageNet) as the starting point for the model to learn from the new dermatology-related tasks. The rest of the training and validation protocol was the default for the open-source YOLOv5 implementation.
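As an illustration only, and not the exact training or deployment pipeline, the snippet below shows how a fine-tuned YOLOv5 checkpoint can be loaded through the repository's torch.hub interface and used to count detections per class; the weights file and image path are hypothetical placeholders.

```python
import torch

# Load a fine-tuned YOLOv5 checkpoint (placeholder file name) through the
# official torch.hub entry point of the ultralytics/yolov5 repository.
model = torch.hub.load("ultralytics/yolov5", "custom", path="hs_detector.pt")

# Run inference on a single image (placeholder file name).
results = model("hs_case_001.jpg")

# Each detection is one row with a predicted class name; counting rows per
# class yields the sign counts used downstream for the severity scores.
detections = results.pandas().xyxy[0]
counts = detections["name"].value_counts().to_dict()
print(counts)  # e.g. {"nodule": 3, "abscess": 1}
```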
Once each model is trained, it can later be used to automatically detect and count the different signs, compute the specific severity scores (such as UAS and IHS4), and then categorize a case as "clear", "mild", "moderate", or "severe" accordingly. The exact logic used to create severity stages can be found in our publications.
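As an example of this scoring step, the sketch below computes IHS4 from the detected counts using the published weighting (1 x nodules + 2 x abscesses + 4 x draining tunnels) and the cut-offs reported in the literature (mild ≤ 3, moderate 4–10, severe ≥ 11); the exact staging logic used by the device, including the "clear" stage, is the one described in the cited publications.

```python
def ihs4(nodules: int, abscesses: int, draining_tunnels: int) -> int:
    """IHS4 = 1 x nodules + 2 x abscesses + 4 x draining tunnels."""
    return nodules + 2 * abscesses + 4 * draining_tunnels

def ihs4_severity(score: int) -> str:
    """Published IHS4 cut-offs: mild <= 3, moderate 4-10, severe >= 11.
    The additional "clear" stage used by the device is defined in the cited
    publications and is not reproduced here."""
    if score <= 3:
        return "mild"
    if score <= 10:
        return "moderate"
    return "severe"

# Example: 3 nodules, 1 abscess, 0 draining tunnels -> IHS4 = 5 -> "moderate".
print(ihs4_severity(ihs4(3, 1, 0)))
```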
The processors we assessed in this test are AUAS, AIHS4, NSIL, and ALADIN.
Results
Nodule, abscess, and draining tunnel detection
The YOLOv5 models offer compelling performance in detecting nodules, abscesses, and draining tunnels. Once the model has detected all the nodules, abscesses, and draining tunnels in an image, the IHS4 score is calculated from the total number of detections. This predicted IHS4 score is compared to the ground truth IHS4 score to obtain the mean absolute error (MAE). For instance, the YOLOv5x model achieves MAE values of 2.16, 3.37, and 5.26 for mild, moderate, and severe cases, respectively, compared to the doctors' scores of 2.04, 3.01, and 4.88.
When considering all three severity categories, the dermatologists' average variance is 1.57, while the algorithm's average variance stands at 0.096. Notably, the algorithm's variance is approximately 6% of the annotators' variance, which is well below the expected 10% metric deviation.
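For clarity, the variance comparison behind this acceptance check can be reproduced from the figures reported above:

```python
# Average variances reported above for the IHS4 scoring comparison.
annotator_variance = 1.57
algorithm_variance = 0.096

# Acceptance criterion: the algorithm's variance must stay below 10% of the
# individual annotators' variance.
ratio = algorithm_variance / annotator_variance
print(f"{ratio:.1%}")  # ~6.1%, below the 10% threshold
```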
Inflammatory lesion detection
In terms of inflammatory lesion detection, the algorithm's Mean Absolute Error (MAE) of 5.56 is notably lower than the specialists' average MAE of 7.5. Although the algorithm exhibits a higher variance than the dermatologists, this variance leans toward the "positive" side, meaning that the algorithm outperforms the specialists; in this case, a higher variance is indicative of improved performance.
Concerning the detection of inflammatory lesions, our target is to achieve precision and recall rates exceeding 70%. The metrics are outlined in the tables below, and the short averaging check after them reproduces these averages. On average, the top-performing model, YOLOv5 (m), attains a precision of 73.5% and a recall of 74.17%, surpassing our target.
Precision (P)

| Model | Mild | Moderate | Severe | Very severe |
| --- | --- | --- | --- | --- |
| YOLOv5 (s) | 0.866 ± 0.045 | 0.818 ± 0.044 | 0.590 ± 0.088 | 0.648 ± 0.186 |
| YOLOv5 (m) | 0.846 ± 0.037 | 0.824 ± 0.046 | 0.594 ± 0.041 | 0.676 ± 0.073 |
| YOLOv5 (l) | 0.830 ± 0.078 | 0.819 ± 0.051 | 0.600 ± 0.158 | 0.618 ± 0.158 |
| YOLOv5 (x) | 0.781 ± 0.136 | 0.802 ± 0.034 | 0.563 ± 0.041 | 0.543 ± 0.156 |
Recall (R)

| Model | Mild | Moderate | Severe | Very severe |
| --- | --- | --- | --- | --- |
| YOLOv5 (s) | 0.797 ± 0.063 | 0.845 ± 0.045 | 0.736 ± 0.111 | 0.549 ± 0.199 |
| YOLOv5 (m) | 0.802 ± 0.054 | 0.816 ± 0.038 | 0.770 ± 0.071 | 0.579 ± 0.103 |
| YOLOv5 (l) | 0.789 ± 0.072 | 0.804 ± 0.075 | 0.738 ± 0.067 | 0.565 ± 0.197 |
| YOLOv5 (x) | 0.795 ± 0.073 | 0.730 ± 0.096 | 0.693 ± 0.108 | 0.512 ± 0.257 |
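The averages quoted for the top-performing model can be reproduced directly from the per-severity values in the tables:

```python
# Per-severity precision and recall of YOLOv5 (m) from the tables above.
precision = [0.846, 0.824, 0.594, 0.676]
recall = [0.802, 0.816, 0.770, 0.579]

print(sum(precision) / len(precision))  # 0.735   -> 73.5% average precision
print(sum(recall) / len(recall))        # 0.74175 -> 74.17% average recall
```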
Hive detection
Regarding hives, the best-performing model achieved an average precision of 68% and a recall of 57%, although all architectures performed similarly. Regarding the mAP@0.5 metric (mean Average Precision at an intersection-over-union (IoU) threshold of 0.50), we obtained average values above 0.60, which also reflects strong performance given the inherent difficulty of the task; a small worked example of the IoU threshold follows the table.
| Model | Precision | Recall | mAP@0.5 |
| --- | --- | --- | --- |
| YOLOv5 (s) | 0.669 ± 0.045 | 0.578 ± 0.039 | 0.615 ± 0.049 |
| YOLOv5 (m) | 0.684 ± 0.039 | 0.571 ± 0.054 | 0.617 ± 0.064 |
| YOLOv5 (l) | 0.682 ± 0.071 | 0.570 ± 0.046 | 0.618 ± 0.068 |
| YOLOv5 (x) | 0.681 ± 0.043 | 0.578 ± 0.055 | 0.621 ± 0.065 |
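To make the mAP@0.5 threshold concrete, the small self-contained check below (the IoU helper is restated so it runs on its own) shows when a predicted hive box may count as a true positive; the box coordinates are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# At mAP@0.5, a predicted box may only count as a true positive if it overlaps
# a ground-truth box with IoU >= 0.50.
predicted, ground_truth = (10, 10, 50, 50), (12, 8, 52, 48)
print(iou(predicted, ground_truth) >= 0.5)  # True (IoU ~ 0.82)
```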
Protocol deviations
Unlike the other conditions, the acne images did not require annotation, as the ACNE04 dataset already included labels.
Conclusions
The quantifiable data on the count of clinical signs provided to the user matches the performance of expert dermatologists. This result confirms the quality of the training datasets (size and consistency) and offers healthcare practitioners reliable information to support their clinical assessments.
Test checklist
The following checklist verifies the completion of the goals and metrics specified in the requirement REQ_002.
Requirement verification
- Automatic inflammatory lesion counting algorithm precision and recall are greater than 70%
- Automatic hive counting algorithm precision and recall are greater than 50%
- Automatic nodule counting algorithm MAE is less than the annotators' MAE or has less than a 10% variance compared to that of individual annotators
- Automatic abscess counting algorithm MAE is less than the annotators' MAE or has less than a 10% variance compared to that of individual annotators
- Automatic draining tunnel counting algorithm MAE is less than the annotators' MAE or has less than a 10% variance compared to that of individual annotators
Evidence
Evidence is publicly available at https://pubmed.ncbi.nlm.nih.gov/37357665/ and https://www.sciencedirect.com/science/article/pii/S2667026723000437 for the hive, nodule, abscess, and draining tunnel counting algorithms. The scientific papers are also attached:
Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A novel tool to assess the severity of hidradenitis suppurativa using artificial intelligence
Ignacio Hernández Montilla, Alfonso Medela, Taig Mac Carthy, Andy Aguilar, Pedro Gómez Tejerina, Alejandro Vilas Sueiro, Ana María González Pérez, Laura Vergara de la Campa, et al.
Published: 05 June 2023
DOI: https://doi.org/10.1111/srt.13357
Automatic Urticaria Activity Score: Deep Learning–Based Automatic Hive Counting for Urticaria Severity Assessment
Taig Mac Carthy, Ignacio Hernández Montilla, Andy Aguilar, Laura Vergara de la Campa, Fernando Alfageme, Alfonso Medela
Published: July 11, 2023
DOI: https://doi.org/10.1016/j.xjidi.2023.100218
Regarding the inflammatory lesion counting algorithm, a draft is attached:
ALADIN: Automatic Lesion And Density INdex. A Novel Tool for Automatic Acne Severity Assessment
Alfonso Medela, Ignacio Hernández Montilla, Taig Mac Carthy, Andy Aguilar, Sofia Vera Carretero, José Luis López-Estebaranz, Constanza Balboni, Javier Martín Alcalde, José Luis Ramírez Bellver, Alejandro Martín Gorgojo and Pedro Rodríguez Jiménez
Not published
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in the approval of this document, and their roles in the approval process as defined in Annex I Responsibility Matrix of the GP-001, are:
- Tester: JD-017, JD-009, JD-004
- Approver: JD-005