TEST_001 The user receives quantifiable data on the intensity of clinical signs
Test type
System
Linked activities
- MDS-100
- MDS-99
- MDS-173
- MDS-408
Result
- Passed
- Failed
Description
Tests carried out on the automatic visual sign intensity quantification algorithms to verify that their performance is comparable to that of an expert dermatologist.
Objective
The goal is to prove that the quantifiable data on the intensity of clinical signs received by the user is extracted with expert dermatologist-level performance.
Acceptance criteria
The Relative Mean Absolute Error (RMAE) must be smaller than 20% for ordinal visual signs, and the balanced accuracy must be greater than 60% for categorical visual signs (exudation, edges and affected tissues).
Materials & methods
Ground truth generation
We employed a total of 5,459 images, each depicting a dermatosis and displaying the visual signs specified in the requirement. These image sets were evaluated by distinct dermatology experts, enabling us to conduct a variability analysis. Each expert annotated the data by assigning a label (either ordinal or categorical) to every image. The algorithm then learned from the consensus among these experts, which served as the gold standard for training.
To assess whether the dataset size is adequate, we begin by calculating RMAE values, which measure the variability or error in the data. The RMAE metric provides insight into the variability of annotations across annotators for each instance in the dataset. A higher RMAE indicates greater variability in annotations for a particular instance, suggesting ambiguity or complexity in the task; conversely, a lower RMAE suggests more consistent annotations, indicating clarity in the task or data. We plot these RMAE values against the dataset size. As the dataset size increases, the standard deviation of these values fluctuates: it is initially high but gradually decreases. Once the standard deviation within a window of three tests falls below a threshold of 0.02, the dataset's variability has stabilised, suggesting that further additions to the dataset are unlikely to significantly impact performance. We have found this approach to be consistent across different datasets, making it a reliable means of identifying an appropriate dataset size for the task at hand.
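For illustration, a minimal sketch of this stabilisation check is given below; the function name and the example values are illustrative assumptions rather than the actual implementation:

```python
import numpy as np

def dataset_size_is_adequate(rmae_values, window=3, threshold=0.02):
    """Check whether RMAE values computed on increasingly large subsets
    of the dataset have stabilised.

    rmae_values: RMAE values, one per subset size, ordered from the
                 smallest to the largest subset.
    """
    if len(rmae_values) < window:
        return False
    # Standard deviation over the most recent window of tests.
    recent_std = np.std(rmae_values[-window:])
    # Below the threshold, additional data is unlikely to change performance.
    return recent_std < threshold

# Example: RMAE measured on subsets of growing size (illustrative numbers).
rmae_values = [0.21, 0.17, 0.155, 0.148, 0.144, 0.143]
print(dataset_size_is_adequate(rmae_values))  # True if the last 3 values are stable
```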
To establish the ground truth, we employed mean and median statistics, which are widely recognized and established in the literature as the most suitable methods for constructing ground truth when working with ordinal categorical data. After generating the ground truth labels, the dataset was split to obtain a training, a validation, and a test set for each task. As there was no patient-related metadata available, the data was split randomly using K-fold cross-validation. The reported result metrics are computed on the validation sets.
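A minimal sketch of how a consensus label and a random K-fold split could be produced is given below, assuming ordinal annotations stored as integers; the variable names, the number of folds and the example data are illustrative only:

```python
import numpy as np
from sklearn.model_selection import KFold

# annotations: one row per image, one column per expert (ordinal labels).
annotations = np.array([
    [2, 3, 2],
    [1, 1, 2],
    [0, 1, 0],
    [3, 3, 3],
])

# Consensus ground truth: the median (or mean) across annotators.
ground_truth = np.median(annotations, axis=1)

# Random K-fold split, since no patient metadata is available for grouping.
kfold = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(ground_truth):
    print("train:", train_idx, "validation:", val_idx)
```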
To determine whether the dataset is sufficiently large to meet our objectives, we first evaluate the complexity of the tasks at hand. This evaluation involves a thorough analysis of medical evidence, supplemented by a variability study conducted in collaboration with experienced doctors. These doctors possess an in-depth understanding of the nuances of each problem, enabling us to establish a robust baseline for our analysis.
In this scenario, we have computed crucial metrics for the annotated dataset, specifically the Relative Mean Absolute Error (RMAE) and the Relative Standard Deviation (RSD). The results from the doctors yielded values of approximately 13.8% for RMAE and 12.56% for RSD. While these figures reflect a moderate level of agreement among the experts, such a degree of concordance is anticipated for a problem of this complexity. These metrics not only imply that the dataset's size is adequate, but also establish a foundational benchmark for our computer vision algorithms.
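For reference, a hedged sketch of how RMAE and RSD can be computed on the annotated dataset is shown below; the exact normalisation used in our pipeline is not reproduced here, and the formulas below assume the error is expressed relative to the range of the ordinal scale:

```python
import numpy as np

def rmae(predictions, targets, scale_range):
    """Relative Mean Absolute Error: MAE normalised by the scale range,
    expressed as a percentage (assumed normalisation)."""
    mae = np.mean(np.abs(np.asarray(predictions) - np.asarray(targets)))
    return 100.0 * mae / scale_range

def rsd(annotations, scale_range):
    """Relative Standard Deviation: per-image standard deviation across
    annotators, normalised by the scale range and averaged (assumed)."""
    per_image_std = np.std(annotations, axis=1)
    return 100.0 * np.mean(per_image_std) / scale_range

# Illustrative annotations on a 0-3 ordinal scale (not the actual data).
annotations = np.array([[2, 3, 2], [1, 1, 2], [0, 1, 0]])
consensus = np.median(annotations, axis=1)
print(rmae(annotations[:, 0], consensus, scale_range=3))  # error of one annotator
print(rsd(annotations, scale_range=3))
```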
Model training and evaluation
Regarding the model, we trained several multioutput classifiers, one for each task. Each classifier consists of a single deep-learning backbone and several classification heads, one per visual sign. We used the EfficientNet-B0 network architecture, pre-trained on approximately 1.28 million images (1,000 object categories) from the 2014 ImageNet Large Scale Visual Recognition Challenge, and trained it on our dataset using transfer learning. EfficientNets achieve better accuracy and efficiency than previous convolutional neural networks with fewer parameters by applying a scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. There are eight versions of the architecture with different numbers of parameters; B0 is the smallest, achieving a state-of-the-art 77.1% top-1 accuracy on ImageNet with approximately 5 million parameters. Regarding the transfer learning strategy, all the models undergo the same training: first, we freeze all layers (except for the last linear layer) and train the model for several epochs; then we unfreeze all layers and fine-tune the entire model for additional epochs.
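As an illustration of the multioutput architecture and the two-stage transfer-learning schedule described above, a minimal PyTorch sketch follows; the number of classes per head, the learning rates and the optimiser choice are placeholders, not the values used in the actual training:

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiOutputClassifier(nn.Module):
    """Single EfficientNet-B0 backbone with one classification head per visual sign."""
    def __init__(self, heads):
        super().__init__()
        backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")
        num_features = backbone.classifier[1].in_features  # 1280 for B0
        backbone.classifier = nn.Identity()
        self.backbone = backbone
        # heads: {"erythema": 4, "edema": 4, ...} -> intensity levels per sign
        self.heads = nn.ModuleDict(
            {name: nn.Linear(num_features, n_classes) for name, n_classes in heads.items()}
        )

    def forward(self, x):
        features = self.backbone(x)
        return {name: head(features) for name, head in self.heads.items()}

model = MultiOutputClassifier({"erythema": 4, "edema": 4, "dryness": 4})

# Stage 1: freeze the backbone and train only the classification heads.
for param in model.backbone.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(model.heads.parameters(), lr=1e-3)
# ... train for several epochs ...

# Stage 2: unfreeze everything and fine-tune the entire model.
for param in model.backbone.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ... fine-tune for additional epochs ...
```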
In summary, we implemented one model for each of the following tasks, each task including one or more visual signs:
- Estimation of intensity of erythema, edema, oozing, excoriation, lichenification, and dryness
- Estimation of intensity of erythema, induration and desquamation
- Estimation of erythema, pustulation, and desquamation
- Estimation of erythema, exudation, edges, affected tissues
- Estimation of facial palsy
The processors we assessed in this test are APULSI, APASI, NSIL and ASCORAD.
Results
The algorithms meet the acceptance criteria. For erythema, edema, oozing, excoriation, lichenification, and dryness, the average Relative Mean Absolute Error (RMAE) is 13%, and every individual RMAE value remains below the 20% threshold. For induration, desquamation, and pustulation, RMAE values are also below 20%, meeting the predetermined criteria. For the evaluation of edges, exudation and affected tissues, the algorithms achieve balanced accuracies of 64%, 74% and 69% respectively, above the 60% threshold. For facial palsy, the algorithm achieves an RMAE of 9%, well below the 20% threshold.
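For reference, balanced accuracy can be computed with standard tooling as sketched below; the labels and predictions are placeholders, not the actual test data:

```python
from sklearn.metrics import balanced_accuracy_score

# Illustrative predictions for a categorical sign such as "edges"
# (values are placeholders, not the actual test data).
y_true = ["delimited", "diffuse", "delimited", "diffuse", "delimited"]
y_pred = ["delimited", "delimited", "delimited", "diffuse", "delimited"]

score = balanced_accuracy_score(y_true, y_pred)
print(f"Balanced accuracy: {score:.0%}")  # acceptance criterion: > 60%
```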
Protocol deviations
There were no deviations from the initial protocol.
Conclusions
The quantifiable data on the intensity of clinical signs provided to the user matches the performance of expert dermatologists. This ensures the quality of the data, offering healthcare practitioners reliable information to support their clinical assessments.
Test checklist
The following checklist verifies the completion of the goals and metrics specified in the requirement REQ_001.
Requirement verification
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying erythema intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying induration intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying desquamation intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying edema intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying oozing intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying excoriation intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying lichenification intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying dryness intensity is less than 20%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying pustulation intensity is less than 20%.
- Balanced Accuracy for quantifying exudation intensity is greater than 60%.
- Balanced Accuracy for quantifying edges intensity is greater than 60%.
- Balanced Accuracy for quantifying affected tissues intensity is greater than 60%.
- The algorithm's Relative Mean Absolute Error (RMAE) for quantifying facial palsy intensity is less than 20%.
Evidence
Evidence for erythema, edema, oozing, excoriation, lichenification and dryness can be found in the following article:
Automatic SCOring of Atopic Dermatitis Using Deep Learning: A Pilot Study
Alfonso Medela, Taig Mac Carthy, S. Andy Aguilar Robles, Carlos M. Chiesa-Estomba, Ramon Grimalt
Published: February 10, 2022
DOI: https://doi.org/10.1016/j.xjidi.2022.100107
Evidence for induration, desquamation and pustulation can be found in the following draft, which contains the whole procedure:
APASI: Automatic Psoriasis Area Severity Index Estimation using Deep Learning
Alfonso Medela¹, Taig Mac Carthy¹, Andy Aguilar¹, Pedro Gómez-Tejerina¹, Carlos M Chiesa-Estomba²,³,⁴, Fernando Alfageme-Roldán⁵, and Gaston Roustan Gullón⁵
1 Department of Medical Computer Vision and PROMs, LEGIT.HEALTH, 48013, Bilbao, Spain
2 Department of Otorhinolaryngology, Osakidetza, Donostia University Hospital, 20014 San Sebastian, Spain
3 Biodonostia Health Research Institute, 20014 San Sebastian, Spain
4 Head Neck Study Group of Young-Otolaryngologists of the International Federations of Oto-rhino-laryngological Societies (YO-IFOS), 13005 Marseille, France
5 Servicio de Dermatología, Hospital Puerta de Hierro, Majadahonda, Madrid, Spain
Not published at the time of writing.
In the following confusion matrix we see the performance on pustules:
Evidence of an outcome for exudation intensity, edges intensity, affected tissue intensity:
| Evaluator | Visual sign intensity |
|---|---|
| Model | Edges: Delimited; Exudation: Fibrinous; Exudate type: Serous; Affected tissues: Affection of bone and/or adnexal tissues |
| Annotator | Edges: Delimited; Exudation: Fibrinous; Exudate type: Serous; Affected tissues: Affection of bone and/or adnexal tissues |
The sample image is the following:
[Sample image annotated with: Edges: Delimited; Exudation: Fibrinous; Exudate type: Serous; Affected tissues: Affection of bone and/or adnexal tissues.]
Evidence of facial palsy intensity quantification:
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Tester: JD-017, JD-009, JD-004
- Approver: JD-005