TEST_001 The user receives quantifiable data on the intensity of clinical signs

Test type

System

Linked activities

  • MDS-100
  • MDS-99
  • MDS-173
  • MDS-408

Result

  • Passed

Description

Tests carried out on the automatic visual sign intensity quantification algorithms to verify that their performance is comparable to that of an expert dermatologist.

Objective

The goal is to prove that the quantifiable data on the intensity of clinical signs received by the user is extracted with expert dermatologist-level performance.

Acceptance criteria

The Relative Mean Absolute Error (RMAE) must be smaller than 20%.
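For reference, a common formulation of this metric, assuming the mean absolute error is normalised by the full range of the intensity scale (the exact normalisation constant used is not restated here), is:

$$
\mathrm{RMAE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert y_i - \hat{y}_i \rvert}{y_{\max} - y_{\min}}
$$

where $y_i$ is the reference intensity for image $i$, $\hat{y}_i$ is the algorithm's estimate, and $y_{\max} - y_{\min}$ is the range of the ordinal scale.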

Materials & methods

Ground truth generation

We employed a total of 5,459 images, each depicting a dermatosis and displaying the visual signs specified in the requirement. These image sets were evaluated by distinct dermatology experts, enabling us to conduct a variability analysis. Each expert annotated the data by assigning a label (either ordinal or categorical) to every image. The algorithm then learned from the consensus among these experts, which served as the gold standard for training.

In order to assess whether the dataset size is adequate, we begin by calculating RMAE values, which measure the variability or error in the data. The RMAE metric provides insight into the variability of annotations across annotators for each instance in the dataset. A higher RMAE indicates greater variability in the annotations for a particular instance, suggesting ambiguity or complexity in the task; a lower RMAE indicates more consistent annotations and therefore greater clarity in the task or data. We plot these RMAE values against the dataset size. As the dataset grows, the standard deviation of these values fluctuates: it is initially high and gradually decreases. We observe the standard deviation within a window of three tests; once it falls below a threshold of 0.02, the dataset's variability has stabilised, suggesting that further additions to the dataset are unlikely to significantly impact performance. This approach has been found to be consistent across different datasets, providing a reliable means of identifying an appropriate dataset size for the task at hand.
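A minimal sketch of this stopping rule, assuming Python and NumPy; the function name and the example RMAE series are illustrative, and only the three-test window and the 0.02 threshold come from the procedure above:

```python
import numpy as np

def dataset_size_sufficient(rmae_per_test, window=3, threshold=0.02):
    """Return True once the inter-annotator RMAE values have stabilised.

    rmae_per_test: RMAE recomputed each time the dataset grows.
    Stability is declared when the standard deviation of the last `window`
    values falls below `threshold`.
    """
    if len(rmae_per_test) < window:
        return False
    return float(np.std(rmae_per_test[-window:])) < threshold

# Illustrative series: variability is high at first and settles as images are added.
rmae_per_test = [0.21, 0.17, 0.15, 0.142, 0.139, 0.138]
print(dataset_size_sufficient(rmae_per_test))  # True: further additions unlikely to matter
```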

To establish the ground truth, we employed mean and median statistics, which are widely recognized and established in the literature as the most suitable methods for constructing ground truth when working with ordinal categorical data. After generating the ground truth labels, the dataset was split to obtain a training, a validation, and a test set for each task. As there was no patient-related metadata available, the data was split randomly using K-fold cross-validation. The reported result metrics are computed on the validation sets.
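As an illustration of the splitting step, assuming scikit-learn; the number of folds and the random seed are assumptions rather than values taken from the actual protocol:

```python
import numpy as np
from sklearn.model_selection import KFold

# Indices of the annotated images; with no patient-level metadata available,
# a plain random K-fold split is used, as described above.
image_indices = np.arange(5459)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kfold.split(image_indices)):
    # Metrics reported in this test are computed on the validation subsets.
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation images")
```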

To determine whether the dataset is sufficiently large to meet our objectives, we first evaluate the complexity of the tasks at hand. This evaluation involves a thorough analysis of medical evidence, supplemented by a variability study conducted in collaboration with experienced doctors. These doctors possess an in-depth understanding of the nuances of each problem, enabling us to establish a robust baseline for our analysis.

In this scenario, we computed key metrics for the annotated dataset, specifically the Relative Mean Absolute Error (RMAE) and the Relative Standard Deviation (RSD). The results from the doctors yielded values of approximately 13.8% for RMAE and 12.56% for RSD. While these figures reflect a moderate level of agreement among the experts, such a degree of concordance is expected for a problem of this complexity. These metrics not only imply that the dataset's size is adequate, but also establish a foundational benchmark for our computer vision algorithms.

Model training and evaluation

Regarding the model, we trained several multi-output classifiers, one for each task. Each classifier consists of a single deep-learning backbone and several classification heads, one per visual sign. We used the EfficientNet-B0 network architecture, pre-trained on approximately 1.28 million images (1,000 object categories) from the 2014 ImageNet Large Scale Visual Recognition Challenge, and trained it on our dataset using transfer learning. EfficientNets achieve better accuracy and efficiency than previous convolutional neural networks with fewer parameters by applying a scaling method that uniformly scales all dimensions of depth, width and resolution using a simple yet highly effective compound coefficient. There are eight versions with different numbers of parameters; B0 is the smallest, achieving a state-of-the-art 77.1% top-1 accuracy on ImageNet with 5 million parameters. Regarding the transfer learning strategy, all the models undergo the same training: first, we freeze all layers except the last linear layer and train the model for several epochs; then we unfreeze all layers and fine-tune the entire model for a further number of epochs.
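A minimal sketch of this setup, assuming PyTorch and torchvision; the class name, the example visual signs, and the number of intensity levels per head are illustrative and not taken from the actual implementation:

```python
import torch.nn as nn
from torchvision import models

class VisualSignClassifier(nn.Module):
    """One shared EfficientNet-B0 backbone plus one classification head per visual sign."""

    def __init__(self, classes_per_sign):
        super().__init__()
        backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        in_features = backbone.classifier[1].in_features
        backbone.classifier = nn.Identity()  # keep only the ImageNet-pretrained feature extractor
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {sign: nn.Linear(in_features, n_levels) for sign, n_levels in classes_per_sign.items()}
        )

    def forward(self, x):
        features = self.backbone(x)
        return {sign: head(features) for sign, head in self.heads.items()}

# Illustrative signs and intensity scales (e.g. a 0-3 scale gives 4 levels per head).
model = VisualSignClassifier({"erythema": 4, "edema": 4, "oozing": 4})

# Phase 1: freeze the backbone and train only the classification heads for several epochs.
for param in model.backbone.parameters():
    param.requires_grad = False
# ... train the heads ...

# Phase 2: unfreeze everything and fine-tune the entire model for further epochs.
for param in model.backbone.parameters():
    param.requires_grad = True
# ... fine-tune ...
```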

In summary, we implemented one model for each of the following tasks, each task including one or more visual signs:

  • Estimation of intensity of erythema, edema, oozing, excoriation, lichenification, and dryness
  • Estimation of intensity of erythema, induration and desquamation
  • Estimation of erythema, pustulation, and desquamation
  • Estimation of erythema, exudation, edges, affected tissues
  • Estimation of facial palsy

The processors we assessed in this test are APULSI, APASI, NSIL and ASCORAD.

Results

The algorithms perform strongly across all tasks. They maintain an average Relative Mean Absolute Error (RMAE) of just 13% for assessing erythema, edema, oozing, excoriation, lichenification, and dryness, and all individual RMAE values remain below the 20% threshold. For induration, desquamation, and pustulation, RMAE values are also comfortably below 20%, meeting the predetermined criteria. For the evaluation of edges, exudation and affected tissues, the models achieve balanced accuracies of 64%, 74% and 69% respectively, signifying strong performance. When measuring facial palsy, the algorithm excels with an RMAE of only 9%.
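For context, a short sketch of how the two metric types could be computed, assuming scikit-learn and normalisation of the mean absolute error by the scale range; the toy labels are illustrative only:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def rmae(y_true, y_pred, scale_range):
    """Relative MAE: mean absolute error normalised by the scale range
    (assumed normalisation; e.g. 3 for a 0-3 intensity scale)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale_range

# Hypothetical predictions on a 0-3 intensity scale for a single visual sign.
y_true = [0, 1, 2, 3, 2, 1]
y_pred = [0, 1, 2, 2, 2, 1]
print(f"RMAE: {rmae(y_true, y_pred, scale_range=3):.1%}")  # used for the ordinal intensity signs
print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.1%}")  # used for the categorical signs
```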

Protocol deviations

There were no deviations from the initial protocol.

Conclusions

The quantifiable data on the intensity of clinical signs provided to the user matches the expertise of dermatologists. This ensures the quality of the data, offering healthcare practitioners the best possible information to support their clinical assessments.

Test checklist

The following checklist verifies the completion of the goals and metrics specified in the requirement REQ_001.

Requirement verification

  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying erythema intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying induration intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying desquamation intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying edema intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying oozing intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying excoriation intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying lichenification intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying dryness intensity is less than 20%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying pustulation intensity is less than 20%.
  • Balanced Accuracy for quantifying exudation intensity is greater than 60%.
  • Balanced Accuracy for quantifying edges intensity is greater than 60%.
  • Balanced Accuracy for quantifying affected tissues intensity is greater than 60%.
  • The algorithm's Relative Mean Absolute Error (RMAE) for quantifying facial palsy intensity is less than 20%.

Evidence

Evidence for erythema, edema, oozing, excoriation, lichenification and dryness can be found in the following article:

Automatic SCOring of Atopic Dermatitis Using Deep Learning: A Pilot Study

Alfonso Medela, Taig Mac Carthy, S. Andy Aguilar Robles, Carlos M. Chiesa-Estomba, Ramon Grimalt

Published: February 10, 2022

DOI: https://doi.org/10.1016/j.xjidi.2022.100107

Evidence for induration, desquamation and pustulation can be found in the following drafts containing the whole procedure:

APASI: Automatic Psoriasis Area Severity Index Estimation using Deep Learning

Alfonso Medela¹, Taig Mac Carthy¹, Andy Aguilar¹, Pedro Gómez-Tejerina¹, Carlos M. Chiesa-Estomba²,³,⁴, Fernando Alfageme-Roldán⁵, and Gaston Roustan Gullón⁵

1 Department of Medical Computer Vision and PROMs, LEGIT.HEALTH, 48013 Bilbao, Spain

2 Department of Otorhinolaryngology, Osakidetza, Donostia University Hospital, 20014 San Sebastian, Spain

3 Biodonostia Health Research Institute, 20014 San Sebastian, Spain

4 Head Neck Study Group of Young-Otolaryngologists of the International Federations of Oto-rhino-laryngological Societies (YO-IFOS), 13005 Marseille, France

5 Department of Dermatology, Hospital Puerta de Hierro, Majadahonda, Madrid, Spain

Not published at the time of writing.

In the following confusion matrix we see the performance on pustules:

[Figure: confusion matrix of the performance on pustules]

Evidence of an outcome for exudation intensity, edges intensity, affected tissue intensity:

| Evaluator | Visual sign intensity |
| --- | --- |
| Model | Edges: Delimited. Exudation: Fibrinous. Exudate type: Serous. Affected tissues: Affection of bone and/or adnexal tissues. |
| Annotator | Edges: Delimited. Exudation: Fibrinous. Exudate type: Serous. Affected tissues: Affection of bone and/or adnexal tissues. |

The sample image is the following:

| Sample image | Visual sign intensity |
| --- | --- |
| [Figure: sample image] | Edges: Delimited. Exudation: Fibrinous. Exudate type: Serous. Affected tissues: Affection of bone and/or adnexal tissues. |

Evidence of facial palsy intensity quantification:

[Figure: facial palsy intensity quantification]

Signature meaning

The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:

  • Tester: JD-017, JD-009, JD-004
  • Approver: JD-005
All the information contained in this QMS is confidential. The recipient agrees not to transmit or reproduce the information, neither by himself nor by third parties, through whichever means, without obtaining the prior written permission of Legit.Health (AI LABS GROUP S.L.)