REQ_002 The user receives quantifiable data on the count of clinical signs
Category
Major
Source
- Dr. Constanza Balboni, dermatologist at DermoMedic
- Dr. José Luis López Estebaranz, dermatologist at Dermomedic
- Dr. Javier Martín Alcalde, dermatologist at CDI
- Dr. José Luis Ramírez Bellver, dermatologist at CDI
- Dr. Alejandro Martín Gorgojo, dermatologist at Dermomedic
- Dr. Pedro Rodriguez Jiménez, dermatologist at CDI
- Dr. Fernando Alfageme, dermatologist at Hospital Puerta de Hierro
- Dr. Alejandro Vilas Sueiro, dermatologist at Ferrol Teaching University Hospital Complex
- Dr. Ana María González Pérez, dermatologist at Salamanca Teaching University Hospital
- Dr. Laura Vergara de la Campa, Toledo Teaching University Hospital Complex
- Dr. Loreto Luna Bastante, Rey Juan Carlos Teaching University Hospital
- Dr. Rubén García Castro, Fundación Jiménez Díaz Teaching University Hospital
Activities generated
- MDS-98
- MDS-102
- MDS-101
Causes failure modes
- The AI models might incorrectly identify or miscount the number of clinical signs due to algorithmic errors or misinterpretation of the images.
- Poor quality or improperly taken images could result in incorrect counts due to the AI's inability to accurately identify clinical signs.
- Delays or timeouts in processing and delivering the count data could affect timely access to accurate information.
Related risks
- Misrepresentation of magnitude returned by the device
- Misinterpretation of data returned by the device
- Incorrect clinical information: the care provider receives into their system data that is erroneous
- Incorrect diagnosis or follow up: the medical device outputs a wrong result to the HCP
- Incorrect results shown to patient
- Sensitivity to image variability: analysis of the same lesion with images taken with deviations in lighting or orientation generates significantly different results
- Inaccurate training data: image datasets used in the development of the device are not properly labeled
- Biased or incomplete training data: image datasets used in the development of the device are not properly selected
- Lack of efficacy or clinical utility
- Stagnation of model performance: the AI/ML models of the device do not benefit from the potential improvement in performance that comes from re-training
- Degradation of model performance: automatic re-training of models decreases the performance of the device
User Requirement, Software Requirement Specification and Design Requirement
- User Requirement 2.1: Users shall have access to the exact numerical count of identified clinical signs.
- Software Requirement Specification 2.2: The software shall utilize algorithms capable of accurately identifying and counting multiple clinical signs from the input data.
- Design Requirement 2.3: The device shall output numerical data in a structured format, adhering to the FHIR standard for consistent data interchange.
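As an illustration of Design Requirement 2.3, the following minimal sketch shows how a clinical-sign count could be serialised as a FHIR R4 Observation resource. The code system, the patient reference, and the resource fields beyond the count itself are illustrative placeholders, not the device's actual output schema.

```python
# Minimal sketch of a clinical-sign count serialised as a FHIR R4 Observation.
# The coding values and patient reference below are illustrative placeholders.
import json

def build_count_observation(patient_id: str, sign_name: str, count: int) -> dict:
    """Wrap a lesion count in a FHIR R4 Observation with an integer value."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            # Placeholder coding; a real device would use an agreed code system.
            "coding": [{"system": "http://example.org/clinical-signs",
                        "code": sign_name,
                        "display": f"Automated {sign_name} count"}],
            "text": f"{sign_name} count",
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueInteger": count,
    }

if __name__ == "__main__":
    obs = build_count_observation("example-patient-001", "inflammatory-lesion", 14)
    print(json.dumps(obs, indent=2))
```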
Description
Skin conditions encompass a range of visible manifestations and associated symptoms. Disease severity can be linked to the intensity of these visible signs, while in other instances, it hinges on their quantity. For example, in acne patients, the number of nodular lesions plays a pivotal role in gauging severity.
Accurate lesion counting is essential for comprehending the extent of a condition. However, similar to intensity quantification, lesion counting is prone to variability. This variability arises from multiple factors, including discrepancies in categorizing specific lesion types and the inadvertent repetition or omission of lesions due to human error. Expertise contributes to minimizing the first type of error, while the second type is influenced by the observer's concentration and fatigue. Consequently, greater expertise and training on the part of the observer generally result in more precise outcomes, although a subjective element persists.
To address this challenge, there is a pressing need to develop a suite of algorithms tailored to automate the counting of various lesion types, including nodules, abscesses, draining tunnels, inflammatory lesions, and hives. Automated counting makes severity assessment more reliable. For example, with an automatic hive counting algorithm, the objective urticaria severity, most commonly assessed with the UAS score, can be calculated directly from the total hive count. The same benefit applies to hidradenitis suppurativa (using the automated counts of nodules, abscesses, and draining tunnels to compute the IHS4 score) and acne (counting the total number of inflammatory lesions to assess severity). However, the ranges for the “mild”, “moderate”, and “severe” categories depend on the pathology and the severity score.
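As a worked illustration of how automated counts feed a severity score, the sketch below computes the IHS4 from nodule, abscess, and draining tunnel counts (IHS4 = nodules × 1 + abscesses × 2 + draining tunnels × 4) and maps it to the commonly cited severity bands. The function names and example counts are illustrative, not the device implementation.

```python
# Illustrative sketch: deriving the IHS4 severity score from automated lesion counts.
# IHS4 = nodules x 1 + abscesses x 2 + draining tunnels x 4; the severity bands
# below are the commonly cited cut-offs (mild <= 3, moderate 4-10, severe >= 11).

def ihs4_score(nodules: int, abscesses: int, draining_tunnels: int) -> int:
    """Return the IHS4 composite score for hidradenitis suppurativa."""
    return nodules * 1 + abscesses * 2 + draining_tunnels * 4

def ihs4_severity(score: int) -> str:
    """Map an IHS4 score to a severity category."""
    if score <= 3:
        return "mild"
    if score <= 10:
        return "moderate"
    return "severe"

if __name__ == "__main__":
    # Example: counts as they could be returned by the counting algorithms.
    score = ihs4_score(nodules=2, abscesses=1, draining_tunnels=1)
    print(score, ihs4_severity(score))  # -> 8 moderate
```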
Automating lesion counting follows a structured approach, typically divided into two primary phases, echoing the common workflow of data science projects: data annotation and algorithm development.
Data annotation
The first phase, data annotation, is essential to understand the inter-observer variability and serves as the foundation for training the algorithm. During this step, medical professionals carefully review individual images and draw bounding boxes around the specific lesions we are interested in. All professionals are given some training prior to annotation to ensure they understand the task for which data needs to be labelled.
The key here is selecting the right medical experts and determining an appropriate group size. We typically engage a minimum of six experienced physicians for this task; in this case, we enlisted six doctors who specialise in assessing the severity of the pathologies commonly associated with the visual signs we are studying, specifically acne and hidradenitis suppurativa.
By pooling the assessments of these six experts, we establish a ground truth dataset. This dataset serves a dual purpose: it becomes the foundation for training our algorithms and also allows us to gauge inter-observer variability, a critical measure of the performance and consistency of our measurements.
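The sketch below illustrates one possible way to pool bounding boxes from several annotators into a consensus ground truth: a box is retained only when a majority of annotators marked an overlapping lesion. The IoU threshold, the majority rule, and the matching strategy are assumptions made for illustration, not the exact consolidation procedure used for the device.

```python
# Illustrative sketch: pooling bounding boxes from several annotators into a
# consensus ground truth. A box is kept when a majority of annotators drew an
# overlapping box (IoU >= 0.5). Thresholds and strategy are assumptions.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def consensus_boxes(annotations: List[List[Box]], iou_thr: float = 0.5) -> List[Box]:
    """Keep a box when at least half of the annotators marked an overlapping lesion."""
    consensus: List[Box] = []
    n_annotators = len(annotations)
    for i, boxes in enumerate(annotations):
        for box in boxes:
            # Count annotators (including the current one) who drew a matching box.
            votes = 1 + sum(
                any(iou(box, other) >= iou_thr for other in annotations[j])
                for j in range(n_annotators) if j != i
            )
            already_kept = any(iou(box, kept) >= iou_thr for kept in consensus)
            if votes >= (n_annotators // 2 + 1) and not already_kept:
                consensus.append(box)
    return consensus

if __name__ == "__main__":
    # Three annotators labelling the same image (toy coordinates).
    annotator_1 = [(10, 10, 40, 40), (100, 100, 130, 130)]
    annotator_2 = [(12, 11, 41, 42)]
    annotator_3 = [(11, 9, 39, 41), (200, 200, 230, 230)]
    print(consensus_boxes([annotator_1, annotator_2, annotator_3]))
```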
Algorithm development
The next phase involves the development of the algorithm, which will rely on the ground truth data collected during the previous stage. The outcomes generated by the algorithm will then be juxtaposed with the measured variability. This step is particularly crucial since tasks of this complexity, prone to inherent variability, necessitate comparison with the prevailing baseline or the state-of-the-art standards. This comparative analysis is essential for validating the algorithm's performance accurately.
It's worth emphasizing that the convolutional neural networks we are training will assimilate knowledge from the collective expertise of specialists. It's important to acknowledge that a significant subjective element is inherent in this process, given the nuanced nature of the task.
When possible, the dataset used to develop each algorithm is split into training, validation, and test sets. However, when the sample size is limited, the data is split into training and validation only to ensure each set contains enough data.
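A minimal sketch of that split logic is shown below, assuming a simple image-level random split; the proportions and the size threshold are illustrative assumptions, and grouping by patient (common in practice) is not shown.

```python
# Minimal sketch of the dataset split described above: train/validation/test when
# the sample size allows it, otherwise train/validation only. Proportions and the
# size threshold are illustrative assumptions.
import random
from typing import List, Sequence, Tuple

def split_dataset(
    image_ids: Sequence[str],
    min_size_for_test: int = 300,
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (train, validation, test) id lists; test is empty for small datasets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    if len(ids) >= min_size_for_test:
        n_train = int(0.7 * len(ids))
        n_val = int(0.15 * len(ids))
        return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
    n_train = int(0.8 * len(ids))
    return ids[:n_train], ids[n_train:], []

if __name__ == "__main__":
    train, val, test = split_dataset([f"img_{i:04d}" for i in range(100)])
    print(len(train), len(val), len(test))  # -> 80 20 0 for a small dataset
```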
Success metrics
In crafting our approach to selecting the most suitable metric for each counting algorithm, we analysed state-of-the-art solutions for object detection and counting, such as [1,2], and carefully considered the type of output. For lesion and hive counting, where the outputs are utilised directly without further consolidation, we employed precision and recall metrics. These metrics are widely recognised for their efficacy in evaluating object detection models, particularly those driven by deep learning.
The complexity of each task and the varying levels of data availability necessitated a tailored approach to defining success metrics. Capturing inflammatory lesions tends to be more straightforward in typical image acquisition scenarios. Conversely, identifying hives poses greater challenges due to their variable shapes and colors, which can blend in with healthy skin. This complexity is further accentuated when hives of irregular shapes overlap, often resulting in reduced precision and recall. To address these challenges, we adjusted the success metrics to reflect the relative difficulty of each task. This difficulty is also noted in the variability analysis.
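To make the evaluation concrete, the sketch below computes precision, recall, and F1 for a set of predicted boxes against ground-truth boxes using greedy one-to-one matching at a fixed IoU threshold. The threshold value and the matching strategy are illustrative assumptions, not a description of the exact evaluation pipeline.

```python
# Illustrative sketch: precision, recall and F1 for a detection-based counting
# algorithm, using greedy one-to-one matching at a fixed IoU threshold (0.5 here,
# an assumption for illustration).
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_metrics(pred: List[Box], truth: List[Box], iou_thr: float = 0.5):
    """Return (precision, recall, f1) after greedy one-to-one matching."""
    unmatched_truth = list(truth)
    tp = 0
    for p in pred:
        best = max(unmatched_truth, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= iou_thr:
            tp += 1
            unmatched_truth.remove(best)
    fp = len(pred) - tp
    fn = len(truth) - tp
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    predicted = [(10, 10, 40, 40), (80, 80, 110, 110)]
    ground_truth = [(12, 11, 41, 42), (200, 200, 230, 230)]
    print(detection_metrics(predicted, ground_truth))  # -> (0.5, 0.5, 0.5)
```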
For inflammatory lesions, we set a target F1 score of 0.7, with a minimum precision and recall of 70%. This decision is supported by published evidence.
A study on automatic acne object detection and acne severity grading using artificial intelligence [3] reported 73% precision, 76% recall, and a 74% F1 score for acne severity quantification based on the IGA grading scale, which focuses on inflammatory lesions. This level of performance suggests that inflammatory lesions are relatively easy to distinguish. However, the study does not address inter-annotator variability, and research on the variability of object detection metrics such as precision, recall, and F1 score is scarce.
Despite the lack of variability data, we decided to base our success criteria on the performance of the reported algorithm, whose dataset was built on solid ground: each image was initially labeled by a junior dermatologist, and a senior dermatologist reviewed and corrected the labels as needed.
On the other hand, the target F1 score for hive detection is set at 0.5. This lower threshold reflects the greater visual challenge of identifying and counting hives, as noted by the participating dermatologists. Although there is no specific scientific evidence on F1 scores for hive detection, our studies with Dr. Alfageme and five other expert dermatologists show that the annotators themselves achieve an F1 score between 0.38 and 0.46; a minimum value of 0.5 therefore indicates better performance than expert dermatologists. In terms of precision and recall, since F1 = 2PR / (P + R) reduces to the shared value when precision and recall are equal, a balanced model with no trade-off between the two can only reach an F1 of 0.5 if both are at 50%. This is the minimum value for both success metrics.
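The balanced case can be checked directly from the F1 definition: when precision equals recall, F1 collapses to that shared value, so requiring F1 = 0.5 without a precision/recall trade-off forces both to 50%. A short, purely illustrative check:

```python
# Check of the balanced case: when precision == recall == p, F1 = 2*p*p/(p+p) = p,
# so requiring F1 = 0.5 with no precision/recall trade-off means both must be 50%.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

for p in (0.4, 0.5, 0.6):
    assert abs(f1(p, p) - p) < 1e-12  # balanced F1 equals the shared value
print(f1(0.5, 0.5))  # -> 0.5
```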
In the process of counting nodules, abscesses, and draining tunnels, we opted for the Mean Absolute Error (MAE) metric. This choice is not arbitrary but is based on the nature of the data we are dealing with. The counts from these features are aggregated to derive the IHS4 score, a composite numerical value that is widely used in the field of dermatology for the assessment of disease severity.
The MAE metric is particularly suitable for this task because it quantifies the average magnitude of the errors in a set of predictions, without considering their direction. This makes it a more reliable measure of model performance, especially when dealing with count data that can't be negative.
Moreover, the MAE facilitates a direct comparison of the model's performance with that of human annotators, aligning with our objective to benchmark against the current gold standard in the field. This is crucial as the task of lesion enumeration is inherently complex and often carried out using ultrasound to more accurately identify the type of lesions.
Given this complexity, we opted for a direct comparison of the algorithm output with the variability observed among human annotators. This approach allows us to assess the model's performance in a real-world context, taking into account the inherent uncertainty and variability in human annotations.
In terms of model performance, a lower MAE is better, since it indicates that the model's predictions are closer to the actual values. However, some degree of error is inevitable given the complexity of the task and the variability in human annotations. Therefore, we allow a maximum deviation of 10% relative to the MAE observed for individual annotators as a reasonable error margin. This threshold provides a balance between model accuracy and practical feasibility: by setting it at 10%, we aim to ensure that the model's predictions remain sufficiently accurate for practical applications while still allowing some margin of error to account for uncertainty and variability in the data.
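A minimal sketch of this acceptance check is shown below, under the assumption that both the algorithm and each individual annotator are scored against the consensus counts and that the algorithm must stay within 10% of the annotators' mean MAE. Names and example counts are illustrative.

```python
# Illustrative sketch of the MAE-based acceptance criterion: the algorithm passes
# when its MAE is below, or within 10% of, the MAE observed for individual
# annotators. Variable names and example counts are assumptions for illustration.
from statistics import mean
from typing import Sequence

def mae(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Mean absolute error between two aligned count sequences."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))

def meets_criterion(algorithm_mae: float, annotator_mae: float, margin: float = 0.10) -> bool:
    """True when the algorithm MAE is within the allowed margin of the annotator MAE."""
    return algorithm_mae <= annotator_mae * (1.0 + margin)

if __name__ == "__main__":
    consensus_counts = [3, 0, 5, 2, 7]          # consensus nodule counts per image
    algorithm_counts = [3, 0, 5, 2, 6]          # counts predicted by the model
    annotator_counts = [[3, 1, 4, 2, 7],        # counts from individual annotators
                        [2, 0, 6, 2, 8]]
    algo_mae = mae(algorithm_counts, consensus_counts)
    annot_mae = mean(mae(a, consensus_counts) for a in annotator_counts)
    print(algo_mae, annot_mae, meets_criterion(algo_mae, annot_mae))  # -> 0.2 0.5 True
```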
Goal | Metric |
---|---|
Automatic inflammatory lesion counting is at expert consensus level | Algorithm precision and recall are greater than 70% (F1 > 0.7) |
Automatic hive counting is at expert consensus level | Algorithm precision and recall are greater than 50% (F1 > 0.5) |
Automatic nodule counting is at expert consensus level | The Mean Absolute Error (MAE) of the algorithm is either less than or within 10% of the MAE observed for individual annotators |
Automatic abscess counting is at expert consensus level | The Mean Absolute Error (MAE) of the algorithm is either less than or within 10% of the MAE observed for individual annotators |
Automatic draining tunnel counting is at expert consensus level | The Mean Absolute Error (MAE) of the algorithm is either less than or within 10% of the MAE observed for individual annotators |
[1] Cai, Y., Du, D., Zhang, L., Wen, L., Wang, W., Wu, Y., & Lyu, S. (2019). Guided attention network for object detection and counting on drones. arXiv preprint arXiv:1909.11307.
[2] Wang, Y., Hou, J., Hou, X., & Chau, L. P. (2021). A self-training approach for point-supervised object detection and counting in crowds. IEEE Transactions on Image Processing, 30, 2876-2887.
[3] Huynh, Q. T., Nguyen, P. H., Le, H. X., Ngo, L. T., Trinh, N.-T., Tran, M. T.-T., Nguyen, H. T., Vu, N. T., Nguyen, A. T., Suda, K., et al. (2022). Automatic acne object detection and acne severity grading using smartphone images and artificial intelligence. Diagnostics, 12(8), 1879. https://doi.org/10.3390/diagnostics12081879
Previous related requirements
- REQ_001
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Tester: JD-017, JD-009, JD-005, JD-004
- Approver: JD-003