SRS-001 The device processes clinical signs and generates an output quantifying them
Description
The device processes visual clinical signs and generates an output quantifying their intensity, either on an ordinal numeric scale or in ordinal categories, depending on the sign.
Desired outcome
Success metrics
To accurately gauge the performance of our models, it is essential to recognise that every case is an ordinal classification problem. Although the labels in this context are numbers, the output of a deep learning classifier can take two forms: an integer, obtained by applying the argmax function to the predicted probabilities, or a continuous value, obtained by multiplying each category by its probability and summing the results. Given the nature of the problem, we adopt the evaluation metrics used by state-of-the-art ordinal classification solutions [1], [2], namely the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), which are the most appropriate for evaluating most types of visual signs.
However, we must consider that the range of each visual sign can vary, as they are not all quantified on the same scale; this scaling is grounded in clinical evidence. Consequently, interpreting the metric as a percentage, as with the Relative MAE (RMAE), is beneficial for comparative performance analysis. This rationale underpins our decision to use the RMAE for most visual sign intensity quantifications.
Regarding the specific value of 20%, consider the most prevalent categories, which range from 0 to 4. In this scenario, a random assessment by multiple annotators would yield an average error of about 40% under a uniformly distributed intensity scenario. If the distribution instead peaks at intensity values of 2-3, as is typical of real-world data, the average error falls to approximately 30-35%. Drawing on this observation, and on the variability analysis of annotators, whose error rates typically fall between 12% and 20% [3], we set our threshold at 20%. This threshold is chosen with the objective of achieving results that are on par with, or surpass, those of specialists.
When intensity is quantified in different terms, such as ordinal categories without a defined distance between them, we instead use accuracy, the most common metric in classification problems. In particular, we compute the Balanced Accuracy (BAC), which accounts for performance across all categories rather than relying on standard accuracy, which can be skewed by an imbalanced dataset, a frequent occurrence in clinical datasets.
Upon reviewing the variability analysis, we determined that a 60% threshold is fitting, as it matches the evaluations of the majority of experts. Four wound care specialists assessed factors such as edge intensity, tissue type, and exudate level, yielding accuracy rates between 50% and 90%. Considering the performance of the most experienced annotator, who approached the consensus level of around 75%, a 60% threshold requires the model to perform at the level of the majority of experts, consistent with the targets below.
| Goal | Metric | Target |
|---|---|---|
| The automatic intensity quantification of erythema is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of induration is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of desquamation is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of edema is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of oozing is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of excoriation is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of lichenification is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of dryness is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of pustulation is at expert consensus level | RMAE | < 20% |
| The automatic intensity quantification of exudation is at expert consensus level | BAC | > 60% |
| The automatic intensity quantification of edges is at expert consensus level | BAC | > 60% |
| The automatic intensity quantification of affected tissues is at expert consensus level | BAC | > 60% |
| The automatic intensity quantification of facial palsy is at expert consensus level | RMAE | < 20% |
References
[1] Cao, W., Mirjalili, V., & Raschka, S. (2020). Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140, 325-331.

[2] Shi, X., Cao, W., & Raschka, S. (2023). Deep neural networks for rank-consistent ordinal regression based on conditional probabilities. Pattern Analysis and Applications, 26(3), 941-955.

[3] Medela, A., Mac Carthy, T., Aguilar Robles, S. A., Chiesa-Estomba, C. M., & Grimalt, R. (2022). Automatic SCOring of Atopic Dermatitis using deep learning: a pilot study. JID Innovations, 2(3), 100107. doi: 10.1016/j.xjidi.2022.100107. PMID: 35990535; PMCID: PMC9382656.
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in the approval of this document, and their roles in the approval process as defined in the Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001