REQ_004 The user receives an interpretative distribution of the possible ICD categories represented in the pixels of the image
Category
Major
Source
- Alfonso Medela, JD-005
- Dr. Elena Sanchez-Largo
- INTERNAL USER
- SYSTEM INPUTS AND OUTPUTS
- DATABASE AND DATA DEFINITION ARCHITECTURE
Activities generated
- MDS-103
Causes failure modes
- The diagnosis AI model might misinterpret the clinical signs or incorrectly map image pixels to ICD categories, resulting in inaccurate distribution representation.
- Poor quality or improperly taken images might result in incorrect analysis and ICD categorization.
- Errors in aggregating data from multiple images can lead to inaccuracies in the distribution of ICD categories.
- The presentation of the distribution might be unclear, misleading, or difficult to interpret.
- Changes or updates to ICD codes might not be reflected in the system, leading to outdated or incorrect categorization.
- Variations in lighting, angle, or distance in the images might affect the AI's ability to accurately determine the ICD category distribution.
- If the AI model is not trained on a sufficiently diverse dataset, it might fail to accurately represent ICD categories in certain images.
Related risks
- Misrepresentation of magnitude returned by the device
- Misinterpretation of data returned by the device
- Incorrect clinical information: the care provider receives erroneous data into their system
- Incorrect diagnosis or follow-up: the medical device outputs a wrong result to the HCP
- Inconsistent or unreliable output: analysis of the same image by the same version of the device generates different results
- Sensitivity to image variability: analysis of the same lesion with images taken under deviations in lighting or orientation generates significantly different results
- Inaccurate training data: image datasets used in the development of the device are not properly labeled
- Biased or incomplete training data: image datasets used in the development of the device are not properly selected
- Stagnation of model performance: the AI/ML models of the device do not benefit from the potential improvement in performance that comes from re-training
- Degradation of model performance: automatic re-training of models decreases the performance of the device
User Requirement, Software Requirement Specification and Design Requirement
- User Requirement 4.1: Users shall receive a clear representation of the possible ICD classes based on image analysis.
- Software Requirement Specification 4.2: Implement image analysis algorithms capable of identifying patterns and correlating them with potential ICD classes.
- Design Requirement 4.3: Device response data regarding ICD class distribution shall be precise, using clear key naming and standardized coding conventions.
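As an illustration of Design Requirement 4.3, a device response could be structured along the lines of the following sketch. The key names, ICD codes, and confidence values shown here are hypothetical examples chosen for clarity, not the device's actual schema:

```python
# Hypothetical response payload illustrating clear key naming and
# standardized ICD coding for the class distribution (example values only).
response = {
    "icd_distribution": [
        {"icd_code": "ED80", "label": "Acne", "confidence": 0.62},
        {"icd_code": "ED80.41", "label": "Acne conglobata", "confidence": 0.25},
        {"icd_code": "2C30", "label": "Melanoma of skin", "confidence": 0.13},
    ]
}

# The confidence values form a proportional distribution summing to 100%.
total = sum(entry["confidence"] for entry in response["icd_distribution"])
```

A consumer of the API can validate the payload by checking that each entry carries an ICD code plus a confidence value and that the confidences sum to 1.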
Description
Diagnosing skin conditions is a challenging task for healthcare professionals with varying levels of expertise. This process involves visually examining a patient's skin and gathering additional information, such as the patient's medical history, genetic factors, symptoms, or any relevant incidents. In dermatology, the visual aspect plays a crucial role in the initial assessment by healthcare providers.
However, there are thousands of different skin conditions, making it difficult even for experienced dermatologists to provide an accurate diagnosis on the first attempt. This challenge is particularly pronounced in primary care, where doctors may lack specialized knowledge and tend to refer patients to specialists. When doctors attempt a diagnosis, they typically consider a range of possible skin conditions (ICD categories) and prescribe treatment based on their confidence level. They may also consult reference books, atlases, or other resources to compare the patient's condition with similar cases.
To assist doctors in the diagnostic process and provide a more reliable and valuable tool, we are developing an algorithm. This algorithm, using one or more images of a skin condition, generates an interpretative distribution representing potential ICD categories present in the image pixels. Based on existing research, presenting the top five likely categories along with their confidence values can enhance diagnostic accuracy by up to 13%, offering significant support to healthcare professionals.
We will calculate various binary indicators based on the interpretative distribution of potential ICD categories: the malignancy indicator, the dermatological condition indicator, and the critical complexity indicator. These clinical outputs give users extra information to assess patients more effectively. Each binary indicator is calculated by multiplying the ICD category probabilities by a binary matrix that marks which categories are malignant, which are dermatological conditions, and which require urgent dermatological care. This list is created and refined by expert dermatologists based on scientific literature. The following table shows an example with some categories:
| ICD category | Malignant | Dermatological condition | Urgent care |
|---|---|---|---|
| ED80 Acne | 0 | 1 | 0 |
| ED80.41 Acne conglobata | 0 | 1 | 1 |
| 2C30 Melanoma of skin | 1 | 1 | 1 |
The final formula is as follows:
Binary Indicator = ICD Category Probability * Binary Indicator Matrix
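This multiplication can be sketched in a few lines using the three example categories from the table above. Plain dictionaries stand in here for the real probability vector and binary matrix, and the probability values are invented for illustration:

```python
# Example ICD category probabilities (invented for illustration).
probs = {"ED80": 0.6, "ED80.41": 0.3, "2C30": 0.1}

# Binary indicator matrix from the table above: for each category,
# whether it is malignant, a dermatological condition, or urgent.
flags = {
    "ED80":    {"malignant": 0, "dermatological": 1, "urgent": 0},
    "ED80.41": {"malignant": 0, "dermatological": 1, "urgent": 1},
    "2C30":    {"malignant": 1, "dermatological": 1, "urgent": 1},
}

def binary_indicator(probs, flags, column):
    # Binary Indicator = sum over categories of P(category) * flag(category)
    return sum(p * flags[code][column] for code, p in probs.items())

malignancy = binary_indicator(probs, flags, "malignant")          # ≈ 0.1
dermatological = binary_indicator(probs, flags, "dermatological")  # ≈ 1.0
urgent_care = binary_indicator(probs, flags, "urgent")             # ≈ 0.4
```

Each indicator stays in [0, 1] because it is a probability-weighted sum of 0/1 flags over a distribution that sums to 1.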
Scientific evidence
Han et al. trained an advanced algorithm on a vast dataset of 220,680 images encompassing 174 skin disorders. The algorithm accurately predicted malignancy, offered primary treatment suggestions, and provided multi-class classification among 134 disorders, ultimately enhancing the performance of medical professionals. With substantial improvements in sensitivity and specificity, it proved particularly valuable in empowering dermatologists and non-medical professionals alike in diagnosing skin conditions.
A second study introduced a convolutional neural network (CNN) exclusively trained with dermoscopic images for clinical melanoma identification. When compared to 145 dermatologists from 12 German university hospitals, the CNN performed on par, achieving impressive sensitivity and specificity, making it a valuable tool for melanoma diagnosis.
Lastly, the third paper focused on skin cancer but trained a single CNN using a vast dataset comprising 2,032 different diseases. It tested the CNN's capabilities against 21 board-certified dermatologists in two critical binary classification scenarios: identifying keratinocyte carcinomas vs. benign seborrheic keratoses and distinguishing malignant melanomas vs. benign nevi. Remarkably, the CNN's performance matched that of dermatologists, showcasing its potential as an artificial intelligence tool for skin cancer classification, effectively bridging the gap between AI and medical expertise in dermatology.
Methodology
Building on the current state of the art, we go one step further by improving both the dataset and the deep neural network, applying vision transformers (ViT) to our dataset of over one million images. We chose the ViT architecture encouraged by the existing literature (such as “Vision Transformers are Robust Learners”).
On the dataset side, we follow a meticulous curation process encompassing data enrichment, diversification, and quality assurance. This provides a robust and diverse foundation for training and evaluation, so that the model gains a more nuanced understanding of the visual data it encounters.
In parallel, we optimize the deep learning model architecture itself: fine-tuning model parameters, enhancing feature extraction capabilities, and employing state-of-the-art deep learning techniques to achieve higher accuracy and generalization.
Success metrics
Our initial success metrics focus on the format to ensure accurate data generation. For instance, when producing a distribution, it's essential that the sum total equals 100%. As for performance assessment, we have carefully analyzed the evaluation methodologies in state-of-the-art medical solutions [1] to define the most appropriate evaluation metrics.
To evaluate the correctness of the most confident values in the generated distribution, we have selected top-1, top-3, and top-5 accuracy. These benchmarks were established in line with the documented performance of human experts in similar tasks [2, 3]. The threshold values of each metric have also been set according to the results presented in [2, 3].
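The top-k accuracy check over a set of generated distributions can be sketched as follows. The helper function and the toy predictions and ground-truth codes below are invented for illustration, not part of the device's codebase:

```python
def top_k_accuracy(distributions, true_codes, k):
    """Fraction of cases whose true ICD code appears among the k most
    confident categories of the predicted distribution."""
    hits = 0
    for dist, truth in zip(distributions, true_codes):
        # Sort candidate codes by confidence, highest first, keep the top k.
        top_k = sorted(dist, key=dist.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_codes)

# Toy example: two cases, three candidate categories each.
preds = [
    {"ED80": 0.7, "2C30": 0.2, "ED80.41": 0.1},
    {"ED80": 0.5, "2C30": 0.4, "ED80.41": 0.1},
]
truths = ["ED80", "2C30"]
```

Here the second case is missed at k = 1 (the true code is only the second-most confident) but recovered at k = 2, which is exactly why top-3 and top-5 thresholds are more forgiving than top-1.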
In the case of the binary indicators, which inherently possess a continuous nature but could be converted into a binary format using a threshold, we opted for the Area Under the Curve (AUC) metric. This choice allows for a more nuanced and precise measurement. Considering the performance standards set by medical professionals, we have adopted a rigorous AUC threshold of 0.80, which reflects a commitment to achieving high accuracy in our assessments. This threshold has been set for all binary tasks and is consistent with the observed results in the existing literature. Thanks to the thorough revision of Hasan et al. [1], we observed that most works report results using a training/validation/testing approach in which all splits come from the same data source. This is worth mentioning because, when tested on independent image sets, similar devices have recently shown a worse performance [5]. As we observed values ranging between 0.70 [4] and above 0.90 [3], and considering the possible effects of heterogeneous image quality across the skin datasets that comprise our test set, setting the AUC thresholds to 0.80 is a reasonable yet challenging goal.
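The AUC used for these thresholds can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann–Whitney formulation, with ties counting as half). A minimal sketch with invented scores and labels:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive case scores higher
    than a random negative case; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented malignancy scores and ground-truth labels (1 = malignant).
scores = [0.9, 0.3, 0.8, 0.2]
labels = [1, 1, 0, 0]
```

With these toy values, one of the four positive/negative pairs is mis-ranked (0.3 vs. 0.8), giving an AUC of 0.75, below the 0.80 threshold adopted above; perfectly separated scores would give 1.0.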
| Goal | Metric |
|---|---|
| The algorithm outputs a distribution of ICD categories | The output contains the ICD code and confidence value |
| The output distribution is a proportional distribution | The sum of the confidence values adds up to 100% |
| The performance of ICD category distribution matches or exceeds that of comparable devices | top-5 accuracy > 80%; top-3 accuracy > 70%; top-1 accuracy > 55% |
| Malignancy indicator matches or exceeds that of comparable devices | AUC > 0.8 in malignancy quantification |
| Dermatological condition indicator matches or exceeds that of comparable devices | AUC > 0.8 in the quantification of the presence of a dermatological condition |
| Critical complexity indicator matches or exceeds that of comparable devices | AUC > 0.8 in the quantification of the critical complexity |
[1] Hasan, M. K., Ahamad, M. A., Yap, C. H., & Yang, G. (2023). A survey, review, and future trends of skin lesion segmentation and classification. Computers in Biology and Medicine, 106624.
[2] Jain A, Way D, Gupta V, et al. Development and Assessment of an Artificial Intelligence–Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners in Teledermatology Practices. JAMA Netw Open. 2021;4(4):e217249. doi:10.1001/jamanetworkopen.2021.7249.
[3] Han SS, Park I, Eun Chang S, Lim W, Kim MS, Park GH, Chae JB, Huh CH, Na JI. Augmented Intelligence Dermatology: Deep Neural Networks Empower Medical Professionals in Diagnosing Skin Cancer and Predicting Treatment Options for 134 Skin Disorders. J Invest Dermatol. 2020 Sep;140(9):1753-1761. doi: 10.1016/j.jid.2020.01.019. Epub 2020 Mar 31. PMID: 32243882.
[4] Tschandl P, Rosendahl C, Akay BN, Argenziano G, Blum A, Braun RP, Cabo H, Gourhant JY, Kreusch J, Lallas A, Lapins J. Expert-level diagnosis of nonpigmented skin cancer by combined convolutional neural networks. JAMA dermatology. 2019 Jan 1;155(1):58-65.
[5] Daneshjou R, Vodrahalli K, Novoa RA, Jenkins M, Liang W, Rotemberg V, Ko J, Swetter SM, Bailey EE, Gevaert O, Mukherjee P. Disparities in dermatology AI performance on a diverse, curated clinical image set. Science advances. 2022 Aug 12;8(31):eabq6147.
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Tester: JD-017, JD-009, JD-005, JD-004
- Approver: JD-003