R-TF-028-011 AI Risk Assessment
Purpose
This document provides a comprehensive risk assessment specifically for the Artificial Intelligence (AI) components of Legit.Health Plus, in accordance with:
- ISO 14971:2019 – Application of Risk Management to Medical Devices
- MDR 2017/745 – General Safety and Performance Requirements (GSPRs), particularly GSPR 17 (Software with diagnostic or measuring function)
- MDCG 2020-1 – Guidance on Clinical Evaluation of Medical Device Software (MDSW)
- EU AI Act – Regulation laying down harmonised rules on Artificial Intelligence (high-risk AI systems)
- IEC 62304:2006+A1:2015 – Medical Device Software Lifecycle Processes (Class B software)
- GP-028 – Internal procedure for AI Development and Risk Management
This AI Risk Assessment is integrated with and traceable to the overall device Risk Management Plan (R-TF-013-001) and Risk Assessment (R-TF-013-002) per ISO 14971 requirements.
Scope
This assessment covers all AI algorithms within Legit.Health Plus version 1.1.0.0:
Clinical Models (Class B per IEC 62304)
| Model | Function | Performance Threshold |
|---|---|---|
| ICD Category Distribution | Probability distribution across ICD-11 dermatological categories | Top-1 ≥50%, Top-3 ≥60%, Top-5 ≥70% |
| Binary Indicators | Malignant, pre-malignant, associated with malignancy, pigmented lesion, urgent referral (≤48h), high-priority referral (≤2 weeks) | AUC ≥0.80 each |
| Visual Sign Quantification | Erythema, desquamation, induration, pustule, crusting, xerosis, swelling, oozing, excoriation, lichenification | RMAE thresholds per sign |
| Wound Characteristic Assessment | Tissue type classification, wound staging | Balanced accuracy ≥0.70 |
| Surface Segmentation | Lesion/wound area quantification | IoU ≥0.70 |
Non-Clinical Models (Supporting Functions)
| Model | Function | Performance Threshold |
|---|---|---|
| DIQA | Image quality assessment and filtering | Accuracy ≥90%, Pearson r ≥0.80 |
| Domain Validation | Clinical/dermoscopic/non-skin classification | Overall accuracy ≥95% |
| Skin Surface Segmentation | Skin region detection and isolation | IoU ≥0.85 |
| Body Surface Segmentation | Skin segmentation for analysis | IoU ≥0.80 |
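To make the acceptance thresholds above concrete, here is a minimal Python sketch of two of the cited metrics: Top-k accuracy (ICD Category Distribution) and IoU (segmentation models). This is illustrative only, not the validated evaluation code referenced in R-TF-028-005; the function names and array shapes are assumptions for illustration.

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose reference-standard class is among the k
    highest-probability classes (probs: n_samples x n_classes)."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest probabilities
    return float(np.mean([label in row for row, label in zip(top_k, labels)]))

def iou(pred_mask: np.ndarray, ref_mask: np.ndarray) -> float:
    """Intersection over Union between binary segmentation masks."""
    intersection = np.logical_and(pred_mask, ref_mask).sum()
    union = np.logical_or(pred_mask, ref_mask).sum()
    return float(intersection / union) if union > 0 else 1.0  # both masks empty

# Release gating would compare such metrics against the thresholds above, e.g.
# top_k_accuracy(probs, labels, k=3) >= 0.60 and iou(pred, ref) >= 0.70.
```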
Methodology
Risk Identification
AI-specific risks were identified through:
- Regulatory guidance analysis: MDCG 2019-11, FDA Guidance on AI/ML-based SaMD, Health Canada Pre-Market Guidance for MLMD
- Literature review: Published AI failure modes in dermatology and medical imaging
- FMEA approach: Systematic analysis of each AI development stage (data collection → annotation → training → validation → deployment → monitoring)
- Clinical workflow analysis: Potential misuse scenarios and use errors involving AI outputs
- Expert consultation: Input from AI engineers, dermatologists, regulatory affairs, and clinical safety specialists
Risk Estimation
Risks are estimated using the 5×5 Risk Matrix defined in R-TF-013-001 Risk Management Plan:
Severity Scale
| Level | Description | Definition |
|---|---|---|
| 1 | Negligible | No impact on patient health or clinical decision |
| 2 | Minor | Inconvenience or temporary minor impact; fully recoverable |
| 3 | Moderate | Significant impact requiring additional intervention; recoverable |
| 4 | Critical | Serious harm including delayed diagnosis of serious condition |
| 5 | Catastrophic | Death or irreversible serious harm |
Likelihood Scale
| Level | Description | Probability |
|---|---|---|
| 1 | Very low | <1% occurrence rate |
| 2 | Low | 1-5% occurrence rate |
| 3 | Moderate | 5-15% occurrence rate |
| 4 | High | 15-50% occurrence rate |
| 5 | Very high | >50% occurrence rate |
Risk Priority Number (RPN)
RPN = Severity × Likelihood
| RPN Range | Risk Class | Required Action |
|---|---|---|
| 1-4 | Acceptable | Risk acceptable; document and monitor |
| 5-9 | Tolerable | Risk reduction measures recommended; benefit-risk evaluation required |
| 10-25 | Unacceptable | Risk reduction mandatory before release |
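As a worked illustration of this scheme, the sketch below computes the RPN and maps it to the risk classes in the table above; the helper names are illustrative and not part of the device software.

```python
def rpn(severity: int, likelihood: int) -> int:
    """Risk Priority Number per R-TF-013-001: severity (1-5) x likelihood (1-5)."""
    if not (1 <= severity <= 5 and 1 <= likelihood <= 5):
        raise ValueError("severity and likelihood must be on the 1-5 scales")
    return severity * likelihood

def risk_class(value: int) -> str:
    """Map an RPN to the acceptance classes defined above."""
    if value <= 4:
        return "Acceptable"
    if value <= 9:
        return "Tolerable"
    return "Unacceptable"  # 10-25: risk reduction mandatory before release

# Example from the risk table below: AI-RISK-002 starts at 4 x 4 = 16
# (Unacceptable) and is reduced to 4 x 2 = 8 (Tolerable) after risk controls.
assert risk_class(rpn(4, 4)) == "Unacceptable"
assert risk_class(rpn(4, 2)) == "Tolerable"
```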
Risk Control Measures
For each identified risk, control measures follow the priority hierarchy per ISO 14971:
- Inherent safety by design: Algorithm architecture, training data diversity, performance thresholds
- Protective measures: Quality gates (DIQA, Domain Validation), confidence thresholds, multi-model redundancy
- Information for safety: IFU warnings, user training, transparency documentation, API error responses
Traceability
Each AI risk is traced to:
- AI Specifications: R-TF-028-001 AI Description
- Safety Risks: R-TF-013-002 Risk Assessment (where applicable)
- Clinical Validation: R-TF-015-001 Clinical Evaluation Plan
- Post-Market Surveillance: GP-007 Post-Market Surveillance
Risk Assessment Table
| ID | Issue type | Issue key | Summary | Root cause / Sequence of events | Consequences | AI Specifications Originating the Risk | Initial severity | Initial likelihood | Initial RPN | Initial risk class | Risk control measures | Residual severity | Residual likelihood | Residual RPN | Residual risk class | Transfer to Safety Risks | Relevant for Safety (Justification) | Safety Risk IDs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AI Risk | AI-RISK-001 | Dataset Not Representative of Intended Use Population | Collected dermatological images do not adequately represent the diversity of the target population (Fitzpatrick skin types I-VI, anatomical locations, ICD-11 conditions, demographics) specified in the intended use, leading to models that fail to generalize across all patient populations. | AI models underperform on underrepresented patient subgroups (particularly darker Fitzpatrick skin types V-VI, rare ICD-11 conditions, specific anatomical sites), leading to diagnostic errors, incorrect visual sign severity assessments, and health inequities in clinical use. | R-TF-028-001 Sections 'ICD Category Distribution' (requires Top-k accuracy across diverse populations), 'Visual Sign Quantification' (validation across Fitzpatrick I-VI), 'Data Specifications' (requires representative data across demographics). All clinical models specify validation requirements across diverse populations per R-TF-028-003. | Critical (4) | Moderate (3) | 12 | Tolerable | Multi-source data collection strategy: prospective hospital data from dermatology departments + retrospective atlas data + independent evaluation hold-out sets (R-TF-028-003). Documented demographics of collected datasets including Fitzpatrick skin type distribution (I-VI), age, gender, anatomical sites, and condition prevalence across ICD-11 categories (R-TF-028-005 Development Report). Stratified sampling to ensure balanced representation across critical demographic and clinical variables. Bias analysis and fairness evaluation across Fitzpatrick skin types with minimum performance thresholds enforced per subgroup. Independent evaluation on sequestered hold-out test sets with documented population characteristics (R-TF-028-002 Development Plan). Performance reporting stratified by Fitzpatrick skin type, anatomical site, and ICD-11 category prevalence | Critical (4) | Very low (1) | 4 | Acceptable | YES | Non-representative training data leads to model bias and incorrect diagnostic outputs for underrepresented populations (particularly Fitzpatrick skin types V-VI), potentially causing misdiagnosis per ISO 14971 harm assessment, MDR GSPR 22 (non-discrimination), and EU AI Act fairness requirements. | R-SKK, R-7US, R-GY6 |
| 2 | AI Risk | AI-RISK-002 | Data Annotation Errors by Expert Dermatologists | Expert dermatologists provide incorrect or inconsistent reference standard labels for ICD-11 categories, binary indicators (malignancy, urgent referral), visual severity signs (erythema, desquamation, induration intensity 0-9), wound characteristics (22 binary classifiers), or other annotations due to inter-observer variability, lack of clear annotation guidelines, or annotation fatigue. | Models are trained and validated on an erroneous or inconsistent reference standard, resulting in unreliable ICD probability distributions, incorrect visual sign severity quantification, and misleading binary indicators that could impact clinical triage and patient care. | R-TF-028-001: All clinical models require expert dermatologist annotations—ICD Category Distribution, Visual Sign Quantification (10 ordinal categories per sign), Wound Assessment (staging 0-4, intensity 0-19, 22 binary characteristics). R-TF-028-004 Data Annotation Instructions specify protocols for ICD-11 mapping, visual signs, and binary indicators. | Critical (4) | High (4) | 16 | Unacceptable | All annotations performed exclusively by board-certified dermatologists with demonstrated expertise in relevant subspecialties. Comprehensive annotator training and calibration sessions using reference image sets with a known reference standard. Detailed, reproducible annotation instructions documented in R-TF-028-004 with visual examples, severity anchor images, and edge case guidance for all clinical models. Multi-expert annotation with consensus or adjudication protocols: minimum 3 dermatologists for ICD reference standard, senior reviewer for discrepancies. Inter-rater agreement assessment (Cohen's κ for categorical, ICC for ordinal) documented in R-TF-028-005 with minimum thresholds (κ ≥ 0.60 for ICD, ICC ≥ 0.70 for severity). Histopathological correlation for reference standard determination where clinically appropriate (malignancy, ambiguous diagnoses). Automated outlier detection via cross-validation to identify potentially erroneous annotations for re-review. Regular annotation quality audits and re-calibration sessions during data collection phases | Critical (4) | Low (2) | 8 | Tolerable | YES | Annotation errors in reference standard labels directly impact model training quality, leading to unreliable ICD probability distributions, incorrect severity scores, and misleading binary indicators affecting patient triage per ISO 14971 requirements. | R-SKK, R-GY6 |
| 3 | AI Risk | AI-RISK-003 | Inadequate Model Evaluation Metrics or Test Data | AI models are evaluated using inappropriate metrics, insufficient test data, or non-independent datasets, resulting in performance estimates that do not reflect real-world clinical performance across ICD-11 categories, 10 visual sign intensity models, and wound assessment algorithms. | Deployed models perform worse than validated metrics suggest in clinical use, leading to incorrect Top-k ICD suggestions, severity misclassification (erythema, desquamation, induration RMAE exceeds thresholds), inappropriate binary indicator outputs, and loss of clinician trust. | R-TF-028-001: Performance endpoints per algorithm type—ICD Category Distribution (Top-1 ≥50%, Top-3 ≥60%, Top-5 ≥70%), Binary Indicators (AUC ≥0.80), Visual Signs (RMAE ≤14-36% depending on sign), Wound Assessment (RMAE ≤10% staging, ≤24% intensity, BA ≥50-55% characteristics). All with 95% CI requirements. | Critical (4) | Moderate (3) | 12 | Tolerable | Detailed evaluation reports for each algorithm documenting all specified metrics per R-TF-028-001 thresholds (R-TF-028-005 Development Report). Each algorithm performance objective covered by dedicated evaluation against independent test sets with appropriate sample size calculations. Strict sequestration of test data from training and validation sets—held-out, used only once for final unbiased evaluation (R-TF-028-002 Development Plan). Evaluation code unit tested and version controlled to ensure metric calculation correctness. Performance reported with 95% confidence intervals for all metrics as required by R-TF-028-001. Stratified evaluation across critical subgroups: Fitzpatrick skin types I-VI, anatomical sites, severity levels, ICD-11 category prevalence. Comparison to expert dermatologist performance baselines (inter-observer variability) documented in literature and validation studies. Statistical validation confirming model meets or exceeds all specified thresholds before deployment authorization | Critical (4) | Very low (1) | 4 | Acceptable | YES | Inadequate evaluation metrics or test data can result in deployed models performing worse than expected in clinical use, leading to incorrect diagnoses and potential patient harm per ISO 14971 and MDR clinical evaluation requirements (MDCG 2020-1). | R-SKK, R-VL1 |
| 4 | AI Risk | AI-RISK-004 | Suboptimal Model Architecture or Hyperparameters | Selected deep learning architectures (Vision Transformers for ICD classification, EfficientNet for DIQA, encoder-decoder for segmentation, CNNs for intensity quantification) or hyperparameters (learning rate, regularization, loss functions, temperature scaling) are inappropriate for dermatological image analysis tasks, leading to poor convergence, overfitting, or underfitting. | Models fail to meet specified performance thresholds—ICD Top-k accuracy below 50%/60%/70%, visual sign RMAE above thresholds (14-36%), binary indicator AUC below 0.80—resulting in clinically inadequate outputs that cannot support healthcare professionals in patient assessment. | R-TF-028-001: Specifies architectures (Vision Transformer for ICD, EfficientNet for DIQA, encoder-decoder for segmentation) and performance endpoints. R-TF-028-002 Development Plan: training methodology, hyperparameter optimization, model calibration (temperature scaling). | Critical (4) | Moderate (3) | 12 | Tolerable | Systematic hyperparameter optimization studies (Bayesian optimization, grid search) documented in R-TF-028-005 Development Report. Evaluation of multiple state-of-the-art architectures for each task: Vision Transformers, ConvNeXt, EfficientNet variants, hybrid approaches. Transfer learning from large-scale pretrained models (ImageNet, dermatological datasets) to improve convergence and performance. Validation on independent datasets to assess generalization before architecture selection. Regularization techniques (dropout, weight decay, data augmentation) to prevent overfitting per R-TF-028-002. Early stopping and learning rate scheduling based on validation performance with TensorBoard monitoring. Model calibration using temperature scaling to ensure output probabilities are reliable (calibration curves documented). Ablation studies to validate architectural choices and component contributions. Peer review of model architecture and training protocols by AI/ML specialists | Critical (4) | Very low (1) | 4 | Acceptable | YES | Suboptimal architecture or hyperparameters lead to models failing to meet performance thresholds, resulting in clinically inadequate ICD distributions and severity scores that cannot support healthcare professionals per ISO 14971 and EU AI Act requirements for AI system performance. | R-SKK, R-VL1 |
| 5 | AI Risk | AI-RISK-005 | Cybersecurity: Model Extraction or Adversarial Input Attacks | Malicious actors attempt to extract proprietary model weights/architecture through API probing, or craft adversarial inputs (modified images) designed to cause incorrect ICD classifications, false binary indicator outputs, or erroneous severity predictions, compromising model security and clinical reliability. | Model intellectual property is stolen enabling unauthorized use, or adversarial attacks cause systematic misdiagnoses (false negatives for malignancy), incorrect severity scores, or safety-critical failures in clinical deployment. | R-TF-028-001 Section 'Cybersecurity and Transparency': models deployed within Legit.Health Plus via REST API, static models (no continuous learning), input validation via DIQA and Domain Validation non-clinical models. R-TF-Device Description: API-based deployment with authentication. | Critical (4) | Moderate (3) | 12 | Tolerable | REST API with mandatory API key authentication limits unauthorized access (documented in R-TF-Device Description). Static models (no continuous learning or online updates) prevent training-time poisoning attacks. Input validation via multi-stage quality gates: Domain Validation model rejects non-skin images, DIQA model filters poor-quality inputs. Model weights encrypted in deployment package; inference performed server-side without exposing model architecture. Rate limiting and monitoring of API requests to detect probing or extraction attempts. Adversarial robustness testing during validation phase including perturbed image inputs. No direct model access provided to users—all inference via controlled API endpoints. Security review and penetration testing of deployment architecture per IEC 81001-5-1. Incident response plan for suspected adversarial attacks or security breaches | Critical (4) | Very low (1) | 4 | Acceptable | YES | AI-specific cybersecurity threats (model extraction, adversarial attacks) can cause systematic misdiagnoses (false negative malignancy indicators) and safety-critical failures per ISO 14971 and IEC 81001-5-1 cybersecurity requirements for AI-enabled medical devices. | R-SKK, R-VL1 |
| 6 | AI Risk | AI-RISK-006 | Bias and Fairness: Disparate Performance Across Fitzpatrick Skin Types | AI models exhibit significantly degraded performance on darker skin types (Fitzpatrick V-VI) compared to lighter skin types (I-III) due to dataset imbalance, lighting artifacts in darker skin photography, or algorithm design biases, perpetuating health inequities in dermatological AI. | Patients with Fitzpatrick V-VI skin receive inaccurate ICD probability distributions (Top-k accuracy drops), incorrect severity assessments (erythema RMAE increases significantly on dark skin), and unreliable binary indicators, leading to delayed or inappropriate treatment and exacerbating healthcare disparities. | R-TF-028-001: All clinical models (ICD, Visual Signs, Binary Indicators) require validation across Fitzpatrick skin types I-VI with documented performance per subgroup. | Critical (4) | High (4) | 16 | Unacceptable | Prospective data collection from diverse clinical sites ensures representation across all Fitzpatrick types I-VI with documented distribution targets. Stratified performance evaluation with metrics (Top-k, RMAE, AUC) reported separately for each Fitzpatrick type in R-TF-028-005 Development Report. Bias analysis and fairness audits using performance parity metrics (equalized odds, calibration across groups) conducted during development. Data augmentation techniques designed to preserve skin tone characteristics while increasing dataset diversity. Balanced training strategies (oversampling, loss weighting) to prevent model bias toward overrepresented Fitzpatrick types I-III. Minimum performance thresholds enforced for each Fitzpatrick type subgroup (no subgroup >20% below overall performance). Post-market surveillance includes stratified performance monitoring by Fitzpatrick skin type with alert thresholds. Clinical validation studies include diverse patient populations across all Fitzpatrick types (documented in R-TF-015-001 Clinical Evaluation Plan) | Critical (4) | Low (2) | 8 | Tolerable | YES | Disparate AI performance across Fitzpatrick skin types leads to health inequities and inaccurate diagnoses for patients with darker skin (Fitzpatrick V-VI), directly impacting patient safety per ISO 14971, MDR GSPR 22 (non-discrimination), and EU AI Act Article 10 (bias prevention). | R-SKK, R-7US, R-GY6 |
| 7 | AI Risk | AI-RISK-007 | Model Training Failures: Overfitting or Underfitting | Models overfit to training data (memorizing rather than generalizing) or underfit (failing to learn relevant patterns), resulting in poor performance on new patient images during clinical deployment. | Overfitting leads to excellent training performance but poor real-world generalization. Underfitting results in consistently poor performance. Both compromise clinical utility and patient safety. | R-TF-028-001: All models specify performance thresholds on independent test sets. Development Plan (R-TF-028-002) defines training procedures and validation protocols. | Critical (4) | High (4) | 16 | Unacceptable | Training monitored using TensorBoard with validation metrics tracked throughout. Early stopping based on validation set performance prevents overfitting. Comprehensive regularization techniques: dropout, weight decay, batch normalization, data augmentation. Stratified train/validation/test splits ensure representative evaluation at all stages. K-fold cross-validation during hyperparameter tuning for robust parameter selection. Learning curve analysis to diagnose overfitting/underfitting and guide corrective actions. Independent test set performance as final acceptance criterion (never used during development). Ablation studies validate that model complexity matches task difficulty. Training logs and model checkpoints version controlled for traceability and reproducibility | Critical (4) | Very low (1) | 4 | Acceptable | YES | Overfitting or underfitting results in poor real-world model performance and generalization failures, compromising clinical utility and patient safety per ISO 14971 performance requirements. | R-SKK, R-VL1, R-75L |
| 8 | AI Risk | AI-RISK-008 | Model Deployment Failures: Development vs. Deployed Performance Mismatch | Models converted for deployment (e.g., to TensorFlow Lite, ONNX, or mobile-optimized formats) exhibit different numerical outputs or degraded performance compared to development versions due to conversion errors, precision loss, or implementation bugs. | Deployed models provide inaccurate predictions despite successful validation during development, leading to unreliable clinical outputs and potential patient harm. | R-TF-028-001 Section 'Other Specifications': deployment conversion validated by prediction equivalence testing. R-TF-028-006 AI/ML Release Report documents deployment validation. | Critical (4) | Moderate (3) | 12 | Tolerable | Models deployed using validated frameworks compatible with development environment (e.g., TensorFlow → TensorFlow Lite). Numerical equivalence testing (sketched after this table): deployed models compared against development models on identical test inputs. Integration tests verify end-to-end pipeline produces expected outputs on reference images. Quantization and optimization validated to ensure accuracy degradation within acceptable bounds (typically <1%). Visual inspection and statistical comparison of outputs from both development and deployed models. Version control and traceability from development models to deployed artifacts (R-TF-028-006 Release Report). Automated regression testing suite runs with each model deployment. Clinical validation performed on final deployed models in target deployment environment | Critical (4) | Very low (1) | 4 | Acceptable | YES | Deployment conversion errors (precision loss, implementation bugs) cause deployed models to exhibit different outputs than validated versions, leading to unreliable clinical results per ISO 14971 and IEC 62304 deployment validation requirements. | R-SKK, R-VL1 |
| 9 | AI Risk | AI-RISK-009 | Data Preprocessing Errors Destroying Clinically Relevant Information | Image preprocessing operations (resizing, normalization, augmentation) inadvertently remove or alter clinically important features (erythema intensity, lesion boundaries, texture patterns) critical for accurate model predictions. | Models fail to learn or detect relevant dermatological features, resulting in poor diagnostic accuracy, incorrect severity assessment, and unreliable clinical outputs. | R-TF-028-001: All image-based models require preservation of clinical features (erythema, desquamation, lesion morphology). Development methodology includes preprocessing pipeline definition. | Critical (4) | Moderate (3) | 12 | Tolerable | Multiple preprocessing strategies tested during development with ablation studies. Visual inspection of preprocessed images by dermatologists to confirm feature preservation. Preprocessing pipelines designed to preserve color accuracy (critical for erythema, pigmentation assessment). Augmentation strategies validated to maintain clinical realism (e.g., brightness/contrast adjustments within physiological ranges). Augmentation parameters constrained to prevent unrealistic transformations (e.g., no extreme rotations that violate anatomical constraints). Preprocessing code unit tested with reference images and known expected outputs. Documentation of preprocessing rationale and clinical impact assessment (R-TF-028-005 Development Report). Expert dermatologist review of augmentation examples to ensure clinical validity | Critical (4) | Very low (1) | 4 | Acceptable | YES | Preprocessing errors that remove clinically relevant features (erythema intensity, lesion boundaries) lead to poor diagnostic accuracy and unreliable severity assessments per ISO 14971 harm analysis. | R-SKK, R-GY6 |
| 10 | AI Risk | AI-RISK-010 | Incorrect Model Integration: Pre/Post-Processing Implementation Errors | Models are integrated into Legit.Health Plus software with incorrect pre-processing (wrong image normalization, incorrect color space conversion) or post-processing (incorrect weighted expected value calculation for severity scores, incorrect binary indicator mapping matrix application) due to implementation bugs or documentation errors. | Models produce incorrect outputs despite being correctly trained and validated—wrong ICD probability distributions, incorrect visual sign severity values, incorrect binary indicator values—leading to clinical decisions based on mathematically incorrect computations. | R-TF-028-001: Specifies post-processing formulas—weighted expected value ŷ = Σ(i × pᵢ) for visual signs, binary indicator mapping matrix M_ij (both sketched after this table). R-TF-028-006 AI Release Report documents integration specifications. | Critical (4) | Moderate (3) | 12 | Tolerable | Detailed integration specifications documented in R-TF-028-006 AI Release Report including exact mathematical formulas, input/output formats, and reference implementations. Unit tests for all pre-processing functions (normalization, resizing, color space) with known input-output pairs. Unit tests for all post-processing functions (weighted expected value, mapping matrix) verifying mathematical correctness. Integration tests comparing deployed software implementation against validated Python reference implementation on identical inputs. End-to-end validation using reference images with known expected outputs from development environment. Code review of integration code by both AI/ML team and software engineering team per GP-012 Design and Development. Regression testing suite executed with each software build ensuring no drift from expected outputs. Clinical validation performed on final integrated system (not just standalone models) per R-TF-015-001 Clinical Evaluation Plan. Traceability from model specifications (R-TF-028-001) through implementation to validation results (R-TF-012-043 Traceability Matrix) | Critical (4) | Very low (1) | 4 | Acceptable | YES | Incorrect pre/post-processing implementation causes models to produce erroneous outputs despite correct training—incorrect ICD probabilities, wrong visual sign values—leading to clinical decisions based on mathematically incorrect computations per ISO 14971 and IEC 62304. | R-SKK, R-VL1 |
| 11 | AI Risk | AI-RISK-011 | Insufficient Dataset Size for Model Complexity | Collected dataset is too small to support training of deep learning models with millions of parameters, particularly for rare conditions or underrepresented categories, resulting in poor generalization. | Models perform poorly on rare conditions, minority skin types, or underrepresented anatomical sites, leading to systematic diagnostic failures for specific patient populations. | R-TF-028-001 Section 'Data Specifications': requires large-scale data collection ([NUMBER OF IMAGES] dermatological images) with diversity across conditions, skin types, and anatomical sites. Multiple data sources specified. | Critical (4) | Moderate (3) | 12 | Tolerable | Multi-source data collection strategy: prospective hospital data + retrospective atlas data + evaluation hold-out sets. Targeted data collection for rare conditions and underrepresented categories through specialized sources. Data augmentation to increase effective training set size while preserving clinical realism. Transfer learning from large-scale pretrained models (ImageNet, dermatological datasets) to reduce data requirements. Sample size calculations and power analysis to determine adequate dataset sizes per category. Learning curve analysis to validate that additional data would not significantly improve performance. Documentation of dataset size and composition in R-TF-028-005 Development Report. Performance evaluation stratified by category prevalence to identify underperforming rare classes. Minimum sample size thresholds per ICD category, severity level, and demographic subgroup | Critical (4) | Very low (1) | 4 | Acceptable | YES | Insufficient dataset size leads to poor model generalization on rare conditions and underrepresented categories, causing systematic diagnostic failures for specific patient populations per ISO 14971 and MDR requirements. | R-SKK, R-7US, R-GY6 |
| 12 | AI Risk | AI-RISK-012 | Data Collection Protocol Failures | Images collected during prospective data collection fail to meet quality standards, lack required metadata, or do not follow standardized imaging protocols, resulting in unusable or low-quality training data. | Insufficient high-quality data for model development, leading to delayed development timelines or models trained on poor-quality data with suboptimal performance. | R-TF-028-001 Section 'Data Specifications': prospective and retrospective data collection with quality requirements. R-TF-028-003 Data Collection Instructions define protocols. | Moderate (3) | Moderate (3) | 9 | Tolerable | Comprehensive data collection protocols documented in R-TF-028-003 with clear imaging standards, metadata requirements, and quality criteria. Training and certification of data collection personnel (photographers, clinicians). Real-time quality checks during data collection with immediate feedback for non-compliant images. Standardized imaging equipment and settings specified in protocols. Metadata validation at point of collection to ensure completeness. Regular audits of collected data quality by AI/ML team with feedback to collection sites. DIQA (Image Quality Assessment) model applied to prospectively collected images to identify quality issues early. Iterative protocol refinement based on initial data collection experiences. Multiple data collection sites to ensure robustness to site-specific variations | Moderate (3) | Low (2) | 6 | Acceptable | YES | Data collection protocol failures result in insufficient high-quality data for model development, potentially leading to models trained on poor-quality data with suboptimal performance per ISO 14971. | R-GY6, R-VL1 |
| 13 | AI Risk | AI-RISK-013 | Cybersecurity: Data Breach During Development | Patient images and associated clinical data collected for AI development are accessed by unauthorized parties due to inadequate data security controls during collection, storage, or processing. | Patient privacy violations, regulatory non-compliance (GDPR, HIPAA), loss of patient trust, and potential legal/financial consequences for the organization. | R-TF-028-001 Section 'Cybersecurity and Transparency': data de-identified/pseudonymized, research server restricted access, secure segregation required. | Critical (4) | Moderate (3) | 12 | Tolerable | All patient data de-identified/pseudonymized before transfer to AI development environment. Research servers with restricted access controls (authentication, authorization, role-based access). Data encryption at rest and in transit (SSL/TLS for transfers, encrypted storage). Network segregation isolating research data environment from public networks. Access logging and monitoring with regular security audits. Data processing agreements with all data sources and collaborators. Regular security training for all personnel with data access. Incident response plan for potential data breaches. Compliance with GDPR, HIPAA, and applicable data protection regulations. Data retention and deletion policies to minimize exposure window | Critical (4) | Very low (1) | 4 | Acceptable | NO | Data breach during development is primarily a privacy/regulatory compliance risk rather than a direct patient safety risk. While serious, it does not directly impact model performance or diagnostic accuracy per ISO 14971 harm categories. | - |
| 14 | AI Risk | AI-RISK-014 | Poor Data Quality: Non-Diagnostic Images in Training Set | Training dataset contains significant proportion of poor-quality images (blurry, poorly lit, obstructed, wrong anatomical site) that were not filtered out during quality control, degrading model learning. | Models learn from low-quality examples, reducing accuracy and potentially learning to accept poor-quality inputs that should be rejected, compromising clinical reliability. | R-TF-028-001: DIQA (Image Quality Assessment) non-clinical model filters poor-quality images. Data quality requirements in Data Specifications section. | Critical (4) | Moderate (3) | 12 | Tolerable | DIQA (Dermatology Image Quality Assessment) model automatically filters images below quality threshold (score ≥6 for clinical use). All images used for training/evaluation reviewed by expert dermatologists during annotation (quality confirmed during labeling). Multi-stage quality checks by AI/ML team: automated quality metrics, visual inspection, outlier detection. Quality criteria defined in data collection protocols (R-TF-028-003). Statistical analysis of dataset quality distributions documented in R-TF-028-005 Development Report. Quality-based stratified sampling ensures training set contains only diagnostic-quality images. Separate evaluation of model performance on varying quality levels to assess robustness. Ongoing quality monitoring during data collection with feedback loops to improve collection procedures | Critical (4) | Very low (1) | 4 | Acceptable | YES | Poor-quality non-diagnostic images in training set degrade model learning and can cause models to accept poor-quality inputs that should be rejected, compromising clinical reliability per ISO 14971. | R-SKK, R-GY6 |
| 15 | AI Risk | AI-RISK-015 | Inadequate Development Environment and Infrastructure | AI development environment lacks sufficient computational resources (GPUs), appropriate software libraries, or version control, leading to inefficient development, irreproducible results, or technical failures. | Development delays, inability to train complex models effectively, poor model performance due to resource constraints, or inability to reproduce and validate results. | R-TF-028-001 Section 'Other Specifications': fixed hardware/software stack required with version tracking. Development methodology requires reproducible environment. | Moderate (3) | Low (2) | 6 | Acceptable | Dedicated GPU-enabled workstations and cloud compute infrastructure for AI/ML development. State-of-the-art deep learning frameworks (TensorFlow, PyTorch) with version pinning. Containerized development environment (Docker) ensuring reproducibility and consistency across team. Version control for all code, models, and configurations (Git). Dependency management with requirements.txt or conda environment files shared across team. Documented software stack versions in development reports (R-TF-028-005). Regular infrastructure updates and maintenance. Backup and disaster recovery procedures for development data and model checkpoints. Continuous integration/continuous deployment (CI/CD) pipelines for automated testing | Moderate (3) | Very low (1) | 3 | Acceptable | NO | Inadequate development infrastructure primarily affects development efficiency and timelines rather than direct patient safety. Poor reproducibility is addressed through process controls per ISO 13485. | - |
| 16 | AI Risk | AI-RISK-016 | Model Robustness Failures: Sensitivity to Image Acquisition Variability | Models are brittle to natural variations in imaging conditions (lighting, camera angle, distance, device type, background) commonly encountered in clinical practice, leading to inconsistent predictions. | Model performance degrades significantly when images are captured under non-ideal conditions, limiting clinical utility and potentially causing diagnostic errors in real-world use. | R-TF-028-001 Section 'Integration and Environment': models must handle variability in acquisition. All models specify validation across diverse imaging conditions and devices. | Critical (4) | Moderate (3) | 12 | Tolerable | Training data includes diverse imaging conditions (multiple devices, lighting, angles) from prospective clinical collection. Data augmentation simulating realistic imaging variations (brightness, contrast, rotation, noise). Validation on images from multiple acquisition sources and device types. Color normalization and preprocessing techniques to improve robustness to lighting variations. DIQA model provides quality gate rejecting images outside acceptable acquisition parameter ranges. Performance evaluation stratified by imaging device, lighting condition, and other technical factors. Clinical validation studies using images from target deployment environments and device types. User guidance and training on optimal image acquisition practices to minimize extreme variability. Robustness testing with intentionally varied imaging conditions during validation | Critical (4) | Low (2) | 8 | Tolerable | YES | Model brittleness to imaging variability (lighting, camera angle, device type) leads to inconsistent predictions and diagnostic errors in real-world clinical conditions per ISO 14971 and MDR GSPR requirements. | R-SKK, R-VL1 |
| 17 | AI Risk | AI-RISK-017 | Lack of Transparency: Users Unaware of AI/ML Usage and Limitations | Healthcare professionals integrating Legit.Health Plus via API are not adequately informed that AI/ML algorithms generate the ICD probability distributions, severity scores, and binary indicators; do not understand model limitations (performance thresholds, validation populations, edge cases); or are unaware that outputs require clinical interpretation and cannot replace diagnostic judgment. | Over-reliance on AI outputs without critical clinical evaluation, misuse of device outside validated intended use (e.g., using for conditions outside ICD-11 categories), automation bias leading to failure to recognize AI errors, or inappropriate clinical decisions based on misunderstood output semantics. | R-TF-028-001 Section 'Cybersecurity and Transparency': Documentation must clearly state algorithm purpose, inputs/outputs, performance metrics, limitations, and that AI/ML models generate clinical outputs. R-TF-Device Description: Intended use specifies decision support role. EU AI Act Article 13 transparency requirements. | Moderate (3) | Moderate (3) | 9 | Tolerable | IFU (Instructions for Use) clearly states that AI/ML deep learning algorithms generate: ICD-11 probability distributions, binary indicators (malignancy, urgency), visual sign severity scores, wound assessments. Intended use statement in technical documentation and IFU specifies device provides 'quantitative data on clinical signs and interpretative distribution of ICD categories to support (not replace) healthcare professional assessment'. Model performance metrics documented in IFU: Top-k accuracy for ICD, RMAE for severity signs, AUC for binary indicators—with validation population descriptions and 95% CI. Model limitations documented: conditions outside ICD-11 categories not validated, performance varies by Fitzpatrick skin type, image quality affects accuracy. API documentation includes clear labeling of AI-generated outputs in response schema with metadata indicating AI provenance. Training materials for healthcare system integrators emphasize clinical interpretation responsibility and AI decision support role. Contraindications documented: not for use as sole diagnostic method, biopsy required for suspected malignancy regardless of AI output. Warnings about known edge cases: rare conditions, pediatric populations, unusual presentations. Post-market surveillance via GP-007 includes user feedback on understanding and appropriate use of AI features. Compliance with EU AI Act Article 13 transparency requirements for high-risk AI systems | Moderate (3) | Low (2) | 6 | Acceptable | YES | Lack of AI/ML transparency leads to over-reliance on AI outputs without critical evaluation, automation bias, and failure to recognize situations requiring clinical judgment per ISO 14971 use error analysis, IEC 62366-1 usability requirements, and EU AI Act Article 13 transparency obligations. | R-SKK |
| 18 | AI Risk | AI-RISK-018 | Model Retraining Failures: Performance Degradation After Update | When models are retrained with new data or updated algorithms, the retrained models perform worse than original validated models due to insufficient data, improper retraining procedures, or inadequate validation. | Device update introduces models with degraded performance, leading to increased diagnostic errors and compromised patient safety compared to previous version. | R-TF-028-001: Each model has specified performance thresholds. Retraining must maintain or improve performance. Update procedures referenced in risk management section. | Critical (4) | Moderate (3) | 12 | Tolerable | Retrained models follow identical development and validation procedures as original models (same protocols, metrics, thresholds). Retrained models evaluated on same independent test sets as original models for direct comparison. Acceptance criteria: retrained models must meet all original performance thresholds (non-inferiority) or demonstrate statistically significant improvement. Regression testing ensures retrained models do not introduce new failure modes. Clinical validation repeated for models with substantial architectural or data changes. Version control and traceability from retraining data through validation to deployment. Risk-benefit analysis for model updates considering potential performance changes. Predefined Change Control Plan (PCCP) specifies when model retraining is required and validation procedures. Regulatory notification/approval processes followed for significant model changes per MDR/RDC requirements | Critical (4) | Very low (1) | 4 | Acceptable | YES | Model retraining failures introduce performance degradation compared to validated versions, leading to increased diagnostic errors and compromised patient safety per ISO 14971 and MDR change control requirements. | R-SKK, R-VL1, R-75L |
| 19 | AI Risk | AI-RISK-019 | Inappropriate Update Triggers: Unnecessary Model Changes | Models are updated or retrained in response to non-critical triggers (minor performance variations, small data additions) causing unnecessary regulatory burden and introducing update-related risks without meaningful benefit. | Resources wasted on unnecessary updates, increased risk of introducing errors during update process, regulatory compliance burden, and potential device downtime during updates. | R-TF-028-001 Section 'Specifications and Risks': risks linked to AI/ML Risk Matrix. Update criteria need clear definition to prevent unnecessary changes. | Minor (2) | Moderate (3) | 6 | Acceptable | Predefined Change Control Plan (PCCP) clearly enumerates specific triggers requiring model updates (e.g., safety issues, significant performance drift, regulatory requirements, intended use expansion). Update decision criteria based on quantitative thresholds (e.g., >10% performance degradation, statistically significant subgroup disparities). Risk-benefit analysis required before initiating model update process. Regular scheduled reviews of model performance and update need (e.g., annual). Post-market surveillance data analyzed systematically to identify genuine update needs. Distinction between critical updates (safety-related, requiring immediate action) and non-critical improvements (can be batched). Documentation of update decision rationale in technical files. Stakeholder review (clinical, regulatory, technical) before committing to update process | Minor (2) | Very low (1) | 2 | Acceptable | NO | Unnecessary model updates are primarily a regulatory/operational burden risk. While update process may introduce errors, risk controls in update procedures address safety concerns per ISO 14971. | - |
| 20 | AI Risk | AI-RISK-020 | Model Obsolescence: Dataset No Longer Representative or Technology Outdated | Over time, patient population characteristics shift, new dermatological conditions emerge, imaging technology evolves, or AI algorithms advance, leaving current models obsolete and underperforming relative to the state of the art. | Gradual degradation of model performance, reduced diagnostic accuracy, and suboptimal patient care as clinical practice and patient demographics evolve beyond model training data. | R-TF-028-001: Models trained on current dermatological conditions and imaging modalities. Post-market surveillance mentioned in risk mitigation section. Technology watch needed for AI advancement. | Moderate (3) | Moderate (3) | 9 | Tolerable | Post-market surveillance system monitors model performance over time with real-world usage data (per SOP-24 or equivalent). Regular literature review and technology watch for AI/ML advancements in dermatology. Performance trending analysis identifies gradual degradation before clinical impact. Periodic re-validation on contemporary patient populations to assess ongoing performance. Dermatological conditions monitored for epidemiological changes or emerging conditions. User feedback and complaint systems capture performance concerns from clinical users. Scheduled review cycles (e.g., every 2-3 years) assess need for model updates based on technological advancement. PCCP includes obsolescence assessment criteria and triggers for major model updates. Modular architecture facilitates targeted model updates without complete system redesign. Training data includes diverse conditions and imaging modalities to maximize longevity | Moderate (3) | Low (2) | 6 | Acceptable | YES | Model obsolescence leads to gradual performance degradation as patient populations and technology evolve, reducing diagnostic accuracy and leading to suboptimal patient care per ISO 14971 post-market surveillance requirements. | R-SKK, R-VL1, R-75L |
| 21 | AI Risk | AI-RISK-021 | Usability Issues: Model Outputs Not Interpretable by Clinical Users | AI model outputs (ICD probabilities, severity scores, binary indicators, wound assessments) are presented in a format that is confusing, difficult to interpret, or lacks clinical context, preventing effective use by healthcare professionals. | Clinicians unable to effectively utilize AI outputs for patient care, leading to device abandonment, misinterpretation of results, or incorrect clinical decisions based on misunderstood outputs. | R-TF-028-001: Each model outputs structured clinical information (probabilities, scores, classifications). Usability not explicitly detailed but critical for intended use fulfillment. | Moderate (3) | Moderate (3) | 9 | Tolerable | User interface design following clinical workflow and medical device usability principles (IEC 62366-1). Formative usability studies during development to iteratively refine output presentation. Summative usability validation (human factors testing) with representative users (dermatologists, primary care physicians, specialists). Clinical outputs accompanied by clear explanations and clinical context (e.g., ICD codes with disease names, severity scores with severity categories). Visual aids (charts, graphs, color coding) to enhance interpretation of quantitative outputs. Confidence indicators or uncertainty visualization to support clinical judgment. User training materials and documentation explain interpretation of all AI outputs. Clinical advisory board review of user interface and output presentation. Post-market feedback collection on usability and output interpretability. Iterative design improvements based on real-world user experience | Moderate (3) | Low (2) | 6 | Acceptable | YES | Non-interpretable AI outputs prevent effective use by healthcare professionals, potentially leading to device abandonment, misinterpretation, or incorrect clinical decisions per ISO 14971 and IEC 62366-1 usability requirements. | R-SKK |
| 22 | AI Risk | AI-RISK-022 | Clinical Model Failure: ICD Category Misclassification Leading to Incorrect Diagnosis Suggestion | ICD Category Distribution model assigns high probability to incorrect disease category among ICD-11 classes, potentially misleading clinician toward wrong diagnosis, particularly for visually similar conditions (e.g., melanoma vs. seborrheic keratosis, psoriasis vs. eczema) or rare diseases with limited training data. | Delayed correct diagnosis, inappropriate treatment initiation, or failure to recognize serious conditions requiring urgent intervention—most critically, melanoma misclassified as benign nevus leading to delayed cancer diagnosis and potential metastasis. | R-TF-028-001 Section 'ICD Category Distribution': Top-1 accuracy ≥50%, Top-3 accuracy ≥60%, Top-5 accuracy ≥70% (validated with 95% CI). Binary indicators (malignant, pre-malignant, urgent referral, high-priority referral with AUC ≥0.80) provide independent safety layer. Intended use: interpretative distribution to support (not replace) healthcare professional judgment. | Critical (4) | Moderate (3) | 12 | Tolerable | Top-5 ICD suggestions presented (not just top-1) to support differential diagnosis—ensures correct diagnosis typically in shortlist even if not ranked first. Six binary indicators (malignant, pre-malignant, associated with malignancy, pigmented lesion, urgent referral ≤48h, high-priority referral ≤2 weeks) provide independent safety checks beyond ICD classification. Performance thresholds validated per R-TF-028-001: Top-1 ≥50%, Top-3 ≥60%, Top-5 ≥70% ensure multi-option differential support. Urgent referral binary indicator (AUC ≥0.80) flags high-risk lesions requiring rapid evaluation regardless of specific ICD classification. Intended use statement (R-TF-Device Description) clearly states outputs are interpretative distributions to support (not replace) healthcare professional clinical judgment. User warnings in IFU about limitations of AI diagnosis and need for clinical correlation, biopsy for suspected malignancy. Clinical validation (R-TF-015-001) demonstrates AI-assisted diagnosis improves accuracy compared to physicians alone (literature: Han et al. 2020, Liu et al. 2020). Confidence scores (probability values) accompany predictions to indicate certainty level, enabling clinician judgment. Post-market surveillance monitors misclassification patterns via user feedback and serious adverse event reporting per GP-007. User training emphasizes differential diagnosis approach and clinical decision-making responsibility | Critical (4) | Low (2) | 8 | Tolerable | YES | ICD misclassification can mislead clinicians toward incorrect diagnoses, particularly for serious conditions like melanoma potentially causing delayed cancer diagnosis, inappropriate treatment, and patient harm per ISO 14971 harm assessment and MDR GSPR clinical benefit-risk requirements. | R-SKK |
| 23 | AI Risk | AI-RISK-023 | Clinical Model Failure: Visual Sign Severity Misquantification Leading to Incorrect Clinical Assessment | Visual sign quantification models (erythema RMAE ≤14%, desquamation RMAE ≤17%, induration RMAE ≤36%, pustule RMAE ≤30%, crusting/xerosis/swelling/oozing/excoriation/lichenification RMAE ≤20%) significantly over- or under-estimate severity beyond acceptable thresholds, providing inaccurate quantitative data on clinical signs to healthcare professionals. | Inaccurate visual sign severity data (e.g., erythema intensity underestimated or overestimated) may mislead healthcare professionals in their clinical assessment, potentially affecting treatment decisions made by the clinician based on the quantitative data provided. | R-TF-028-001: Each visual sign model specifies RMAE thresholds—erythema ≤14%, desquamation ≤17%, induration ≤36%, pustule ≤30%, crusting/xerosis/swelling/oozing/excoriation/lichenification ≤20%. All require performance superior to inter-observer variability with 95% CI. | Moderate (3) | Moderate (3) | 9 | Tolerable | Each visual sign model validated to specific RMAE thresholds per R-TF-028-001, demonstrated superior to typical inter-observer variability among experts. Multiple independent visual sign assessments provide redundancy—single model error does not affect other visual sign outputs. Performance validation against multi-expert consensus (minimum 3 dermatologists) ensures robust reference standard per R-TF-028-004. Visual sign severity data presented as quantitative data to support (not replace) clinical assessment—clinician retains full decision authority per intended use. Model calibration using temperature scaling ensures output probability distributions reflect true confidence per R-TF-028-002. Clinical validation studies assess correlation between automated visual sign quantification and dermatologist assessments per R-TF-015-001. Performance monitoring in post-market surveillance identifies systematic bias patterns via user feedback per GP-007. User guidance in IFU recommends clinical correlation and professional judgment for all treatment decisions | Moderate (3) | Low (2) | 6 | Acceptable | YES | Visual sign severity misquantification provides inaccurate quantitative data that may mislead healthcare professionals in clinical assessment, potentially affecting treatment decisions per ISO 14971 harm analysis. | R-SKK |
| 24 | AI Risk | AI-RISK-024 | Non-Clinical Model Failure: Domain Validation Error Routing Non-Skin Images to Clinical Analysis | Domain Validation non-clinical model (Lightweight Vision Transformer) incorrectly classifies non-skin images (text documents, random objects, non-skin body parts, internal organ images) as skin 'clinical' or 'dermoscopic' domain images, allowing them to proceed through the quality gates to clinical diagnostic models (ICD, severity, wound assessment). | Clinical models (ICD Category Distribution, Visual Sign Quantification, Wound Assessment) process inappropriate inputs not within their operational domain, producing meaningless outputs (random ICD probabilities, nonsensical severity scores) that could mislead clinicians if not recognized as invalid. | R-TF-028-001 Section 'Domain Validation': Non-clinical model using Lightweight Vision Transformer for three-class classification (clinical skin, dermoscopic skin, non-skin). Performance thresholds: overall accuracy ≥95%, non-skin precision ≥0.95, non-skin recall ≥0.90. Model serves as critical gateway in processing pipeline. | Moderate (3) | Low (2) | 6 | Acceptable | Domain Validation model achieves high performance thresholds per R-TF-028-001: overall accuracy ≥95%, non-skin precision ≥0.95 (high specificity for rejecting non-skin), non-skin recall ≥0.90. Conservative decision threshold favoring rejection of ambiguous inputs—prioritize specificity for skin acceptance over sensitivity. Multi-stage quality gates: Domain Validation → DIQA quality assessment → clinical analysis; multiple opportunities to reject inappropriate inputs. API response clearly indicates domain validation failure with appropriate error code when image rejected. Clinical models include additional confidence thresholds and sanity checks that may flag unusual input characteristics. User guidance in IFU specifies appropriate image types (clinical dermatological photographs, dermoscopic images) with examples. Post-market surveillance monitors domain validation failure patterns and user-reported inappropriate acceptances via GP-007. Logging and monitoring of domain validation decisions enables detection of systematic edge cases for model improvement | Moderate (3) | Very low (1) | 3 | Acceptable | YES | Domain validation errors allowing non-skin images to clinical analysis produce meaningless outputs that could mislead clinicians if not recognized as invalid per ISO 14971 and intended use requirements. | R-SKK, R-VL1 |
| 25 | AI Risk | AI-RISK-025 | Non-Clinical Model Failure: DIQA Incorrectly Accepts Poor Quality Images | Dermatology Image Quality Assessment (DIQA) non-clinical model (EfficientNet-based) assigns acceptable quality scores (≥6 on 0-10 scale) to poor-quality images (blurry, poorly lit, motion artifact, obstructed lesions, overexposed) allowing them to proceed to clinical analysis models that require diagnostic-quality inputs. | Clinical models (ICD Category Distribution, Visual Sign Quantification, Wound Assessment) analyze low-quality images with insufficient diagnostic information, resulting in unreliable ICD probability distributions, inaccurate severity scores (erythema intensity on overexposed image), and compromised clinical decision support. | R-TF-028-001 Section 'Dermatology Image Quality Assessment': EfficientNet-based non-clinical model with multi-dimensional quality assessment (focus, lighting, framing, artifacts, resolution). Performance thresholds: binary accept/reject accuracy ≥90%, sensitivity ≥85%, specificity ≥85%, Pearson correlation ≥0.80. Clinical threshold: score ≥6 for analysis. | Moderate (3) | Moderate (3) | 9 | Tolerable | DIQA model validated to performance thresholds per R-TF-028-001: binary accuracy ≥90%, sensitivity ≥85% (reject detection), specificity ≥85% (accept detection), Pearson correlation ≥0.80 with expert quality assessment. Multi-dimensional quality assessment evaluates: focus/sharpness, lighting adequacy, proper framing, absence of artifacts, sufficient resolution. Conservative acceptance threshold (≥6 on 0-10 scale) ensures only clearly acceptable diagnostic-quality images proceed to clinical analysis. Real-time quality score feedback via API response enables client applications to prompt immediate image retake for poor-quality submissions. User guidance in IFU on optimal imaging practices: lighting, distance, focus, avoiding motion blur. Clinical models may exhibit graceful performance degradation on borderline-quality images while maintaining safety margins. Post-market surveillance monitors correlation between DIQA scores and clinical model performance via GP-007 feedback mechanisms. User feedback mechanism allows clinicians to flag cases where quality assessment appeared inappropriate for model improvement. Periodic DIQA model re-validation on contemporary device camera characteristics and emerging image quality distributions | Moderate (3) | Low (2) | 6 | Acceptable | YES | DIQA accepting poor-quality images allows clinical models to analyze inputs with insufficient diagnostic information, resulting in unreliable ICD distributions and severity scores that compromise clinical decision support per ISO 14971. | R-SKK, R-VL1 |
| 26 | AI Risk | AI-RISK-026 | Clinical Model Failure: Binary Indicator False Negative for Malignancy/Urgent Referral | Binary indicator models (particularly 'malignant', 'pre-malignant', 'urgent referral ≤48h', 'high-priority referral ≤2 weeks') fail to flag high-risk lesions requiring immediate specialist evaluation, providing false reassurance when malignancy or urgent pathology is present. | Delayed diagnosis of malignancy (melanoma, squamous cell carcinoma, basal cell carcinoma) or failure to expedite urgent referrals for rapidly progressing lesions, leading to disease progression, potential metastasis, and significantly worse patient outcomes. | R-TF-028-001 Section 'Binary Indicators': Six indicators defined—malignant, pre-malignant, associated with malignancy, pigmented lesion, urgent referral (≤48h), high-priority referral (≤2 weeks). Each requires AUC ≥0.80 with 95% CI. Mapping matrix M_ij aggregates ICD probabilities to indicators (see the worked example following this table). | Critical (4) | Moderate (3) | 12 | Tolerable | Binary indicators validated to AUC ≥0.80 per R-TF-028-001, ensuring good discriminative performance across all six indicators. Sensitivity optimization for critical safety indicators (malignant, urgent referral) during training to minimize false negatives even at cost of some false positives. Multiple redundant binary indicators provide overlapping coverage: malignant + pre-malignant + associated with malignancy + pigmented lesion + urgent referral + high-priority referral. Dermatologist-validated mapping matrix M_ij documented in R-TF-028-001 ensures appropriate ICD-11 categories contribute to each safety indicator. Intended use clearly positions device as decision support, not autonomous diagnostic system—clinician retains diagnostic responsibility per IFU. ICD category distribution provides independent information stream—high-risk diagnoses (melanoma, SCC, BCC) in top-5 may prompt clinical suspicion even if binary indicator subthreshold. Clinical validation (R-TF-015-001) includes assessment of missed high-risk cases and impact on diagnostic workflow safety. User warnings in IFU emphasize that negative binary indicators do not rule out serious disease—clinical judgment paramount, biopsy indicated for any suspicious lesion. Post-market surveillance specifically monitors malignancy detection performance and urgent referral appropriateness with serious adverse event reporting per GP-007. Periodic re-validation on emerging malignancy presentations and evolving referral guidelines (NICE, EADV) | Critical (4) | Low (2) | 8 | Tolerable | YES | False negatives for malignancy/urgent referral indicators provide false reassurance, potentially leading to delayed diagnosis of melanoma and other skin cancers, disease progression, metastasis, and significantly worse patient outcomes per ISO 14971 critical harm assessment and MDR GSPR clinical benefit-risk requirements. | R-SKK |
| 27 | AI Risk | AI-RISK-027 | Environmental Drift: Model Performance Degradation in Telemedicine vs. Clinical Settings | Models validated primarily on professional clinical photography exhibit degraded performance when used with patient self-captured images in telemedicine scenarios due to differences in imaging quality, framing, lighting, and technique. | Reduced diagnostic accuracy and unreliable severity assessments in telemedicine applications, limiting device utility and potentially causing misdiagnosis in remote care settings. | R-TF-028-001 Section 'Integration and Environment': models must handle variability in acquisition. Intended use includes both professional and patient-captured images. Validation requirements include diverse imaging contexts. | Moderate (3) | Moderate (3) | 9 | Tolerable | Training data includes both professional clinical photography and patient self-captured images to ensure robustness. DIQA model provides quality gate for both professional and patient-captured images with consistent thresholds. Real-time quality feedback during image capture helps patients achieve acceptable quality in telemedicine settings. Validation stratified by acquisition context (professional vs. patient-captured, clinical vs. telemedicine). User guidance and training specific to telemedicine image capture (patient education materials). Post-market surveillance monitors performance separately for telemedicine vs. in-clinic usage. Graceful degradation: models provide confidence indicators reflecting image quality and acquisition context. Clinical validation includes telemedicine use cases with representative patient-captured images | Moderate (3) | Low (2) | 6 | Acceptable | YES | Environmental drift causes performance degradation in telemedicine vs. clinical settings, reducing diagnostic accuracy and reliability in remote care per ISO 14971 and MDR GSPR requirements for intended use validation. | R-SKK, R-VL1 |
| 28 | AI Risk | AI-RISK-028 | Multi-Model Pipeline Failure: Cascading Errors Across Dependent Models | Errors in upstream non-clinical models (domain validation, DIQA, skin segmentation) propagate to downstream clinical models, compounding errors and producing highly unreliable clinical outputs (see the pipeline sketch following this table). | Systematic failures in the AI pipeline produce completely erroneous diagnostic suggestions or severity assessments when multiple models fail sequentially. | R-TF-028-001: Multiple models operate in a pipeline (domain validation → DIQA → skin segmentation → clinical models). Integration requirements specify compatibility, but error propagation must be managed. | Critical (4) | Low (2) | 8 | Tolerable | Each model in pipeline validated independently with high performance thresholds to minimize individual failure probability. Quality gates at multiple stages (domain validation, DIQA) prevent propagation of clearly inappropriate inputs. Confidence scoring at each pipeline stage allows downstream models to account for upstream uncertainty. End-to-end integration testing validates full pipeline performance, not just individual model performance. Graceful degradation: pipeline failures result in output suppression or low-confidence flagging rather than erroneous high-confidence outputs. Monitoring and logging of pipeline stage outputs enables detection of systematic failure patterns. Clinical validation assesses real-world pipeline performance under diverse conditions. User interface indicates when outputs have low confidence due to quality or processing issues. Post-market surveillance monitors pipeline failure modes and cascading error patterns | Critical (4) | Very low (1) | 4 | Acceptable | YES | Cascading errors in the multi-model pipeline compound to produce highly unreliable clinical outputs when multiple models fail sequentially, potentially causing systematic failures per ISO 14971. | R-SKK, R-VL1 |
| 29 | AI Risk | AI-RISK-029 | Regulatory Non-Compliance: AI Models Not Meeting MDR/RDC/EU AI Act Requirements | AI models do not meet regulatory requirements for clinical validation, performance documentation, risk management, transparency, or bias prevention under EU MDR 2017/745 (Class IIb), Brazilian RDC 751/2022 (Class II), or EU AI Act (high-risk AI system) requirements, jeopardizing regulatory approval and market access. | Regulatory rejection by BSI or ANVISA, inability to market device in EU/Brazil, delays in patient access to technology, significant financial losses, and potential legal consequences for non-compliant device commercialization. | R-TF-028-001: Device classified as Class IIb (MDR 2017/745 Rule 11), Class II (RDC 751/2022). AI models integral to intended use per R-TF-Device Description. EU AI Act: high-risk AI system (medical device AI). Full technical documentation required per MDR Annex II, MDCG 2020-1 (Clinical Evaluation of SaMD). | Critical (4) | Low (2) | 8 | Tolerable | Comprehensive AI/ML documentation suite per MDR Annex II and MDCG 2020-1: R-TF-028-001 (AI Description), R-TF-028-002 (Development Plan), R-TF-028-003 (Data Collection Instructions), R-TF-028-004 (Data Annotation Instructions), R-TF-028-005 (Development Report), R-TF-028-006 (Release Report), R-TF-028-009 (Design Checks), R-TF-028-011 (AI Risk Assessment). Clinical validation studies designed to meet MDR/RDC requirements for Class IIb device with AI per R-TF-015-001 Clinical Evaluation Plan and MDCG 2020-1 guidance. AI/ML Risk Assessment (this document) integrated with overall device risk management per ISO 14971:2019, with traceability to R-TF-013-001 Risk Management Plan. Transparency and explainability features per EU AI Act Article 13: intended use clarity, performance disclosure, limitation documentation in IFU. Bias and fairness assessment documented for Fitzpatrick skin types I-VI populations per EU AI Act Article 10 (data governance) and MDR GSPR 22 (non-discrimination). Post-market surveillance plan includes AI-specific performance monitoring per GP-007 and adverse event reporting per MDR Article 87. Quality management system (ISO 13485:2016 certification in progress with BSI) encompasses AI development per GP-028. Cybersecurity risk management addressing AI-specific threats per IEC 81001-5-1 and MDCG 2019-16. Technical documentation structured to address all MDR Annex II requirements, RDC 751/2022 Chapter III, and EU AI Act Annex IV. Regulatory strategy includes proactive engagement with BSI (NB 2797) and ANVISA for AI-specific guidance | Critical (4) | Very low (1) | 4 | Acceptable | NO | Regulatory non-compliance is primarily a market access and legal risk rather than a direct patient safety risk. However, regulatory requirements (MDR GSPR, EU AI Act) are designed to ensure patient safety—compliance demonstrates safety assurance per ISO 14971 and MDR. | - |
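The sequential gating described in AI-RISK-024, AI-RISK-025, and AI-RISK-028 is summarised below as a minimal control-flow sketch. It illustrates the fail-closed design rather than the device implementation: the function names (`classify_domain`, `score_quality`), the `GateResult` type, the error code strings, and the placeholder return values are hypothetical, while the three-class domain labels and the DIQA acceptance threshold (score ≥6 on a 0-10 scale) are taken from R-TF-028-001 as cited in the rows above.

```python
from dataclasses import dataclass
from typing import Optional

DIQA_ACCEPT_THRESHOLD = 6.0  # R-TF-028-001: scores >=6 (0-10 scale) proceed to analysis


def classify_domain(image) -> str:
    """Stand-in for the Lightweight ViT domain validator (AI-RISK-024).

    Returns 'clinical', 'dermoscopic', or 'non-skin'; the real model is
    thresholded conservatively so ambiguous inputs are rejected as non-skin.
    """
    return "clinical"  # placeholder result for illustration only


def score_quality(image) -> float:
    """Stand-in for the EfficientNet-based DIQA model (AI-RISK-025)."""
    return 7.5  # placeholder score for illustration only


@dataclass
class GateResult:
    accepted: bool
    error_code: Optional[str] = None  # hypothetical API error codes
    detail: str = ""


def quality_gates(image) -> GateResult:
    """Run the sequential quality gates that precede clinical analysis.

    A rejection at either stage suppresses clinical output rather than letting
    an out-of-domain or low-quality input cascade downstream (AI-RISK-028).
    """
    if classify_domain(image) == "non-skin":
        return GateResult(False, "DOMAIN_VALIDATION_FAILED",
                          "Input is outside the device's operational domain.")

    quality = score_quality(image)
    if quality < DIQA_ACCEPT_THRESHOLD:
        return GateResult(False, "IMAGE_QUALITY_REJECTED",
                          f"Quality score {quality:.1f} is below the acceptance threshold.")

    return GateResult(True)
```

The point of the structure is that a failure at any gate surfaces as an explicit, machine-readable rejection instead of allowing an inappropriate input to reach the clinical models with apparent validity.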
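As referenced in AI-RISK-026, the binary indicators are derived from the ICD-11 probability distribution via a mapping matrix M_ij, so that each indicator score is s_i = Σ_j M_ij · p_j. The sketch below shows these mechanics on a deliberately tiny example: the category labels, matrix entries, and decision threshold are illustrative placeholders, and only three of the six indicators appear. The validated, dermatologist-reviewed matrix is documented in R-TF-028-001.

```python
import numpy as np

# Illustrative ICD-11 category probabilities p_j (summing to 1); placeholder labels.
categories = ["melanoma", "benign_nevus", "psoriasis"]
p = np.array([0.15, 0.70, 0.15])

# Illustrative mapping matrix M_ij: rows are binary indicators, columns are ICD
# categories; an entry of 1 means category j contributes to indicator i. The
# validated matrix is documented in R-TF-028-001.
indicators = ["malignant", "pigmented lesion", "urgent referral (<=48h)"]
M = np.array([
    [1, 0, 0],  # malignant: melanoma contributes
    [1, 1, 0],  # pigmented lesion: melanoma and benign nevus contribute
    [1, 0, 0],  # urgent referral: melanoma contributes
])

# Indicator scores s_i = sum_j M_ij * p_j
scores = M @ p

# Safety-critical indicators use sensitivity-optimised (low) decision thresholds
# to minimise false negatives; 0.10 here is a placeholder, not the device value.
THRESHOLD = 0.10
for name, s in zip(indicators, scores):
    print(f"{name}: score={s:.2f} -> {'POSITIVE' if s >= THRESHOLD else 'negative'}")
```

Because each indicator is a weighted sum over the full distribution, a high-risk category can trigger a positive safety indicator even when it is not the top-ranked diagnosis, consistent with the overlapping-coverage rationale in the mitigation column.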
Summary of Risk Assessment
Risk Distribution
After implementation of all mitigation measures, the 29 identified AI risks are distributed across the risk classes of the R-TF-013-001 5×5 risk matrix as follows (a short bookkeeping sketch follows the table):
| Risk Class | Count | Percentage |
|---|---|---|
| Acceptable | 15 | 52% |
| Tolerable | 14 | 48% |
| Unacceptable | 0 | 0% |
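For readers cross-checking the table against the individual rows, each RPN above is the product of the severity and likelihood levels of the 5×5 matrix (for example, Critical (4) × Low (2) = 8). The bookkeeping sketch below tallies the six risks AI-RISK-024 through AI-RISK-029 detailed above; the acceptance bands in `classify` are illustrative placeholders chosen to reproduce the residual classifications shown here, and the authoritative per-cell acceptance regions are those defined in R-TF-013-001.

```python
from collections import Counter


def rpn(severity: int, likelihood: int) -> int:
    """RPN is the product of the 5x5 matrix levels (1-5 each)."""
    return severity * likelihood


# Residual (severity, likelihood) pairs from rows AI-RISK-024..029 above.
residual = {
    "AI-RISK-024": (3, 1),
    "AI-RISK-025": (3, 2),
    "AI-RISK-026": (4, 2),
    "AI-RISK-027": (3, 2),
    "AI-RISK-028": (4, 1),
    "AI-RISK-029": (4, 1),
}


def classify(severity: int, likelihood: int) -> str:
    # Placeholder acceptance bands chosen to reproduce the classifications in
    # this document; the authoritative per-cell regions are in R-TF-013-001.
    score = rpn(severity, likelihood)
    if score >= 10:
        return "Unacceptable"
    if score >= 8:
        return "Tolerable"
    return "Acceptable"


counts = Counter(classify(s, l) for s, l in residual.values())
print(counts)  # Counter({'Acceptable': 5, 'Tolerable': 1})
```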
Critical Risks Requiring Ongoing Monitoring
The following risks, each with residual RPN 8, remain Tolerable rather than Acceptable after mitigation and require enhanced post-market surveillance:
- AI-RISK-016 (RPN 8): Model robustness to imaging variability
- AI-RISK-022 (RPN 8): ICD category misclassification
- AI-RISK-026 (RPN 8): Binary indicator false negatives for malignancy
These risks are monitored through the following mechanisms (a review sketch follows the list):
- Post-market clinical follow-up (PMCF) per GP-007
- User feedback analysis and complaint handling
- Periodic safety update reports (PSUR)
- Annual AI model performance review
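As a sketch of what the annual AI model performance review can look like for the binary indicators, the snippet below recomputes AUC with a bootstrap 95% CI on a post-market evaluation set and flags indicators whose lower confidence bound falls under the AUC ≥0.80 threshold from R-TF-028-001. The function names, the synthetic data, the bootstrap method, and the lower-bound acceptance rule are illustrative assumptions, not the procedure mandated by GP-007.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

AUC_THRESHOLD = 0.80  # R-TF-028-001: each binary indicator requires AUC >= 0.80 (95% CI)


def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUC (illustrative method choice)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi


def review_indicator(name, y_true, y_score):
    """Flag an indicator for investigation when the lower CI bound drops below threshold."""
    auc = roc_auc_score(y_true, y_score)
    lo, hi = bootstrap_auc_ci(y_true, y_score)
    status = "OK" if lo >= AUC_THRESHOLD else "INVESTIGATE"
    print(f"{name}: AUC={auc:.3f} (95% CI {lo:.3f}-{hi:.3f}) -> {status}")


# Hypothetical post-market labels and scores for one indicator, for illustration only.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 500)
scores = np.clip(labels * 0.35 + rng.normal(0.4, 0.2, 500), 0, 1)
review_indicator("malignant", labels, scores)
```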
Residual Risk Acceptability
All identified AI risks have been reduced to Acceptable or Tolerable levels through the implemented control measures. The overall residual risk is acceptable when weighed against the clinical benefits demonstrated in the Clinical Evaluation Report (R-TF-015-002):
- Improved diagnostic accuracy: AI-assisted diagnosis improves Top-5 diagnostic accuracy compared with physician assessment alone
- Reduced inter-observer variability: Objective severity scoring reduces measurement variability
- Enhanced clinical workflow: Decision support improves efficiency without replacing clinical judgment
- Patient access: Enables remote dermatological assessment, expanding access to specialist expertise
Integration with Device Risk Management
This AI Risk Assessment is integrated with the overall device risk management per ISO 14971:
- Risks transferred to safety: 21 of 29 AI risks (72%) are linked to device-level safety risks in R-TF-013-002
- Residual risk evaluation: Combined with non-AI device risks for overall benefit-risk determination
- Change management: AI model updates follow Predefined Change Control Plan (PCCP) with risk re-evaluation
References
| Document | Title |
|---|---|
| R-TF-013-001 | Risk Management Plan |
| R-TF-013-002 | Risk Assessment |
| R-TF-028-001 | AI Description |
| R-TF-028-002 | AI Development Plan |
| R-TF-028-005 | AI Development Report |
| R-TF-015-001 | Clinical Evaluation Plan |
| R-TF-015-002 | Clinical Evaluation Report |
| GP-007 | Post-Market Surveillance |
| GP-028 | AI Development Procedure |
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members expected to participate in the approval of this document, and their roles in the approval process as defined in the Annex I Responsibility Matrix of GP-001, are:
- Author: JD-009
- Reviewer: JD-009
- Approver: JD-005