R-TF-028-011 AI Risk Assessment

Risk Assessment Table

Each entry in the table below records the following fields: ID, issue type, issue key, summary, root cause / sequence of events, consequences, the AI specifications originating the risk, the initial evaluation (severity, likelihood, RPN, risk class), the risk control measures, and the residual evaluation (severity, likelihood, RPN, risk class). A column for a related safety risk issue key exists but is not populated for the risks below. The RPN (Risk Priority Number) is the product of the severity and likelihood scores.
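
Because the same severity and likelihood scales recur in every entry, a minimal sketch of the RPN arithmetic is included here. Only the score labels that appear in this table are listed; the full scales and the mapping from scores to a risk class are defined in the risk management plan, not in this sketch.

```python
# Minimal sketch of the RPN arithmetic used in the entries below.
# Only the severity/likelihood labels appearing in this table are
# listed; the full scales and the score-to-class mapping live in
# the risk management plan.

SEVERITY = {"Minor": 2, "Moderate": 3, "Critical": 4}
LIKELIHOOD = {"Very low": 1, "Low": 2, "Moderate": 3, "High": 4}

def rpn(severity: str, likelihood: str) -> int:
    """Risk Priority Number: severity score times likelihood score."""
    return SEVERITY[severity] * LIKELIHOOD[likelihood]

# Example: AI-RISK-001 initial evaluation, Critical (4) x Moderate (3).
assert rpn("Critical", "Moderate") == 12
```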

AI-RISK-001 - Dataset Not Representative of Intended Use Population (ID 1, AI Risk)

Root cause / sequence of events: Collected dermatological images do not adequately represent the diversity of the target population (skin types, anatomical locations, conditions, demographics) specified in the intended use, producing models that fail to generalize across all patient populations.

Consequences: AI models underperform on underrepresented patient subgroups (particularly darker Fitzpatrick skin types V-VI, rare conditions, and specific anatomical sites), leading to diagnostic errors, incorrect severity assessments, and health inequities in clinical use.

AI specifications originating the risk: R-TF-028-001 Section 'Data Specifications' requires representative data across demographics, Fitzpatrick skin types I-VI, anatomical sites, and dermatological conditions. Multiple algorithm sections specify validation requirements across diverse populations.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Data collection from multiple prospective sources (hospital dermatology departments) ensuring real-world diversity
  • Retrospective data collection from dermatological atlases including rare conditions
  • Documented demographics of collected datasets including Fitzpatrick skin type distribution, age, gender, anatomical sites, and condition prevalence (R-TF-028-005 Development Report)
  • Stratified sampling to ensure balanced representation across critical demographic and clinical variables
  • Bias analysis and fairness evaluation across Fitzpatrick skin types and other demographic factors
  • Independent evaluation on hold-out test sets with documented population characteristics
  • Performance reporting stratified by skin type, anatomical site, and other key variables

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-002 - Data Annotation Errors by Expert Dermatologists (ID 2, AI Risk)

Root cause / sequence of events: Expert dermatologists provide incorrect or inconsistent ground-truth labels for ICD-11 categories, binary indicators, visual severity signs (erythema, desquamation, etc.), wound characteristics, or other annotations, due to inter-observer variability, unclear annotation guidelines, or annotation fatigue.

Consequences: Models are trained and validated on erroneous or inconsistent ground truth, resulting in unreliable predictions, incorrect severity assessments, and misleading clinical outputs that could impact patient care.

AI specifications originating the risk: R-TF-028-001 requires expert dermatologist annotations for all clinical models (ICD Category Distribution, Visual Sign Quantification, Wound Assessment). R-TF-028-004 Data Annotation Instructions specify the annotation protocols.

Initial evaluation: severity Critical (4), likelihood High (4), RPN 16, risk class Unacceptable

Risk control measures:
  • All annotations performed exclusively by board-certified dermatologists with appropriate sub-specialization
  • Comprehensive annotation training and calibration sessions for all annotators
  • Detailed, reproducible annotation instructions documented in R-TF-028-004 with visual examples and edge case guidance
  • Multi-expert annotation with consensus or adjudication protocols for final ground truth determination
  • Inter-rater agreement assessment (Cohen's κ, ICC) for all annotation tasks against documented thresholds (a minimal computation sketch follows this entry)
  • Annotation review system with senior dermatologist oversight and quality checks
  • Automated outlier detection to identify potentially erroneous annotations for re-review
  • Regular annotation quality audits and re-calibration sessions during data collection

Residual evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Tolerable
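
As referenced in the inter-rater agreement control above, the sketch below shows one way such a check could be computed for categorical labels from two annotators. The example labels and the 0.6 re-calibration threshold are illustrative assumptions; the documented thresholds are those in R-TF-028-004.

```python
# Minimal sketch: inter-rater agreement on categorical labels.
# Example labels and the 0.6 threshold are illustrative only.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["psoriasis", "eczema", "psoriasis", "acne", "eczema"]
annotator_b = ["psoriasis", "eczema", "acne", "acne", "eczema"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Flag tasks whose agreement falls below the documented threshold
# (0.6 here is a placeholder) for a re-calibration session.
if kappa < 0.6:
    print("Agreement below threshold: trigger re-calibration")
```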

AI-RISK-003 - Inadequate Model Evaluation Metrics or Test Data (ID 3, AI Risk)

Root cause / sequence of events: AI models are evaluated using inappropriate metrics, insufficient test data, or non-independent datasets, resulting in performance estimates that do not reflect real-world clinical performance.

Consequences: Deployed models perform worse than expected in clinical use, leading to incorrect diagnoses, severity misclassification, inappropriate treatment decisions, and loss of clinician trust.

AI specifications originating the risk: R-TF-028-001 specifies performance endpoints for each algorithm (Top-k accuracy for ICD, RMAE for visual signs, AUC for binary indicators, accuracy for staging). Evaluation requirements are documented throughout.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Detailed evaluation reports for each algorithm documenting all specified metrics (R-TF-028-005 Development Report)
  • Each algorithm performance objective covered by dedicated evaluation against independent test sets
  • Strict sequestration of test data from training and validation sets (held-out, never used for model development)
  • Evaluation code unit tested and version controlled to ensure correctness
  • Performance reported with 95% confidence intervals for all metrics (see the bootstrap sketch after this entry)
  • Stratified evaluation across critical subgroups (Fitzpatrick skin types, anatomical sites, severity levels)
  • Comparison to expert dermatologist performance baselines where applicable
  • Statistical validation of model meeting or exceeding all specified thresholds before deployment

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable
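
The confidence-interval control above could be implemented with, for example, a percentile bootstrap. The sketch below uses synthetic per-image correctness indicators and is illustrative only.

```python
# Minimal sketch: percentile-bootstrap 95% CI for test-set accuracy.
# The correctness indicators are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.8, size=500)  # 1 = prediction correct

boot = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"accuracy = {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```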

AI-RISK-004 - Suboptimal Model Architecture or Hyperparameters (ID 4, AI Risk)

Root cause / sequence of events: The selected deep learning architectures (CNNs, Vision Transformers, segmentation models) or hyperparameters (learning rate, regularization, loss functions) are inappropriate for the dermatological image analysis tasks, leading to poor convergence, overfitting, or underfitting.

Consequences: Models fail to meet the specified performance thresholds (Top-k accuracy, RMAE, AUC), resulting in clinically inadequate outputs that cannot support healthcare professionals in patient assessment.

AI specifications originating the risk: R-TF-028-001 defines performance endpoints for all models (ICD ≥55% Top-1, visual signs RMAE ≤20%, binary indicators AUC ≥0.80, etc.). The development methodology is defined in R-TF-028-002.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Systematic hyperparameter optimization studies documented in development reports
  • Evaluation of multiple state-of-the-art architectures for each task (CNNs, Vision Transformers, hybrid approaches)
  • Use of transfer learning from large-scale dermatological and general image datasets
  • Validation on independent datasets to assess generalization before architecture selection
  • Regularization techniques (dropout, weight decay, data augmentation) to prevent overfitting
  • Early stopping and learning rate scheduling based on validation performance
  • Ablation studies to validate architectural choices and component contributions
  • Peer review of model architecture and training protocols by AI/ML specialists

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-005 - Cybersecurity: Model Extraction or Adversarial Input Attacks (ID 5, AI Risk)

Root cause / sequence of events: Malicious actors attempt to extract proprietary model weights or architecture, or craft adversarial inputs designed to cause incorrect predictions, compromising model security and reliability.

Consequences: Model intellectual property is stolen, or adversarial attacks cause systematic misdiagnoses, severity misclassification, or safety-critical failures in clinical deployment.

AI specifications originating the risk: R-TF-028-001 Section 'Cybersecurity and Transparency': models are deployed within the Legit.Health Plus device, not as a remote API; the integration environment is specified.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Models embedded directly within Legit.Health Plus software application (not cloud-based API), limiting remote attack surface
  • Model weights encrypted within deployment package
  • No remote model serving infrastructure that could be probed or attacked
  • Static models (no continuous learning or online updates) prevent poisoning attacks
  • Application-level security controls (authentication, authorization) restrict unauthorized access
  • Input validation and quality assessment (DIQA model) filters potentially malicious or out-of-distribution inputs
  • Adversarial robustness testing during validation phase
  • Security review and penetration testing of deployment architecture

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-006 - Bias and Fairness: Disparate Performance Across Fitzpatrick Skin Types (ID 6, AI Risk)

Root cause / sequence of events: AI models exhibit significantly degraded performance on darker skin types (Fitzpatrick V-VI) compared to lighter skin types (I-III) due to dataset imbalance, lighting artifacts, or algorithm design, perpetuating health inequities.

Consequences: Patients with darker skin receive inaccurate diagnoses, severity assessments, and clinical recommendations, leading to delayed or inappropriate treatment and exacerbating healthcare disparities.

AI specifications originating the risk: R-TF-028-001 requires validation of all clinical models across Fitzpatrick skin types I-VI. The Fitzpatrick Skin Type Identification (non-clinical) model supports bias mitigation, and the data collection requirements specify skin tone diversity.

Initial evaluation: severity Critical (4), likelihood High (4), RPN 16, risk class Unacceptable

Risk control measures:
  • Prospective data collection ensures representation across all Fitzpatrick types I-VI with documented distribution
  • Stratified performance evaluation with metrics reported separately for each Fitzpatrick type (see the sketch after this entry)
  • Bias analysis and fairness audits conducted during development (R-TF-028-005 Development Report)
  • Fitzpatrick Skin Type Identification model enables skin type-aware quality control and performance monitoring
  • Data augmentation techniques preserve skin tone characteristics while increasing dataset diversity
  • Balanced training strategies to prevent model bias toward overrepresented skin types
  • Minimum performance thresholds enforced for each Fitzpatrick type subgroup
  • Post-market surveillance includes stratified performance monitoring by skin type
  • Clinical validation studies include diverse patient populations across all Fitzpatrick types

Residual evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Tolerable
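
A minimal sketch of the stratified evaluation referenced above follows. The Fitzpatrick groupings, example data, and the 0.70 per-subgroup floor are placeholders; the binding thresholds are those documented in R-TF-028-001.

```python
# Minimal sketch: accuracy stratified by Fitzpatrick type, with a
# per-subgroup floor. Groups, data, and the floor are placeholders.
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])
fitz = np.array(["I-II", "I-II", "V-VI", "V-VI",
                 "III-IV", "III-IV", "V-VI", "I-II"])

for group in np.unique(fitz):
    mask = fitz == group
    acc = (y_true[mask] == y_pred[mask]).mean()
    status = "OK" if acc >= 0.70 else "BELOW FLOOR"
    print(f"Fitzpatrick {group}: accuracy {acc:.2f} ({status})")
```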

AI-RISK-007 - Model Training Failures: Overfitting or Underfitting (ID 7, AI Risk)

Root cause / sequence of events: Models overfit the training data (memorizing rather than generalizing) or underfit it (failing to learn relevant patterns), resulting in poor performance on new patient images during clinical deployment.

Consequences: Overfitting yields excellent training performance but poor real-world generalization; underfitting yields consistently poor performance. Both compromise clinical utility and patient safety.

AI specifications originating the risk: R-TF-028-001 specifies performance thresholds on independent test sets for all models. The Development Plan (R-TF-028-002) defines training procedures and validation protocols.

Initial evaluation: severity Critical (4), likelihood High (4), RPN 16, risk class Unacceptable

Risk control measures:
  • Training monitored using TensorBoard with validation metrics tracked throughout
  • Early stopping based on validation set performance prevents overfitting (see the sketch after this entry)
  • Comprehensive regularization techniques: dropout, weight decay, batch normalization, data augmentation
  • Stratified train/validation/test splits ensure representative evaluation at all stages
  • K-fold cross-validation during hyperparameter tuning for robust parameter selection
  • Learning curve analysis to diagnose overfitting/underfitting and guide corrective actions
  • Independent test set performance as final acceptance criterion (never used during development)
  • Ablation studies validate that model complexity matches task difficulty
  • Training logs and model checkpoints version controlled for traceability and reproducibility

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable
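
The early stopping control above might look like the following Keras sketch; the patience value is a placeholder, and the model and dataset objects are omitted.

```python
# Minimal sketch: early stopping on held-out validation loss.
# The patience value is a placeholder; model/data are omitted.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch validation loss
    patience=10,                 # epochs without improvement tolerated
    restore_best_weights=True,   # roll back to the best checkpoint
)

# model.fit(train_ds, validation_data=val_ds,
#           epochs=200, callbacks=[early_stop])
```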

AI-RISK-008 - Model Deployment Failures: Development vs. Deployed Performance Mismatch (ID 8, AI Risk)

Root cause / sequence of events: Models converted for deployment (e.g., to TensorFlow Lite, ONNX, or mobile-optimized formats) exhibit different numerical outputs or degraded performance compared to the development versions, due to conversion errors, precision loss, or implementation bugs.

Consequences: Deployed models provide inaccurate predictions despite successful validation during development, leading to unreliable clinical outputs and potential patient harm.

AI specifications originating the risk: R-TF-028-001 Section 'Other Specifications': deployment conversion is validated by prediction equivalence testing. The R-TF-028-006 AI/ML Release Report documents deployment validation.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Models deployed using validated frameworks compatible with development environment (e.g., TensorFlow → TensorFlow Lite)
  • Numerical equivalence testing: deployed models compared against development models on identical test inputs (see the sketch after this entry)
  • Integration tests verify end-to-end pipeline produces expected outputs on reference images
  • Quantization and optimization validated to ensure accuracy degradation within acceptable bounds (typically <1%)
  • Visual inspection and statistical comparison of outputs from both development and deployed models
  • Version control and traceability from development models to deployed artifacts (R-TF-028-006 Release Report)
  • Automated regression testing suite runs with each model deployment
  • Clinical validation performed on final deployed models in target deployment environment

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable
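
One way the numerical equivalence testing above could be realized for a TensorFlow to TensorFlow Lite conversion is sketched below; the file paths, input shapes, and tolerance are assumptions, not the validated procedure.

```python
# Minimal sketch: compare a development Keras model against its
# TFLite conversion on shared reference inputs. Paths, shapes, and
# the tolerance are placeholders.
import numpy as np
import tensorflow as tf

keras_model = tf.keras.models.load_model("model_dev.keras")
interpreter = tf.lite.Interpreter(model_path="model_deploy.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.load("reference_inputs.npy").astype(np.float32)  # (N, H, W, 3)
ref = keras_model.predict(x, verbose=0)

for i in range(x.shape[0]):
    interpreter.set_tensor(inp["index"], x[i:i + 1])
    interpreter.invoke()
    lite = interpreter.get_tensor(out["index"])
    # Fail the release check if any prediction drifts past tolerance.
    np.testing.assert_allclose(lite, ref[i:i + 1], atol=1e-4)
```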

AI-RISK-009 - Data Preprocessing Errors Destroying Clinically Relevant Information (ID 9, AI Risk)

Root cause / sequence of events: Image preprocessing operations (resizing, normalization, augmentation) inadvertently remove or alter clinically important features (erythema intensity, lesion boundaries, texture patterns) critical for accurate model predictions.

Consequences: Models fail to learn or detect relevant dermatological features, resulting in poor diagnostic accuracy, incorrect severity assessment, and unreliable clinical outputs.

AI specifications originating the risk: R-TF-028-001 requires all image-based models to preserve clinical features (erythema, desquamation, lesion morphology). The development methodology includes definition of the preprocessing pipeline.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Multiple preprocessing strategies tested during development with ablation studies
  • Visual inspection of preprocessed images by dermatologists to confirm feature preservation
  • Preprocessing pipelines designed to preserve color accuracy (critical for erythema, pigmentation assessment)
  • Augmentation strategies validated to maintain clinical realism (e.g., brightness/contrast adjustments within physiological ranges)
  • Augmentation parameters constrained to prevent unrealistic transformations (e.g., no extreme rotations that violate anatomical constraints)
  • Preprocessing code unit tested with reference images and known expected outputs
  • Documentation of preprocessing rationale and clinical impact assessment (R-TF-028-005 Development Report)
  • Expert dermatologist review of augmentation examples to ensure clinical validity

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-010 - Incorrect Model Integration: Pre/Post-Processing Implementation Errors (ID 10, AI Risk)

Root cause / sequence of events: Models are integrated into the Legit.Health Plus software with incorrect pre-processing (e.g., wrong normalization) or post-processing (e.g., incorrect probability-to-score conversion, wrong BSA weighting) due to implementation bugs or documentation errors.

Consequences: Models produce incorrect outputs despite being correctly trained and validated, leading to erroneous diagnoses, severity scores, and clinical recommendations.

AI specifications originating the risk: R-TF-028-001 specifies the post-processing formulas for visual signs (weighted expected value), binary indicators (mapping matrix), and the PASI calculation (BSA weighting). The R-TF-028-006 Release Report documents the integration.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Detailed integration specifications documented in R-TF-028-006 AI/ML Release Report including mathematical formulas and reference implementations
  • Unit tests for all pre-processing and post-processing functions with known input-output pairs (see the sketch after this entry)
  • Integration tests comparing software implementation against validated Python reference implementation
  • End-to-end validation using reference images with known expected outputs from development
  • Code review of integration code by both AI/ML team and software engineering team
  • Regression testing suite executed with each software build
  • Clinical validation performed on final integrated system (not just standalone models)
  • Traceability from model specifications through implementation to validation results

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable
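
For the weighted-expected-value post-processing step, the unit-test control above might look like the sketch below. The 0-4 grade scale and fixture values are illustrative; the authoritative input-output pairs come from the validated reference implementation documented in R-TF-028-006.

```python
# Minimal sketch: unit test for a weighted-expected-value step that
# turns a probability distribution over severity grades (0-4 here)
# into a continuous score. Scale and fixtures are placeholders.
import numpy as np

def severity_score(probs: np.ndarray) -> float:
    """Expected severity grade under the predicted distribution."""
    grades = np.arange(len(probs))
    return float(np.dot(probs, grades))

def test_severity_score_known_pairs():
    # All probability mass on grade 3 must score exactly 3.
    assert severity_score(np.array([0, 0, 0, 1.0, 0])) == 3.0
    # A uniform distribution over grades 0-4 has expected value 2.
    assert abs(severity_score(np.full(5, 0.2)) - 2.0) < 1e-9
```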

AI-RISK-011 - Insufficient Dataset Size for Model Complexity (ID 11, AI Risk)

Root cause / sequence of events: The collected dataset is too small to support training of deep learning models with millions of parameters, particularly for rare conditions or underrepresented categories, resulting in poor generalization.

Consequences: Models perform poorly on rare conditions, minority skin types, or underrepresented anatomical sites, leading to systematic diagnostic failures for specific patient populations.

AI specifications originating the risk: R-TF-028-001 Section 'Data Specifications' requires large-scale data collection ([NUMBER OF IMAGES] dermatological images) with diversity across conditions, skin types, and anatomical sites. Multiple data sources are specified.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Multi-source data collection strategy: prospective hospital data + retrospective atlas data + evaluation hold-out sets
  • Targeted data collection for rare conditions and underrepresented categories through specialized sources
  • Data augmentation to increase effective training set size while preserving clinical realism
  • Transfer learning from large-scale pretrained models (ImageNet, dermatological datasets) to reduce data requirements
  • Sample size calculations and power analysis to determine adequate dataset sizes per category
  • Learning curve analysis to validate that additional data would not significantly improve performance
  • Documentation of dataset size and composition in R-TF-028-005 Development Report
  • Performance evaluation stratified by category prevalence to identify underperforming rare classes
  • Minimum sample size thresholds per ICD category, severity level, and demographic subgroup

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-012 - Data Collection Protocol Failures (ID 12, AI Risk)

Root cause / sequence of events: Images collected during prospective data collection fail to meet quality standards, lack required metadata, or do not follow the standardized imaging protocols, resulting in unusable or low-quality training data.

Consequences: Insufficient high-quality data for model development, leading to delayed development timelines or to models trained on poor-quality data with suboptimal performance.

AI specifications originating the risk: R-TF-028-001 Section 'Data Specifications' defines prospective and retrospective data collection with quality requirements. R-TF-028-003 Data Collection Instructions define the protocols.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • Comprehensive data collection protocols documented in R-TF-028-003 with clear imaging standards, metadata requirements, and quality criteria
  • Training and certification of data collection personnel (photographers, clinicians)
  • Real-time quality checks during data collection with immediate feedback for non-compliant images
  • Standardized imaging equipment and settings specified in protocols
  • Metadata validation at point of collection to ensure completeness
  • Regular audits of collected data quality by AI/ML team with feedback to collection sites
  • DIQA (Image Quality Assessment) model applied to prospectively collected images to identify quality issues early
  • Iterative protocol refinement based on initial data collection experiences
  • Multiple data collection sites to ensure robustness to site-specific variations

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

AI-RISK-013 - Cybersecurity: Data Breach During Development (ID 13, AI Risk)

Root cause / sequence of events: Patient images and associated clinical data collected for AI development are accessed by unauthorized parties due to inadequate data security controls during collection, storage, or processing.

Consequences: Patient privacy violations, regulatory non-compliance (GDPR, HIPAA), loss of patient trust, and potential legal and financial consequences for the organization.

AI specifications originating the risk: R-TF-028-001 Section 'Cybersecurity and Transparency' requires that data be de-identified/pseudonymized, that research server access be restricted, and that data be securely segregated.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • All patient data de-identified/pseudonymized before transfer to AI development environment
  • Research servers with restricted access controls (authentication, authorization, role-based access)
  • Data encryption at rest and in transit (SSL/TLS for transfers, encrypted storage)
  • Network segregation isolating research data environment from public networks
  • Access logging and monitoring with regular security audits
  • Data processing agreements with all data sources and collaborators
  • Regular security training for all personnel with data access
  • Incident response plan for potential data breaches
  • Compliance with GDPR, HIPAA, and applicable data protection regulations
  • Data retention and deletion policies to minimize exposure window

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-014 - Poor Data Quality: Non-Diagnostic Images in Training Set (ID 14, AI Risk)

Root cause / sequence of events: The training dataset contains a significant proportion of poor-quality images (blurry, poorly lit, obstructed, wrong anatomical site) that were not filtered out during quality control, degrading model learning.

Consequences: Models learn from low-quality examples, reducing accuracy and potentially learning to accept poor-quality inputs that should be rejected, compromising clinical reliability.

AI specifications originating the risk: R-TF-028-001: the DIQA (Image Quality Assessment) non-clinical model filters poor-quality images; data quality requirements appear in the Data Specifications section.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • DIQA (Dermatology Image Quality Assessment) model automatically filters images below quality threshold (score ≥6 for clinical use)
  • All images used for training/evaluation reviewed by expert dermatologists during annotation (quality confirmed during labeling)
  • Multi-stage quality checks by AI/ML team: automated quality metrics, visual inspection, outlier detection
  • Quality criteria defined in data collection protocols (R-TF-028-003)
  • Statistical analysis of dataset quality distributions documented in R-TF-028-005 Development Report
  • Quality-based stratified sampling ensures training set contains only diagnostic-quality images
  • Separate evaluation of model performance on varying quality levels to assess robustness
  • Ongoing quality monitoring during data collection with feedback loops to improve collection procedures

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-015 - Inadequate Development Environment and Infrastructure (ID 15, AI Risk)

Root cause / sequence of events: The AI development environment lacks sufficient computational resources (GPUs), appropriate software libraries, or version control, leading to inefficient development, irreproducible results, or technical failures.

Consequences: Development delays, inability to train complex models effectively, poor model performance due to resource constraints, or inability to reproduce and validate results.

AI specifications originating the risk: R-TF-028-001 Section 'Other Specifications' requires a fixed hardware/software stack with version tracking. The development methodology requires a reproducible environment.

Initial evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

Risk control measures:
  • Dedicated GPU-enabled workstations and cloud compute infrastructure for AI/ML development
  • State-of-the-art deep learning frameworks (TensorFlow, PyTorch) with version pinning
  • Containerized development environment (Docker) ensuring reproducibility and consistency across team
  • Version control for all code, models, and configurations (Git)
  • Dependency management with requirements.txt or conda environment files shared across team
  • Documented software stack versions in development reports (R-TF-028-005)
  • Regular infrastructure updates and maintenance
  • Backup and disaster recovery procedures for development data and model checkpoints
  • Continuous integration/continuous deployment (CI/CD) pipelines for automated testing

Residual evaluation: severity Moderate (3), likelihood Very low (1), RPN 3, risk class Acceptable

AI-RISK-016 - Model Robustness Failures: Sensitivity to Image Acquisition Variability (ID 16, AI Risk)

Root cause / sequence of events: Models are brittle to natural variations in imaging conditions (lighting, camera angle, distance, device type, background) commonly encountered in clinical practice, leading to inconsistent predictions.

Consequences: Model performance degrades significantly when images are captured under non-ideal conditions, limiting clinical utility and potentially causing diagnostic errors in real-world use.

AI specifications originating the risk: R-TF-028-001 Section 'Integration and Environment' requires models to handle acquisition variability. All models specify validation across diverse imaging conditions and devices.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Training data includes diverse imaging conditions (multiple devices, lighting, angles) from prospective clinical collection
  • Data augmentation simulating realistic imaging variations (brightness, contrast, rotation, noise)
  • Validation on images from multiple acquisition sources and device types
  • Color normalization and preprocessing techniques to improve robustness to lighting variations
  • DIQA model provides quality gate rejecting images outside acceptable acquisition parameter ranges
  • Performance evaluation stratified by imaging device, lighting condition, and other technical factors
  • Clinical validation studies using images from target deployment environments and device types
  • User guidance and training on optimal image acquisition practices to minimize extreme variability
  • Robustness testing with intentionally varied imaging conditions during validation

Residual evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Tolerable

AI-RISK-017 - Lack of Transparency: Users Unaware of AI/ML Usage and Limitations (ID 17, AI Risk)

Root cause / sequence of events: Healthcare professionals using Legit.Health Plus are not adequately informed that AI/ML algorithms generate the outputs, do not understand the model limitations, or are unaware that the outputs require clinical interpretation.

Consequences: Over-reliance on AI outputs without critical evaluation, misuse of the device outside its intended use, or failure to recognize situations where manual expert review is required.

AI specifications originating the risk: R-TF-028-001 Section 'Cybersecurity and Transparency' requires user documentation to state each algorithm's purpose, inputs/outputs, limitations, and the fact that AI/ML is used. Transparency requirements appear throughout.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • User manual clearly states that AI/ML algorithms are used in specified features (ICD suggestions, severity assessment, wound analysis)
  • Intended use statement specifies device provides quantitative data and interpretative distributions to support (not replace) healthcare professional assessment
  • Model limitations documented in user manual including: performance metrics, validation population, conditions where manual review recommended
  • User interface clearly indicates AI-generated outputs with appropriate labeling
  • Training materials and user education emphasize clinical interpretation responsibility
  • Confidence scores or uncertainty indicators displayed where appropriate to guide clinical judgment
  • Warnings and contraindications clearly documented for known failure modes or out-of-scope scenarios
  • Post-market surveillance includes user feedback on understanding and appropriate use of AI features

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

AI-RISK-018 - Model Retraining Failures: Performance Degradation After Update (ID 18, AI Risk)

Root cause / sequence of events: When models are retrained with new data or updated algorithms, the retrained models perform worse than the original validated models due to insufficient data, improper retraining procedures, or inadequate validation.

Consequences: A device update introduces models with degraded performance, leading to increased diagnostic errors and compromised patient safety compared to the previous version.

AI specifications originating the risk: R-TF-028-001 specifies performance thresholds for each model; retraining must maintain or improve performance. Update procedures are referenced in the risk management section.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Retrained models follow identical development and validation procedures as original models (same protocols, metrics, thresholds)
  • Retrained models evaluated on same independent test sets as original models for direct comparison
  • Acceptance criteria: retrained models must meet all original performance thresholds (non-inferiority) or demonstrate statistically significant improvement
  • Regression testing ensures retrained models do not introduce new failure modes
  • Clinical validation repeated for models with substantial architectural or data changes
  • Version control and traceability from retraining data through validation to deployment
  • Risk-benefit analysis for model updates considering potential performance changes
  • Predefined Change Control Plan (PCCP) specifies when model retraining is required and validation procedures
  • Regulatory notification/approval processes followed for significant model changes per MDR/RDC requirements

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

AI-RISK-019 - Inappropriate Update Triggers: Unnecessary Model Changes (ID 19, AI Risk)

Root cause / sequence of events: Models are updated or retrained in response to non-critical triggers (minor performance variations, small data additions), causing unnecessary regulatory burden and introducing update-related risks without meaningful benefit.

Consequences: Resources wasted on unnecessary updates, increased risk of introducing errors during the update process, regulatory compliance burden, and potential device downtime during updates.

AI specifications originating the risk: R-TF-028-001 Section 'Specifications and Risks' links risks to the AI/ML Risk Matrix. Update criteria need clear definition to prevent unnecessary changes.

Initial evaluation: severity Minor (2), likelihood Moderate (3), RPN 6, risk class Acceptable

Risk control measures:
  • Predefined Change Control Plan (PCCP) clearly enumerates specific triggers requiring model updates (e.g., safety issues, significant performance drift, regulatory requirements, intended use expansion)
  • Update decision criteria based on quantitative thresholds (e.g., >10% performance degradation, statistically significant subgroup disparities)
  • Risk-benefit analysis required before initiating model update process
  • Regular scheduled reviews of model performance and update need (e.g., annual)
  • Post-market surveillance data analyzed systematically to identify genuine update needs
  • Distinction between critical updates (safety-related, requiring immediate action) and non-critical improvements (can be batched)
  • Documentation of update decision rationale in technical files
  • Stakeholder review (clinical, regulatory, technical) before committing to update process

Residual evaluation: severity Minor (2), likelihood Very low (1), RPN 2, risk class Acceptable

AI-RISK-020 - Model Obsolescence: Dataset No Longer Representative or Technology Outdated (ID 20, AI Risk)

Root cause / sequence of events: Over time, patient population characteristics shift, new dermatological conditions emerge, imaging technology evolves, or AI algorithms advance, making the current models obsolete and underperforming relative to the state of the art.

Consequences: Gradual degradation of model performance, reduced diagnostic accuracy, and suboptimal patient care as clinical practice and patient demographics evolve beyond the model training data.

AI specifications originating the risk: R-TF-028-001: models are trained on current dermatological conditions and imaging modalities. Post-market surveillance is addressed in the risk mitigation section; a technology watch is needed to track AI advancement.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • Post-market surveillance system monitors model performance over time with real-world usage data (per SOP-24 or equivalent)
  • Regular literature review and technology watch for AI/ML advancements in dermatology
  • Performance trending analysis identifies gradual degradation before clinical impact
  • Periodic re-validation on contemporary patient populations to assess ongoing performance
  • Dermatological conditions monitored for epidemiological changes or emerging conditions
  • User feedback and complaint systems capture performance concerns from clinical users
  • Scheduled review cycles (e.g., every 2-3 years) assess need for model updates based on technological advancement
  • PCCP includes obsolescence assessment criteria and triggers for major model updates
  • Modular architecture facilitates targeted model updates without complete system redesign
  • Training data includes diverse conditions and imaging modalities to maximize longevity

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

AI-RISK-021 - Usability Issues: Model Outputs Not Interpretable by Clinical Users (ID 21, AI Risk)

Root cause / sequence of events: AI model outputs (ICD probabilities, severity scores, binary indicators, wound assessments) are presented in a format that is confusing, difficult to interpret, or lacking clinical context, preventing effective use by healthcare professionals.

Consequences: Clinicians are unable to use the AI outputs effectively for patient care, leading to device abandonment, misinterpretation of results, or incorrect clinical decisions based on misunderstood outputs.

AI specifications originating the risk: R-TF-028-001: each model outputs structured clinical information (probabilities, scores, classifications). Usability is not explicitly detailed but is critical to fulfilling the intended use.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • User interface design following clinical workflow and medical device usability principles (IEC 62366-1)
  • Formative usability studies during development to iteratively refine output presentation
  • Summative usability validation (human factors testing) with representative users (dermatologists, primary care physicians, specialists)
  • Clinical outputs accompanied by clear explanations and clinical context (e.g., ICD codes with disease names, severity scores with severity categories)
  • Visual aids (charts, graphs, color coding) to enhance interpretation of quantitative outputs
  • Confidence indicators or uncertainty visualization to support clinical judgment
  • User training materials and documentation explain interpretation of all AI outputs
  • Clinical advisory board review of user interface and output presentation
  • Post-market feedback collection on usability and output interpretability
  • Iterative design improvements based on real-world user experience

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

AI-RISK-022 - Clinical Model Failure: ICD Category Misclassification Leading to Incorrect Diagnosis Suggestion (ID 22, AI Risk)

Root cause / sequence of events: The ICD Category Distribution model assigns high probability to an incorrect disease category, potentially misleading the clinician toward a wrong diagnosis, particularly for visually similar conditions or rare diseases.

Consequences: Delayed correct diagnosis, inappropriate treatment initiation, or failure to recognize serious conditions requiring urgent intervention (e.g., melanoma misclassified as a benign nevus).

AI specifications originating the risk: R-TF-028-001 Section 'ICD Category Distribution': Top-1 accuracy ≥55%, Top-5 accuracy ≥80%. Binary indicators for malignancy risk and urgent referral provide an additional safety layer. Intended use: an interpretative distribution to support (not replace) healthcare professional judgment.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Top-5 ICD suggestions presented (not just top-1) to support differential diagnosis and reduce single-classification risk
  • Binary indicators (malignant, pre-malignant, urgent referral) provide independent safety checks beyond ICD classification
  • Performance thresholds validated: Top-1 ≥55%, Top-3 ≥70%, Top-5 ≥80% ensure multi-option differential support
  • Urgent referral binary indicator (AUC ≥0.80) flags high-risk lesions requiring rapid evaluation regardless of specific ICD
  • Intended use clearly states outputs are interpretative distributions to support (not replace) healthcare professional clinical judgment
  • User warnings about limitations of AI diagnosis and need for clinical correlation
  • Clinical validation demonstrates AI-assisted diagnosis improves accuracy compared to physicians alone (literature-supported design)
  • Confidence scores accompany predictions to indicate certainty level
  • Post-market surveillance monitors misclassification patterns and serious adverse events
  • User training emphasizes differential diagnosis approach and clinical decision-making responsibility

Residual evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Tolerable

AI-RISK-023 - Clinical Model Failure: Visual Sign Severity Misquantification Leading to Incorrect Treatment Decisions (ID 23, AI Risk)

Root cause / sequence of events: The visual sign quantification models (erythema, desquamation, induration, pustules, etc.) significantly over- or under-estimate severity, leading to incorrect automated severity scoring (PASI, EASI) and potentially inappropriate treatment intensity.

Consequences: Under-treatment of severe disease, leading to poor disease control and patient suffering, or over-treatment with unnecessary exposure to potent therapies and their side effects.

AI specifications originating the risk: R-TF-028-001 requires each visual sign model to achieve RMAE ≤20% against expert consensus. Multiple independent severity assessments are combined into composite scores (PASI, EASI), and performance must be superior to inter-observer variability.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • Each visual sign model validated to RMAE ≤20%, outperforming typical inter-observer variability among experts (see the RMAE sketch after this entry)
  • Multiple independent visual sign assessments (erythema, induration, desquamation for PASI) provide redundancy—single model error unlikely to dominate composite score
  • Performance validation against multi-expert consensus (not single rater) ensures ground truth robustness
  • Clinical severity scores (PASI, EASI) integrate multiple components—systematic bias in one component dampened by others
  • Severity scores presented as quantitative data to support (not dictate) treatment decisions—clinician retains decision authority
  • Clinical validation studies assess correlation between automated scores and treatment decisions/outcomes
  • Performance monitoring in post-market surveillance identifies systematic bias patterns
  • User guidance recommends clinical correlation and manual verification for borderline treatment threshold cases
  • Confidence intervals or uncertainty quantification provided for severity scores where applicable

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable
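
The RMAE acceptance check referenced above is sketched below, assuming RMAE denotes the mean absolute error normalized by the span of the severity scale; the binding definition is the one in R-TF-028-001, and the example values are synthetic.

```python
# Minimal sketch of the RMAE <= 20% check, assuming RMAE is MAE
# normalized by the scale span (an assumption; see R-TF-028-001).
import numpy as np

def rmae(pred, truth, scale_min=0.0, scale_max=100.0):
    pred, truth = np.asarray(pred), np.asarray(truth)
    return np.abs(pred - truth).mean() / (scale_max - scale_min)

consensus = [10.0, 35.0, 70.0, 90.0]  # multi-expert consensus scores
model_out = [14.0, 30.0, 75.0, 88.0]  # model predictions
assert rmae(model_out, consensus) <= 0.20  # threshold from this entry
```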

AI-RISK-024 - Non-Clinical Model Failure: Domain Validation Error Routing Non-Skin Images to Clinical Analysis (ID 24, AI Risk)

Root cause / sequence of events: The Domain Validation model incorrectly classifies non-skin images (text documents, random objects, non-skin body parts) as skin clinical or dermoscopic images, allowing them to proceed to the clinical diagnostic models.

Consequences: Clinical models process inappropriate inputs, producing meaningless or erroneous outputs that could mislead clinicians if not recognized as invalid.

AI specifications originating the risk: R-TF-028-001 Section 'Domain Validation': non-skin precision ≥0.95, overall accuracy ≥95%. The model serves as a critical gateway preventing inappropriate inputs from reaching the clinical models.

Initial evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

Risk control measures:
  • Domain Validation model achieves high performance thresholds: overall accuracy ≥95%, non-skin precision ≥0.95, non-skin recall ≥0.90
  • Conservative decision threshold favoring rejection of ambiguous inputs (prioritize specificity for skin acceptance)
  • Multi-stage quality gates: Domain Validation followed by DIQA quality assessment before clinical analysis
  • User interface warnings when domain validation fails, prompting image retake
  • Clinical models include additional sanity checks and confidence thresholds that may flag unusual inputs
  • User training emphasizes capturing appropriate dermatological images as per instructions
  • Post-market surveillance monitors domain validation failure patterns and user-reported inappropriate acceptances
  • Logging and monitoring of domain validation decisions for quality assurance and model performance tracking

Residual evaluation: severity Moderate (3), likelihood Very low (1), RPN 3, risk class Acceptable

AI-RISK-025 - Non-Clinical Model Failure: DIQA Incorrectly Accepts Poor-Quality Images (ID 25, AI Risk)

Root cause / sequence of events: The Dermatology Image Quality Assessment (DIQA) model assigns acceptable quality scores (≥6) to poor-quality images (blurry, poorly lit, obstructed), allowing them to proceed to clinical analysis.

Consequences: Clinical models analyze low-quality images, resulting in unreliable or inaccurate outputs that compromise diagnostic accuracy and patient safety.

AI specifications originating the risk: R-TF-028-001 Section 'Dermatology Image Quality Assessment': binary accept/reject accuracy ≥90%, sensitivity (reject detection) ≥85%, specificity (accept detection) ≥85%. Critical threshold: score ≥6 for clinical use.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • DIQA model validated to high performance thresholds: binary accuracy ≥90%, sensitivity ≥85%, specificity ≥85%, correlation with experts ≥0.80
  • Multi-dimensional quality assessment (focus, lighting, framing, artifacts, resolution) provides robust evaluation
  • Conservative acceptance threshold (≥6 on 0-10 scale) ensures only clearly acceptable images proceed to clinical analysis
  • Real-time quality feedback during image capture enables immediate retake of poor-quality images
  • User guidance on optimal imaging practices reduces frequency of poor-quality submissions
  • Clinical models may exhibit graceful degradation on borderline-quality images (partial information still useful)
  • Post-market surveillance monitors correlation between image quality and clinical model performance
  • User feedback mechanism allows clinicians to flag cases where quality assessment appeared inappropriate
  • Periodic DIQA model re-validation on contemporary image quality distributions

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

AI-RISK-026 - Clinical Model Failure: Binary Indicator False Negative for Malignancy/Urgent Referral (ID 26, AI Risk)

Root cause / sequence of events: Binary indicator models (particularly the malignancy and urgent referral indicators) fail to flag high-risk lesions requiring immediate specialist evaluation, providing false reassurance.

Consequences: Delayed diagnosis of malignancy (melanoma, SCC, BCC) or failure to expedite urgent referrals, leading to disease progression, metastasis, or worse patient outcomes.

AI specifications originating the risk: R-TF-028-001 Section 'Binary Indicators' requires each binary indicator to achieve AUC ≥0.80 (average) and ≥0.75 (minimum). The urgent referral and malignancy indicators are critical for patient safety.

Initial evaluation: severity Critical (4), likelihood Moderate (3), RPN 12, risk class Tolerable

Risk control measures:
  • Binary indicators validated to AUC ≥0.80 (average) and ≥0.75 (minimum individual), ensuring good discriminative performance
  • Sensitivity optimization for critical safety indicators (malignancy, urgent referral) to minimize false negatives even at the cost of some false positives (see the threshold-selection sketch after this entry)
  • Multiple redundant binary indicators provide overlapping coverage (malignant, pre-malignant, associated with malignancy, urgent referral, high-priority referral)
  • Intended use clearly positions device as decision support, not autonomous diagnostic system—clinician retains diagnostic responsibility
  • ICD category distribution provides independent information stream—high-risk diagnoses in top-5 may prompt clinical suspicion even if binary indicator negative
  • Clinical validation includes assessment of missed high-risk cases and impact on diagnostic workflow
  • User warnings emphasize that negative binary indicators do not rule out serious disease—clinical judgment paramount
  • Post-market surveillance specifically monitors malignancy detection and urgent referral performance with serious adverse event reporting
  • Periodic re-validation on emerging malignancy presentations and evolving clinical guidelines

Residual evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Tolerable
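
The sensitivity-first threshold selection referenced above could be sketched as follows; the scores, labels, and the 0.95 sensitivity target are placeholders.

```python
# Minimal sketch: pick an operating threshold that meets a
# sensitivity target, then minimizes false positives among the
# qualifying thresholds. Data and target are placeholders.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.55, 0.15])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
ok = tpr >= 0.95            # thresholds meeting the sensitivity target
best = np.argmin(fpr[ok])   # lowest false-positive rate among them
print("operating threshold:", thresholds[ok][best])
```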

AI-RISK-027 - Non-Clinical Model Failure: Body Site Misidentification Affecting BSA Calculations (ID 27, AI Risk)

Root cause / sequence of events: The Body Site Identification model incorrectly classifies the anatomical location (e.g., an arm misclassified as a leg), leading to incorrect BSA weighting in the PASI calculation and erroneous severity scores.

Consequences: Inaccurate PASI or EASI scores due to incorrect regional BSA weighting, potentially leading to inappropriate treatment intensity decisions.

AI specifications originating the risk: R-TF-028-001 Section 'Body Site Identification': overall accuracy ≥85%, weighted kappa ≥0.80, region-level accuracy ≥90%, adjacent site tolerance ≥95%. Used for PASI/EASI BSA weighting (head/neck 10%, upper extremities 20%, trunk 30%, lower extremities 40%).

Initial evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

Risk control measures:
  • Body site model validated to high accuracy: overall ≥85%, region-level ≥90%, adjacent site tolerance ≥95%
  • Hierarchical classification (broad region → specific site) ensures at minimum correct regional assignment for PASI weighting
  • Adjacent site errors (e.g., wrist vs. hand) have minimal BSA impact as they belong to same broad region (upper extremities)
  • PASI calculation integrates multiple inputs (body site, lesion coverage, severity signs), so a single body-site error is dampened in the overall score (see the PASI sketch after this entry)
  • User interface may display identified body site, allowing clinician to verify or correct if obviously wrong
  • Clinical correlation encouraged—clinicians aware of anatomical site from clinical examination
  • Performance validation specifically assesses impact of body site errors on final PASI score accuracy
  • Post-market surveillance monitors body site identification accuracy and correlation with clinical assessments

Residual evaluation: severity Moderate (3), likelihood Very low (1), RPN 3, risk class Acceptable
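
For reference, the sketch below applies the standard PASI regional weighting cited in the entry above (head/neck 10%, upper extremities 20%, trunk 30%, lower extremities 40%); the input grades are illustrative.

```python
# Minimal sketch of PASI regional weighting: per region,
# (erythema + induration + desquamation) x area score, scaled by
# the regional BSA weight. Input grades are illustrative.
WEIGHTS = {"head/neck": 0.1, "upper": 0.2, "trunk": 0.3, "lower": 0.4}

def pasi(regions: dict) -> float:
    """regions: region -> (erythema, induration, desquamation, area).

    Severity signs are graded 0-4; the area score is graded 0-6.
    """
    return sum(
        WEIGHTS[r] * (e + i + d) * a
        for r, (e, i, d, a) in regions.items()
    )

score = pasi({
    "head/neck": (2, 1, 2, 3),
    "upper": (3, 2, 2, 4),
    "trunk": (1, 1, 1, 2),
    "lower": (2, 2, 3, 5),
})
print(f"PASI = {score:.1f}")  # a body-site error shifts only one term
```

Note how the regional weights bound the impact of a single misidentified site, which is why adjacent-site errors within the same broad region leave the score unchanged.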

AI-RISK-028 - Environmental Drift: Model Performance Degradation in Telemedicine vs. Clinical Settings (ID 28, AI Risk)

Root cause / sequence of events: Models validated primarily on professional clinical photography exhibit degraded performance on patient self-captured images in telemedicine scenarios, due to differences in imaging quality, framing, lighting, and technique.

Consequences: Reduced diagnostic accuracy and unreliable severity assessments in telemedicine applications, limiting device utility and potentially causing misdiagnosis in remote care settings.

AI specifications originating the risk: R-TF-028-001 Section 'Integration and Environment' requires models to handle acquisition variability. The intended use covers both professional and patient-captured images, and the validation requirements include diverse imaging contexts.

Initial evaluation: severity Moderate (3), likelihood Moderate (3), RPN 9, risk class Tolerable

Risk control measures:
  • Training data includes both professional clinical photography and patient self-captured images to ensure robustness
  • DIQA model provides quality gate for both professional and patient-captured images with consistent thresholds
  • Real-time quality feedback during image capture helps patients achieve acceptable quality in telemedicine settings
  • Validation stratified by acquisition context (professional vs. patient-captured, clinical vs. telemedicine)
  • User guidance and training specific to telemedicine image capture (patient education materials)
  • Post-market surveillance monitors performance separately for telemedicine vs. in-clinic usage
  • Reference marker systems (for surface area quantification) designed for use by non-professionals
  • Graceful degradation: models provide confidence indicators reflecting image quality and acquisition context
  • Clinical validation includes telemedicine use cases with representative patient-captured images

Residual evaluation: severity Moderate (3), likelihood Low (2), RPN 6, risk class Acceptable

AI-RISK-029 - Multi-Model Pipeline Failure: Cascading Errors Across Dependent Models (ID 29, AI Risk)

Root cause / sequence of events: Errors in upstream non-clinical models (domain validation, DIQA, skin segmentation, body site identification) propagate to the downstream clinical models, compounding errors and producing highly unreliable clinical outputs.

Consequences: Systematic failures in the AI pipeline produce completely erroneous diagnostic suggestions or severity assessments when multiple models fail sequentially.

AI specifications originating the risk: R-TF-028-001: multiple models operate in a pipeline (domain validation → DIQA → skin segmentation → body site → clinical models). The integration requirements specify compatibility, but error propagation needs explicit management.

Initial evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Acceptable

Risk control measures:
  • Each model in pipeline validated independently with high performance thresholds to minimize individual failure probability
  • Quality gates at multiple stages (domain validation, DIQA) prevent propagation of clearly inappropriate inputs
  • Confidence scoring at each pipeline stage allows downstream models to account for upstream uncertainty
  • End-to-end integration testing validates full pipeline performance, not just individual model performance
  • Graceful degradation: pipeline failures result in output suppression or low-confidence flagging rather than erroneous high-confidence outputs (see the sketch after this entry)
  • Monitoring and logging of pipeline stage outputs enables detection of systematic failure patterns
  • Clinical validation assesses real-world pipeline performance under diverse conditions
  • User interface indicates when outputs have low confidence due to quality or processing issues
  • Post-market surveillance monitors pipeline failure modes and cascading error patterns

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable
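
The graceful-degradation control above is sketched below as staged gates with output suppression; the three stage functions are stand-ins for the validated pipeline components, and the thresholds are placeholders.

```python
# Minimal sketch: staged quality gates with output suppression.
# Stage functions are stand-ins; thresholds are placeholders.
CONFIDENCE_FLOOR = 0.5

def domain_validation(image) -> bool:  # stand-in for the real model
    return True

def diqa_score(image) -> float:        # stand-in for the DIQA model
    return 7.0

def clinical_models(image) -> dict:    # stand-in for clinical stage
    return {"icd_top5": [], "confidence": 0.9}

def analyze(image) -> dict:
    if not domain_validation(image):
        return {"status": "rejected", "reason": "non-skin input"}
    if diqa_score(image) < 6:          # clinical-use quality threshold
        return {"status": "rejected", "reason": "insufficient quality"}
    result = clinical_models(image)
    if result["confidence"] < CONFIDENCE_FLOOR:
        # Suppress rather than emit a low-confidence clinical output.
        return {"status": "low_confidence", "detail": result}
    return {"status": "ok", "detail": result}

print(analyze("reference_image.jpg"))
```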

AI-RISK-030 - Regulatory Non-Compliance: AI Models Not Meeting MDR/RDC Requirements (ID 30, AI Risk)

Root cause / sequence of events: The AI models do not meet the regulatory requirements for clinical validation, performance documentation, risk management, or transparency under EU MDR 2017/745 and Brazilian RDC 751/2022, jeopardizing regulatory approval and market access.

Consequences: Regulatory rejection, inability to market the device, delays in patient access to the technology, financial losses, and potential legal consequences.

AI specifications originating the risk: R-TF-028-001: the device is classified as Class IIb under MDR 2017/745 and RDC 751/2022. The AI models are integral to the intended use and require full regulatory compliance; documentation requirements are specified throughout.

Initial evaluation: severity Critical (4), likelihood Low (2), RPN 8, risk class Acceptable

Risk control measures:
  • Comprehensive AI/ML documentation aligned with regulatory requirements: R-TF-028-001 (AI Description), R-TF-028-002 (Development Plan), R-TF-028-003 (Data Collection), R-TF-028-004 (Annotation), R-TF-028-005 (Development Report), R-TF-028-006 (Release Report), R-TF-028-009 (Design Checks), R-TF-028-011 (Risk Assessment)
  • Clinical validation studies designed to meet MDR/RDC requirements for Class IIb device with AI
  • AI/ML Risk Assessment (this document) integrated with overall device risk management per ISO 14971
  • Transparency and explainability features (intended use clarity, performance disclosure, limitation documentation) per regulatory expectations
  • Post-market surveillance plan includes AI-specific performance monitoring and adverse event reporting
  • Quality management system (ISO 13485) encompasses AI development and validation processes
  • Cybersecurity risk management addressing AI-specific threats
  • Bias and fairness assessment documented for diverse populations
  • Regulatory consultation and feedback incorporated throughout development
  • Technical documentation structured to address all MDR Annex II and RDC requirements for AI-based devices

Residual evaluation: severity Critical (4), likelihood Very low (1), RPN 4, risk class Acceptable

Signature meaning

The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members expected to participate in this document's approval process, and their roles, as defined in Annex I Responsibility Matrix of GP-001, are:

  • Author: Team members involved
  • Reviewer: JD-003, JD-004
  • Approver: JD-001