R-TF-028-003 Data Collection Instructions - Retrospective Data
Table of contents
- Purpose and Scope
- Context and Rationale
- Objectives
- Data Population Characteristics
- Study Design
- Study Type
- Data Collection Timeline
- Dataset Partitioning Strategy
- Original Image Acquisition Context
- Data Retrieval and Ingestion Procedure
- Step 1: Source Identification and Evaluation
- Step 2: Secure Data Retrieval
- Step 3: Initial Data Assessment
- Step 4: Curation and Standardization
- Step 5: De-identification Verification
- Step 6: Validation and Quality Assurance
- Step 7: Ingestion into Main Development Database
- Step 8: Documentation and Traceability
- Data Quality Assessment Criteria
- Collected Data Specification
- Other Specifications
Purpose and Scope
This document defines the systematic protocol for the retrospective collection of dermatological images and associated clinical metadata from publicly available medical datasets and validated online dermatological atlases. This protocol forms part of the data acquisition strategy for the development and validation of the AI/ML algorithms integrated into Legit.Health Plus, a Class IIb medical device under the EU Medical Device Regulation (MDR) 2017/745.
The retrospective dataset serves as the foundational training corpus, providing the scale, diversity, and clinical breadth necessary to develop robust, generalizable, and clinically safe AI/ML models for dermatological disease classification and assessment.
Context and Rationale
The development of high-performing, safe, and effective AI/ML algorithms for dermatological assessment, as intended for Legit.Health Plus, is critically dependent on the quality, diversity, and scale of the training and testing data [cite: 60-62, 430]. To build models that generalize across real-world clinical scenarios, it is essential to source data from a wide variety of contexts, patient populations, imaging conditions, and clinical settings [cite: 448-450].
Retrospective data collection from reputable, publicly available medical datasets and online atlases offers several key advantages [cite: 443]:
- Scale: Access to large volumes of expertly annotated dermatological images that would be impractical to collect prospectively within a reasonable timeframe.
- Diversity: Images captured across multiple institutions, geographic regions, patient demographics, and clinical contexts, enhancing model robustness and reducing institutional bias.
- Expert Validation: Many public repositories include diagnoses confirmed by histopathological analysis or consensus expert opinion, providing high-quality ground truth labels [cite: 442].
- Established Standards: Use of well-characterized, peer-reviewed datasets that are widely recognized in the dermatological AI research community, facilitating benchmarking and validation.
This approach ensures the creation of a comprehensive foundational dataset that is broad in scope, clinically representative, and suitable for training AI/ML models intended for deployment in diverse real-world clinical settings.
Objectives
The primary objectives of this retrospective data collection protocol are:
- Scale and Heterogeneity: To gather a large-scale, heterogeneous dataset of dermatological images suitable for the training, validation, and testing of the AI/ML algorithms in Legit.Health Plus, with an initial target of >27,000 curated images from multiple validated sources.
- Clinical Representativeness: To ensure the dataset is representative of the intended patient population and use environment, covering a wide spectrum of ICD-11 diagnostic categories, patient demographics (age, sex), and all six Fitzpatrick skin phototypes [cite: 67, 69, 452], thereby supporting algorithmic fairness and generalizability.
- Ground Truth Establishment: To acquire expertly validated diagnostic labels and associated clinical metadata to establish a reliable ground truth for each image, enabling supervised learning and robust, clinically meaningful performance evaluation [cite: 521].
- Imaging Modality Coverage: To include both clinical (macroscopic) and dermoscopic images, reflecting the multi-modal nature of contemporary dermatological practice [cite: 724, 725, 731].
- Regulatory Compliance: To execute all data acquisition, processing, and usage activities in full compliance with applicable data protection regulations (GDPR), intellectual property laws, and the licensing terms of the source datasets, ensuring traceability and auditability as required under MDR 2017/745.
Data Population Characteristics
Data Sources and Recruitment Strategy
Data will be sourced retrospectively from publicly available medical datasets and validated online dermatological atlases [cite: 443]. The data sources have been selected based on the following criteria:
- Clinical Validity: Datasets must be peer-reviewed, widely cited in the scientific literature, and recognized by the dermatological and medical AI research communities.
- Diagnostic Quality: Images must be of sufficient resolution and quality to support clinical diagnosis, and labels must be provided or verified by qualified dermatologists or through histopathological confirmation.
- Licensing Compliance: Only datasets published under licenses that explicitly permit commercial use, modification, and redistribution for the intended purpose are included.
- Diversity: Preference is given to datasets that provide demographic metadata (age, sex, Fitzpatrick skin type) and cover a broad range of diagnostic categories.
- Data Governance: Datasets must be de-identified and compliant with applicable data protection standards.
Selected Repositories
The following publicly available dermatological repositories with commercial-use licenses have been identified and will be utilized:
PAD-UFES-20 (Skin Lesion Dataset)
- Description: A smartphone-captured skin lesion dataset from Brazil with clinical images and patient metadata.
- Number of Images: 2,298 clinical images from 1,373 patients
- License: CC BY 4.0 (Creative Commons Attribution 4.0 International)
- License URL: https://creativecommons.org/licenses/by/4.0/
- Source: Federal University of Espírito Santo, Brazil
- Notes: Includes six diagnostic categories with metadata on patient age, skin lesion location, Fitzpatrick skin type, and diagnostic information.
Fitzpatrick17k
- Description: A dataset designed to improve dermatology AI across diverse skin tones, labeled with Fitzpatrick skin type.
- Number of Images: 16,577 clinical images
- License: CC BY 4.0 (Creative Commons Attribution 4.0 International)
- License URL: https://creativecommons.org/licenses/by/4.0/
- Source: Multiple online sources, curated by MIT
- Notes: Specifically annotated for Fitzpatrick skin type (I-VI), covering 114 skin conditions, essential for ensuring algorithmic fairness across diverse populations.
SD-198 (Skin Disease Dataset)
- Description: A comprehensive dermatological image dataset covering 198 different skin diseases.
- Number of Images: ~6,584 clinical images
- License: CC BY 4.0 (Creative Commons Attribution 4.0 International)
- License URL: https://creativecommons.org/licenses/by/4.0/
- Source: Sun Yat-sen Memorial Hospital, Sun Yat-sen University, China
- Notes: Provides wide diagnostic coverage across inflammatory, infectious, and neoplastic skin conditions with clinical photographs.
Diverse Dermatology Images (DDI)
- Description: A dataset specifically curated to address skin tone diversity in dermatology AI training.
- Number of Images: ~656 clinical images across diverse skin tones
- License: CC0 1.0 Universal (Public Domain Dedication)
- License URL: https://creativecommons.org/publicdomain/zero/1.0/
- Source: Stanford University
- Notes: Emphasizes representation of darker skin tones (Fitzpatrick types IV-VI), addressing a critical gap in dermatological AI training data.
SKINL2 (Skin Lesion Longitudinal Dataset)
- Description: A longitudinal dataset with multiple time-point images of skin lesions.
- Number of Images: ~1,000+ clinical and dermoscopic images
- License: CC BY 4.0 (Creative Commons Attribution 4.0 International)
- License URL: https://creativecommons.org/licenses/by/4.0/
- Source: Multiple clinical institutions
- Notes: Includes temporal progression data, useful for understanding lesion evolution and change detection.
Total Estimated Images from Retrospective Sources: >27,000 images
Justification for Repository Selection
The selection of these repositories ensures:
- Comprehensive Diagnostic Coverage: The combined datasets cover hundreds of distinct skin conditions across the full spectrum of ICD-11 dermatological categories, including both common and rare presentations.
- Multi-Modal Imaging: Inclusion of both clinical (macroscopic) and dermoscopic images ensures the model can learn from both imaging modalities, reflecting contemporary clinical workflows.
- Demographic Diversity: Datasets such as Fitzpatrick17k and PAD-UFES-20 provide explicit Fitzpatrick skin type annotations, enabling the development of algorithms that perform equitably across diverse patient populations, a key requirement for AI/ML-based medical devices under MDR.
- Geographic and Institutional Diversity: Data sourced from multiple continents (Europe, North America, South America) and institutions reduces overfitting to specific acquisition protocols or patient populations.
- Regulatory Alignment: All selected datasets are published under open licenses, ensuring full legal compliance and traceability of data provenance, as required for medical device technical documentation.
Ethical and Legal Considerations
All retrospective data acquisition and use activities are conducted in full compliance with ethical and legal requirements:
Licensing Compliance
- All data collection and usage strictly adhere to the specific terms, conditions, and restrictions of the Creative Commons or equivalent open licenses under which the public datasets were published.
- For each dataset, the license type, version, and URL are documented to ensure full traceability and auditability.
- Attribution requirements are fulfilled as specified in each license, and the absence of incompatible use restrictions (e.g., non-commercial or no-derivatives clauses) is verified before a dataset is included.
- License compatibility is verified to ensure that data from multiple sources can be legally combined and used for the intended commercial medical device application.
Data Protection and Privacy (GDPR Compliance)
- All source datasets are published as de-identified, with no personally identifiable information (PII) included.
- A mandatory de-identification verification step is included in the collection protocol (see Step 5: De-identification Verification) to confirm the absence of residual identifiers such as patient names, dates, medical record numbers, or embedded EXIF metadata containing identifying information.
- Any data found to contain residual personal identifiers will be excluded from the dataset or securely anonymized using validated techniques.
- All data processing activities by AI Labs Group S.L. are conducted in accordance with the EU General Data Protection Regulation (GDPR) and the organization's internal data protection and privacy policies.
- Data storage, access control, and processing environments comply with the security and privacy requirements outlined in the organization's Quality Management System (QMS).
Ethical Use of Publicly Available Data
- The use of publicly available datasets for AI/ML development is an established and ethically accepted practice in medical AI research, provided that data is used in accordance with its intended purpose and license.
- The original data contributors (researchers, institutions) are acknowledged, and the scientific community's norms of data sharing and reuse are respected.
- While retrospective, de-identified data does not require additional informed consent under GDPR, AI Labs Group S.L. commits to the responsible and transparent use of all training data.
Institutional and Regulatory Oversight
- This data collection protocol is part of the technical documentation for Legit.Health Plus and is subject to internal quality assurance and regulatory oversight.
- Any changes to the data sources, inclusion/exclusion criteria, or collection procedures will be documented and subject to change control procedures as defined in the QMS.
Inclusion Criteria
Images and cases will be included in the retrospective dataset if they meet all of the following criteria:
- Anatomical Scope: Images depict the epidermis, dermis, and associated cutaneous appendages (hair follicles, sebaceous glands, sweat glands) [cite: 119]. This includes skin lesions, rashes, eruptions, and other dermatological manifestations visible on the skin surface.
- Diagnostic Labeling: Each case is accompanied by a confirmed diagnosis, classified using a recognized diagnostic taxonomy (e.g., ICD-10, ICD-11, or a dermatology-specific coding system that can be mapped to ICD-11). The diagnosis must have been provided by a qualified medical expert (board-certified dermatologist or equivalent) or confirmed through histopathological analysis [cite: 442].
- Image Quality: Images are of sufficient diagnostic quality to be of clinical utility, meaning they are in focus, adequately lit, and free from major artifacts that would preclude diagnostic interpretation [cite: 687-689]. Minimum resolution and quality standards are defined in the Data Quality Assessment Criteria section.
- Modality: Both clinical (macroscopic) and dermoscopic (magnified, polarized/non-polarized) images are included [cite: 724, 725, 731], provided they meet quality and labeling standards.
- Licensing: The image and associated metadata are distributed under a license that permits commercial use, modification, and redistribution for the purpose of medical device development.
- De-identification: The data is fully de-identified, with no personally identifiable information present in the image files, filenames, or metadata.
Exclusion Criteria
Images and cases will be excluded from the retrospective dataset if they meet any of the following criteria:
- Insufficient Image Quality: Images that are out of focus, poorly lit, overexposed, underexposed, or contain significant motion blur, obstructions, or artifacts that would preclude reliable diagnostic interpretation [cite: 685, 686]. Specific quality thresholds (e.g., minimum resolution, sharpness scores) are defined in the Data Quality Assessment procedure.
- Inadequate Labeling: Cases with ambiguous, missing, conflicting, or unverified diagnostic labels [cite: 506, 507]. Images labeled only with non-specific terms (e.g., "rash," "lesion") without a specific diagnosis are excluded.
- Uncertain Licensing or Usage Rights: Images for which the usage rights are unclear, incompletely documented, or do not explicitly permit commercial use and modification for medical device development.
- Presence of Identifiable Information: Images or metadata containing residual personally identifiable information (e.g., patient names, faces, medical record numbers, dates of birth, tattoos with identifiable text, or EXIF metadata with GPS coordinates or timestamps that could be linked to individuals) that cannot be securely and completely removed through anonymization.
- Out-of-Scope Anatomy: Images depicting anatomical sites or conditions outside the intended use of Legit.Health Plus (e.g., oral mucosa, genital mucosa, ophthalmological conditions, unless explicitly within the device's intended use).
- Duplicate or Near-Duplicate Images: To prevent data leakage and inflated performance metrics, duplicate or near-duplicate images (e.g., multiple frames from the same lesion without meaningful variation) are identified and deduplicated, retaining only one representative image per unique case.
Study Design
This is a retrospective, multi-source, observational data collection protocol.
Study Type
- Retrospective: All data is pre-existing and has been previously collected and published by third-party institutions. No new patient data will be generated under this protocol.
- Multi-Source: Data is aggregated from multiple independent, geographically distributed datasets to maximize diversity and generalizability.
- Observational: The data reflects real-world clinical practice as it occurred, with no intervention or modification of clinical workflows.
Data Collection Timeline
- The initial data collection phase involves the acquisition and curation of the five primary datasets listed under Selected Repositories, with a target of >27,000 curated images.
- The collection process is ongoing and iterative: as new validated public datasets become available and meet the inclusion criteria, they may be added to the training corpus following the same rigorous evaluation and quality assurance procedures.
- All additions to the dataset are subject to version control, documentation, and change management procedures as defined in the QMS.
Dataset Partitioning Strategy
The retrospectively collected data will be partitioned into:
- Training Set: Used for model training and hyperparameter tuning (~70-80% of data).
- Validation Set: Used for model selection, hyperparameter optimization, and iterative performance evaluation during development (~10-15% of data).
- Test Set (Internal): A portion of the retrospective data may be reserved as an internal test set for initial performance benchmarking (~10-15% of data).
The partitioning strategy, including randomization procedures and stratification by key covariates (diagnosis, skin type, imaging modality), is detailed in the R-TF-028-002 AI/ML Development Plan.
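To illustrate the patient-level grouping that prevents data leakage between partitions, a minimal sketch of a grouped random split is given below. It assumes a pandas DataFrame with a hypothetical patient_id column; the authoritative randomization and stratification procedure remains the one defined in R-TF-028-002, and stratification by diagnosis, skin type, and modality would be layered on top of this grouping (e.g., with scikit-learn's StratifiedGroupKFold).

```python
# Minimal sketch of a grouped ~75/12.5/12.5 split; column names are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_dataset(df: pd.DataFrame, seed: int = 42):
    # Group by patient so that all images of one patient land in a single partition,
    # preventing leakage between training, validation, and test sets.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=seed)
    train_idx, rest_idx = next(gss.split(df, groups=df["patient_id"]))
    train, rest = df.iloc[train_idx], df.iloc[rest_idx]

    # Split the remaining ~25% evenly into validation and internal test sets.
    gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_idx, test_idx = next(gss2.split(rest, groups=rest["patient_id"]))
    return train, rest.iloc[val_idx], rest.iloc[test_idx]
```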
Original Image Acquisition Context
As the data is collected retrospectively from multiple public sources, there is no single, standardized acquisition protocol [cite: 448]. The images will have been captured using a variety of devices (e.g., different digital cameras, smartphones, dermatoscopes from multiple manufacturers), under diverse clinical settings (academic medical centers, community clinics, mobile health initiatives), and by operators with varying levels of experience [cite: 449, 450].
Rationale for Accepting Acquisition Variability:
This inherent variability in imaging equipment, acquisition settings, lighting conditions, and operator technique is not a limitation but rather a deliberate strength of the retrospective data collection strategy [cite: 448, 449]. By training on data with high real-world variability, the resulting AI/ML models are more likely to:
- Generalize effectively to new, unseen imaging devices and clinical environments.
- Be robust to variations in image quality, lighting, and patient positioning that occur in routine clinical practice.
- Avoid overfitting to the idiosyncrasies of a single institution, device, or acquisition protocol.
This approach aligns with best practices for developing AI/ML-based medical devices intended for broad clinical deployment across diverse healthcare settings.
Data Retrieval and Ingestion Procedure
The retrospective data collection and ingestion process follows a systematic, quality-controlled workflow:
Step 1: Source Identification and Evaluation
- Identify candidate public datasets and dermatological atlases through literature review, consultation with dermatological AI experts, and monitoring of established data repositories (e.g., Kaggle, Harvard Dataverse, institutional repositories).
- Evaluate each candidate dataset against the selection criteria defined under Data Sources and Recruitment Strategy.
- Document the evaluation outcome, including the dataset name, source, license type, number of images, diagnostic coverage, and demographic metadata availability.
- For datasets meeting the criteria, obtain formal access (download, API access, or data transfer agreement as applicable).
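As an illustration of how an evaluation outcome might be recorded, a hypothetical Python record populated with the PAD-UFES-20 details listed above is sketched here; the field names and placeholder values are assumptions, not a mandated schema.

```python
# Hypothetical structure for documenting a source evaluation outcome.
source_evaluation = {
    "dataset_name": "PAD-UFES-20",
    "source": "Federal University of Espírito Santo, Brazil",
    "license": {
        "type": "CC BY 4.0",
        "url": "https://creativecommons.org/licenses/by/4.0/",
    },
    "number_of_images": 2298,
    "number_of_patients": 1373,
    "diagnostic_coverage": "6 diagnostic categories",
    "demographic_metadata": ["age", "lesion_location", "fitzpatrick_skin_type"],
    "meets_selection_criteria": True,
    "evaluated_by": "JD-004",         # hypothetical role reference
    "evaluation_date": "YYYY-MM-DD",  # filled in at evaluation time
}
```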
Step 2: Secure Data Retrieval
- Download the dataset (images and associated metadata files) using secure, authenticated channels (HTTPS, SFTP, or equivalent).
- Transfer the data into a temporary staging area within AI Labs Group S.L.'s secure research environment, which is access-controlled and compliant with the organization's information security policies.
- Verify the integrity of the downloaded data using checksums (e.g., MD5, SHA-256) provided by the dataset publisher, if available.
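A minimal sketch of the integrity check is given below, assuming the publisher provides a SHA-256 digest; the archive path and digest value are placeholders.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large archives do not need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

archive = Path("staging/dataset_archive.zip")   # hypothetical staging location
published_digest = "<digest published by the dataset provider>"
if sha256_of(archive) != published_digest:
    raise ValueError("Integrity check failed - do not proceed with ingestion")
```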
Step 3: Initial Data Assessment
- Perform an initial exploratory data analysis to understand the dataset structure, file formats, metadata schema, and label distribution.
- Generate summary statistics (number of images, number of unique diagnoses, distribution of demographics, imaging modalities).
- Identify any data quality issues, missing labels, or inconsistencies requiring resolution.
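A minimal sketch of this exploratory assessment follows, assuming the dataset ships a metadata CSV; the path and column name ("diagnosis") are assumptions about the source schema.

```python
import pandas as pd

metadata = pd.read_csv("staging/dataset/metadata.csv")  # hypothetical path

print("Number of records:", len(metadata))
print("Unique diagnoses:", metadata["diagnosis"].nunique())
print("\nDiagnosis distribution:")
print(metadata["diagnosis"].value_counts())
print("\nMissing values per column:")
print(metadata.isna().sum())
```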
Step 4: Curation and Standardization
This critical step ensures data quality and consistency across all sources:
- Inclusion/Exclusion Filtering: Apply the Inclusion Criteria and Exclusion Criteria defined above to filter the dataset, retaining only images and cases that meet all requirements.
- Quality Assessment: Implement automated and manual quality control procedures:
- Automated: Check image resolution, file integrity, detect corrupted files, assess image sharpness/blur metrics.
- Manual: A sample of images from each dataset is reviewed by qualified personnel to confirm diagnostic quality and appropriateness.
- Diagnostic Label Standardization: Map all diagnostic labels to the ICD-11 classification system:
- Original datasets may use ICD-10, proprietary taxonomies, or free-text diagnoses.
- A qualified medical professional (dermatologist or equivalent) oversees the mapping process to ensure clinical accuracy.
- Ambiguous or obsolete terms are resolved through consultation with clinical experts.
- A mapping table is maintained for traceability and auditability (a sketch of applying such a table follows this list).
- Metadata Enrichment and Harmonization: Standardize metadata fields across datasets:
- Patient demographics: age, sex, Fitzpatrick skin type (where available).
- Lesion characteristics: anatomical location, lesion size, clinical presentation.
- Image metadata: modality (clinical/dermoscopic), device type (where available), acquisition date (for temporal analysis, if relevant).
- Create a unified metadata schema and populate it for all images.
- File Organization: Organize data into a consistent directory structure with standardized filenames and formats:
- Images: Convert to a standard format (e.g., JPEG, PNG) at a consistent color depth.
- Metadata: Consolidate into structured files (e.g., CSV, JSON) with a unified schema.
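A minimal sketch of how the clinician-maintained mapping table might be applied during curation is shown below; the file paths and column names (source_label, icd11_code, diagnosis) are hypothetical.

```python
import pandas as pd

# Clinician-reviewed mapping table: one row per source label, mapped to an ICD-11 code.
mapping = pd.read_csv("curation/label_mapping_icd11.csv")    # hypothetical path
metadata = pd.read_csv("curation/metadata_filtered.csv")     # hypothetical path

metadata = metadata.merge(
    mapping, how="left", left_on="diagnosis", right_on="source_label"
)

# Labels that could not be mapped are escalated for clinical review rather than
# being assigned automatically.
unmapped = metadata[metadata["icd11_code"].isna()]
unmapped.to_csv("curation/unmapped_labels_for_review.csv", index=False)
```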
Step 5: De-identification Verification
- Conduct a comprehensive review to ensure all data is fully de-identified:
- Automated Checks: Scan EXIF metadata for GPS coordinates, timestamps, camera owner names, and other potential identifiers; strip all EXIF data.
- Manual Review: Visually inspect a representative sample of images for faces, identifiable tattoos, names on patient gowns, visible medical record numbers, or other PII.
- Any data containing residual identifiers is either securely anonymized (e.g., face blurring, redaction) using validated techniques or excluded from the dataset.
- Document the de-identification verification process and outcome.
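A minimal sketch of the automated EXIF inspection and stripping step using Pillow is given below; directory names are hypothetical, and this does not replace the manual PII review described above.

```python
from pathlib import Path
from PIL import Image, ExifTags

def strip_exif(src: Path, dst: Path) -> list[str]:
    img = Image.open(src)
    # Record which EXIF tags were present (e.g. GPSInfo, DateTime) for the verification log.
    found = [ExifTags.TAGS.get(tag_id, str(tag_id)) for tag_id in img.getexif()]
    # Rebuild the image from raw pixel data so no metadata is carried over.
    clean = Image.new(img.mode, img.size)
    clean.putdata(list(img.getdata()))
    clean.save(dst)
    return found

out_dir = Path("deidentified/images")               # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)
for src in Path("curated/images").glob("*.jpg"):    # hypothetical input directory
    tags = strip_exif(src, out_dir / src.name)
    if "GPSInfo" in tags or "DateTime" in tags:
        print(f"{src.name}: potentially identifying EXIF tags removed: {sorted(tags)}")
```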
Step 6: Validation and Quality Assurance
- Perform a final validation step:
- Verify that all images open correctly and are not corrupted.
- Confirm that all metadata fields are populated and correctly formatted.
- Check for duplicates or near-duplicates within and across datasets using perceptual hashing or image similarity algorithms.
- Validate the consistency of diagnostic labels and metadata.
- Any issues identified are documented and resolved before proceeding.
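A minimal sketch of the near-duplicate check using perceptual hashing follows, assuming the third-party imagehash package; the directory and the Hamming-distance threshold of 4 are illustrative assumptions. Pairwise comparison is shown for clarity; an indexed approach would be used at scale.

```python
from pathlib import Path
from PIL import Image
import imagehash

seen: dict[Path, imagehash.ImageHash] = {}
for path in sorted(Path("deidentified/images").glob("*.jpg")):  # hypothetical directory
    h = imagehash.phash(Image.open(path))
    # Compare against previously seen hashes; a small Hamming distance suggests a
    # near-duplicate, which is flagged for manual review rather than auto-deleted.
    for seen_path, seen_hash in seen.items():
        if h - seen_hash <= 4:
            print(f"Possible duplicate: {path.name} ~ {seen_path.name}")
            break
    else:
        seen[path] = h
```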
Step 7: Ingestion into Main Development Database
- Ingest the curated, verified, and standardized data into the main AI/ML development database.
- The database is version-controlled to ensure traceability and reproducibility.
- Assign a unique identifier to each image and associated metadata record.
- Tag the data with the source dataset name, version, and ingestion date.
- The ingested data is prepared for partitioning into training, validation, and test sets as described in the R-TF-028-002 AI/ML Development Plan.
Step 8: Documentation and Traceability
- Maintain comprehensive records of all data sources, retrieval dates, version numbers, and processing steps.
- Document any exclusions, modifications, or quality issues encountered during the curation process.
- Ensure all documentation is retained as part of the technical documentation for regulatory purposes.
Data Quality Assessment Criteria
To ensure that only high-quality images suitable for training a medical device AI/ML model are included, the following quality assessment criteria are applied:
Image Quality Metrics:
- Resolution: Minimum resolution thresholds are defined based on imaging modality (e.g., ≥300×300 pixels for dermoscopic images, ≥600×600 pixels for clinical images).
- Sharpness: Automated blur detection algorithms assess image sharpness; images below a defined sharpness threshold are flagged for manual review or exclusion.
- Exposure: Images with significant overexposure or underexposure that obscure diagnostic features are excluded.
- Artifacts: Images with significant artifacts (e.g., hair obscuring the lesion, ruler obstructions, significant vignetting, compression artifacts) are excluded unless the diagnostic region is clearly visible and unaffected.
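A minimal sketch of how the automated checks above might be implemented with OpenCV is shown below; the resolution thresholds restate the criteria in this section, while the sharpness (Laplacian variance) and exposure cut-offs are illustrative assumptions pending validation.

```python
import cv2
import numpy as np

def passes_automated_checks(path: str, modality: str) -> bool:
    img = cv2.imread(path)
    if img is None:
        return False                                   # unreadable or corrupted file
    h, w = img.shape[:2]
    min_side = 300 if modality == "dermoscopic" else 600
    if h < min_side or w < min_side:
        return False                                   # below the minimum resolution
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    if cv2.Laplacian(gray, cv2.CV_64F).var() < 100.0:  # assumed sharpness threshold
        return False                                   # flagged as blurred
    mean_brightness = float(np.mean(gray))
    if mean_brightness < 30 or mean_brightness > 225:  # assumed exposure bounds
        return False                                   # over- or underexposed
    return True
```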
Label Quality:
- Specificity: Labels must specify a diagnosis at the disease level (e.g., "melanoma," "basal cell carcinoma") rather than non-specific descriptors (e.g., "lesion," "abnormal").
- Verification: Preference is given to diagnoses confirmed by histopathology or consensus expert review.
- Consistency: Labels are cross-checked for internal consistency (e.g., the same lesion should not have conflicting diagnoses in different metadata fields).
Metadata Completeness:
- Essential metadata (diagnosis, imaging modality) must be present for all included images.
- Demographic metadata (age, sex, skin type) is highly desirable but not mandatory; missingness is documented and considered in model evaluation.
Collected Data Specification
- Image files (e.g., JPG, PNG, DICOM).
- Metadata files (e.g., CSV, JSON) containing the ground truth diagnosis, and where available, patient demographics (age, sex, phototype) and other relevant clinical information.
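For illustration, a hypothetical harmonized metadata record for one ingested image is sketched below; the field names and values are assumptions, not a mandated schema.

```python
import json

record = {
    "image_id": "RETRO-0001234",          # unique identifier assigned at ingestion
    "source_dataset": "Fitzpatrick17k",
    "source_dataset_version": "1.0",      # hypothetical version tag
    "ingestion_date": "YYYY-MM-DD",
    "modality": "clinical",
    "diagnosis_icd11": "<ICD-11 code>",   # from the clinician-reviewed mapping table
    "age": 54,
    "sex": "female",
    "fitzpatrick_skin_type": "IV",
    "anatomical_location": "forearm",
    "license": "CC BY 4.0",
}
print(json.dumps(record, indent=2))
```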
Other Specifications
- No restrictions are placed on the make or model of camera or dermatoscope used in the original acquisition, in order to preserve real-world diversity [cite: 449, 450].
- No restrictions are placed on the operator who performed the original examination, provided the resulting data meets the quality and inclusion criteria.
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001