R-TF-028-003 Data Collection Instructions - Archive Data

Table of contents

Purpose and Scope
Context and Rationale
Objectives
Data Population Characteristics
Study Design
Other Specifications
References

Purpose and Scope

This document defines the systematic protocol for the retrospective collection of dermatological images and associated clinical metadata from medical archives. This protocol forms part of the data acquisition strategy for the development and validation of the AI algorithms integrated into the device.

The retrospective dataset serves as the foundational training corpus, providing the scale, diversity, and clinical breadth necessary to develop robust, generalizable, and clinically safe AI models for dermatological disease classification and assessment. In addition, some sources with more robust diagnosis confirmation are reserved for sequestered testing to provide an unbiased evaluation of model performance.

Context and Rationale

The development of high-performing, safe, and effective AI algorithms with the goal of enhancing efficiency and accuracy of care delivery, is critically dependent on the quality, diversity, and scale of the training and testing data [1-3]. To build models that generalize across real-world clinical scenarios, it is essential to source data from a wide variety of contexts, patient populations, imaging conditions, and clinical settings [4-6].

Retrospective data collection from archives offers several key advantages [7]:

Scale: Access to large volumes of expertly annotated dermatological images that would be impractical to collect prospectively within a reasonable timeframe.
Diversity: Images captured across multiple institutions, geographic regions, patient demographics, and clinical contexts, enhancing model robustness and reducing institutional bias.
Expert Validation: Many public repositories and private databases include diagnoses confirmed by histopathological analysis or consensus expert opinion, providing high-quality ground truth labels [8].

This approach ensures the creation of a comprehensive foundational dataset that is broad in scope, clinically representative, and suitable for training and testing AI models intended for deployment in diverse real-world clinical settings.

Objectives

The primary objectives of this archive data collection protocol are:

Scale and Heterogeneity: To gather a large-scale, heterogeneous dataset of dermatological images suitable for the training, validation, and testing of the AI algorithms in Legit.Health Plus, with a target of at least >100,000 curated images from multiple validated sources.
Clinical Representativeness: To ensure the dataset is representative of the intended patient population and use environment, covering a wide spectrum of ICD-11 diagnostic categories, patient demographics (age, sex), and all six Fitzpatrick skin phototypes [9-11], thereby supporting algorithmic fairness and generalizability.
Ground Truth Establishment: To acquire expertly validated diagnostic labels and associated clinical metadata to establish a reliable ground truth for each image, enabling supervised learning and robust, clinically meaningful performance evaluation [12].
Imaging Modality Coverage: To include both clinical (macroscopic) and dermoscopic images, reflecting the multi-modal nature of contemporary dermatological practice [13-15].
Regulatory Compliance: To execute all data acquisition, processing, and usage activities in full compliance with applicable data protection regulations (GDPR), intellectual property laws, and the licensing terms of the source datasets, ensuring traceability and auditability as required under MDR 2017/745.

Data Population Characteristics

Data Sources and Recruitment Strategy

Data will be sourced retrospectively from private healthcare institutions and publicly available medical datasets and validated online dermatological atlases [7]. The data sources have been selected based on the following criteria:

Diagnostic Quality: Images must be of sufficient resolution and quality to support clinical diagnosis, and labels must be provided or verified by qualified dermatologists or through histopathological confirmation.
Licensing Compliance: Only datasets published under licenses that explicitly permit commercial use, modification, and redistribution for the intended purpose are included.
Diversity: Preference is given to datasets that provide demographic metadata (age, sex, Fitzpatrick skin type) and cover a broad range of diagnostic categories.
Data Governance: Datasets must be de-identified and compliant with applicable data protection standards.

Archive Selection and Documentation

All selected dermatological image archives must be documented in the technical file, including:

Dataset Name and Version: Official name and version number of the dataset.
Source and Access Information: Publisher, institution, repository URL, and access method (public download, API, data transfer agreement).
License Type: Specific license (e.g., Creative Commons CC-BY 4.0, CC0) with version and URL to license terms.
Sample Size: Total number of images available and number of images meeting inclusion criteria after curation.
Diagnostic Coverage: Range of diagnoses included, mapped to ICD-11 categories.
Demographic Metadata: Availability of age, sex, and Fitzpatrick skin type annotations.
Imaging Modality: Clinical (macroscopic), dermoscopic, or both.
Quality and Validation: Description of diagnostic validation method (histopathology, expert consensus, etc.).

A comprehensive list of all datasets utilized in model development, along with their detailed characteristics, is maintained in the technical documentation and updated as new data sources are added.

Justification for Repository Selection

The selection of these repositories ensures:

Comprehensive Diagnostic Coverage: The combined datasets cover hundreds of distinct skin conditions across the full spectrum of ICD-11 dermatological categories, including both common and rare presentations.
Multi-Modal Imaging: Inclusion of both clinical (macroscopic) and dermoscopic images ensures the model can learn from both imaging modalities, reflecting contemporary clinical workflows.
Demographic Diversity: Datasets such as Fitzpatrick17k and PAD-UFES-20 provide explicit Fitzpatrick skin type annotations, enabling the development of algorithms that perform equitably across diverse patient populations, a key requirement for AI-based medical devices under MDR.
Geographic and Institutional Diversity: Data sourced from multiple continents (Europe, North America, South America) and institutions reduces overfitting to specific acquisition protocols or patient populations.
Regulatory Alignment: All selected datasets are published under open licenses, ensuring full legal compliance and traceability of data provenance, as required for medical device technical documentation.

Ethical and Legal Considerations

All retrospective data acquisition and use activities are conducted in full compliance with ethical and legal requirements:

Licensing Compliance

All data collection and usage strictly adhere to the specific terms, conditions, and restrictions of the Creative Commons or equivalent open licenses under which the public datasets were published.
For each dataset, the license type, version, and URL are documented to ensure full traceability and auditability.
Attribution requirements are fulfilled as specified in each license, and any use restrictions (e.g., non-commercial, no-derivatives) are respected.
License compatibility is verified to ensure that data from multiple sources can be legally combined and used for the intended commercial medical device application.

All source datasets are published as de-identified, with no personally identifiable information (PII) included.
A mandatory de-identification verification step is included in the collection protocol (Section 6.4) to confirm the absence of residual identifiers such as patient names, dates, medical record numbers, or embedded EXIF metadata containing identifying information.
Any data found to contain residual personal identifiers will be excluded from the dataset or securely anonymized using validated techniques.
All data processing activities by AI Labs Group S.L. are conducted in accordance with the EU General Data Protection Regulation (GDPR) and the organization's internal data protection and privacy policies.
Data storage, access control, and processing environments comply with the security and privacy requirements outlined in the organization's Quality Management System (QMS).

Ethical Use of Publicly Available Data

The use of publicly available datasets for AI development is an established and ethically accepted practice in medical AI research, provided that data is used in accordance with its intended purpose and license.
The original data contributors (researchers, institutions) are acknowledged, and the scientific community's norms of data sharing and reuse are respected.
While retrospective, de-identified data does not require additional informed consent under GDPR, AI Labs Group S.L. commits to the responsible and transparent use of all training data.

Institutional and Regulatory Oversight

This data collection protocol is part of the technical documentation for Legit.Health Plus and is subject to internal quality assurance and regulatory oversight.
Any changes to the data sources, inclusion/exclusion criteria, or collection procedures will be documented and subject to change control procedures as defined in the QMS.

Inclusion Criteria

Images and cases will be included in the retrospective dataset if they meet all of the following criteria:

Anatomical Scope: Images depict the epidermis, dermis, and associated cutaneous appendages (hair follicles, sebaceous glands, sweat glands) [16]. This includes skin lesions, rashes, eruptions, and other dermatological manifestations visible on the skin surface.
Diagnostic Labeling: Each case is accompanied by a confirmed diagnosis, classified using a recognized diagnostic taxonomy (e.g., ICD-10, ICD-11, or a dermatology-specific coding system that can be mapped to ICD-11). The diagnosis must have been provided by a qualified medical expert (board-certified dermatologist or equivalent) or confirmed through histopathological analysis [8].
Image Quality: Images are of sufficient diagnostic quality to be of clinical utility, meaning they are in focus, adequately lit, and free from major artifacts that would preclude diagnostic interpretation [17-19]. Minimum resolution and quality standards are defined in the Data Quality Assessment procedure (Section 6.3).
Modality: Both clinical (macroscopic) and dermoscopic (magnified, polarized/non-polarized) images are included [13-15], provided they meet quality and labeling standards.
Licensing: The image and associated metadata are distributed under a license that permits commercial use, modification, and redistribution for the purpose of medical device development.
De-identification: The data is fully de-identified, with no personally identifiable information present in the image files, filenames, or metadata.

Exclusion Criteria

Images and cases will be excluded from the retrospective dataset if they meet any of the following criteria:

Insufficient Image Quality: Images that are out of focus, poorly lit, overexposed, underexposed, or contain significant motion blur, obstructions, or artifacts that would preclude reliable diagnostic interpretation [20, 21]. Specific quality thresholds (e.g., minimum resolution, sharpness scores) are defined in the Data Quality Assessment procedure.
Inadequate Labeling: Cases with ambiguous, missing, conflicting, or unverified diagnostic labels [22, 23]. Images labeled only with non-specific terms (e.g., "rash," "lesion") without a specific diagnosis are excluded.
Uncertain Licensing or Usage Rights: Images for which the usage rights are unclear, incompletely documented, or do not explicitly permit commercial use and modification for medical device development.
Presence of Identifiable Information: Images or metadata containing residual personally identifiable information (e.g., patient names, faces, medical record numbers, dates of birth, tattoos with identifiable text, or EXIF metadata with GPS coordinates or timestamps that could be linked to individuals) that cannot be securely and completely removed through anonymization.
Out-of-Scope Anatomy: Images depicting anatomical sites or conditions outside the intended use of Legit.Health Plus (e.g., oral mucosa, genital mucosa, ophthalmological conditions, unless explicitly within the device's intended use).
Duplicate or Near-Duplicate Images: To prevent data leakage and inflated performance metrics, duplicate or near-duplicate images (e.g., multiple frames from the same lesion without meaningful variation) are identified and deduplicated, retaining only one representative image per unique case.

Study Design

This is a retrospective, multi-source, observational data collection protocol.

Study Type

Retrospective: All data is pre-existing and has been previously collected and published by third-party institutions. No new patient data will be generated under this protocol.
Multi-Source: Data is aggregated from multiple independent, geographically distributed datasets to maximize diversity and generalizability.
Observational: The data reflects real-world clinical practice as it occurred, with no intervention or modification of clinical workflows.

Data Collection Timeline

The initial data collection phase involves the acquisition and curation of the five primary datasets listed in Section 4.1.1, with a target of >100,000 curated images.
All additions to the dataset are subject to version control, documentation, and change management procedures as defined in the QMS.

Original Image Acquisition Context

As the data is collected retrospectively from multiple public sources, there is no single, standardized acquisition protocol [4]. The images will have been captured using a variety of devices (e.g., different digital cameras, smartphones, dermatoscopes from multiple manufacturers), under diverse clinical settings (academic medical centers, community clinics, mobile health initiatives), and by operators with varying levels of experience [5, 6].

Rationale for Accepting Acquisition Variability:

This inherent variability in imaging equipment, acquisition settings, lighting conditions, and operator technique is not a limitation but rather a deliberate strength of the retrospective data collection strategy [4-6]. By training on data with high real-world variability, the resulting AI models are more likely to:

Generalize effectively to new, unseen imaging devices and clinical environments.
Be robust to variations in image quality, lighting, and patient positioning that occur in routine clinical practice.
Avoid overfitting to the idiosyncrasies of a single institution, device, or acquisition protocol.

This approach aligns with best practices for developing AI-based medical devices intended for broad clinical deployment across diverse healthcare settings.

Data Retrieval and Ingestion Procedure

The retrospective data collection and ingestion process follows a systematic, quality-controlled workflow:

Source Identification and Evaluation

Identify candidate dermatological image archives through literature review, consultation with dermatological AI experts, healthcare institutions, and monitoring of established data repositories (e.g., Kaggle, Harvard Dataverse, institutional repositories).
Evaluate each candidate dataset against the selection criteria defined in Section 4.1.
Document the evaluation outcome, including the dataset name, source, license type, number of images, diagnostic coverage, and demographic metadata availability.
For datasets meeting the criteria, obtain formal access (download, API access, or data transfer agreement as applicable).

Secure Data Retrieval

Download the dataset (images and associated metadata files) using secure, authenticated channels (HTTPS, SFTP, or equivalent).
Transfer the data into a temporary staging area within AI Labs Group S.L.'s secure research environment, which is access-controlled and compliant with the organization's information security policies.
Verify the integrity of the downloaded data using checksums (e.g., MD5, SHA-256) provided by the dataset publisher, if available.

Initial Data Assessment

Perform an initial exploratory data analysis to understand the dataset structure, file formats, metadata schema, and label distribution.
Generate summary statistics (number of images, number of unique diagnoses, distribution of demographics, imaging modalities).
Identify any data quality issues, missing labels, or inconsistencies requiring resolution.

Curation and Standardization

This critical step ensures data quality and consistency across all sources:

Inclusion/Exclusion Filtering: Apply the criteria defined in Sections 4.3 and 4.4 to filter the dataset, retaining only images and cases that meet all requirements.
Quality Assessment: Implement automated and manual quality control procedures:
- Automated: Check image resolution, file integrity, detect corrupted files, assess image sharpness/blur metrics.
- Manual: A sample of images from each dataset is reviewed by qualified personnel to confirm diagnostic quality and appropriateness.
Metadata Enrichment and Harmonization: Standardize metadata fields across datasets:
- Patient demographics: age, sex, Fitzpatrick skin type (where available).
- Lesion characteristics: anatomical location, lesion size, clinical presentation.
- Image metadata: modality (clinical/dermoscopic), device type (where available), acquisition date (for temporal analysis, if relevant).
- Create a unified metadata schema and populate it for all images.
File Organization: Organize data into a consistent directory structure with standardized filenames and formats:
- Images: Convert to a standard format (e.g., JPEG, PNG) at a consistent color depth.
- Metadata: Consolidate into structured files (e.g., CSV, JSON) with a unified schema.

De-identification Verification

Conduct a comprehensive review to ensure all data is fully de-identified:
- Automated Checks: Scan EXIF metadata for GPS coordinates, timestamps, camera owner names, and other potential identifiers; strip all EXIF data.
- Manual Review: Visually inspect a representative sample of images for faces, identifiable tattoos, names on patient gowns, visible medical record numbers, or other PII.
- Any data containing residual identifiers is either securely anonymized (e.g., face blurring, redaction) using validated techniques or excluded from the dataset.
Document the de-identification verification process and outcome.

Validation and Quality Assurance

Perform a final validation step:
- Verify that all images open correctly and are not corrupted.
- Confirm that all metadata fields are populated and correctly formatted.
- Check for duplicates or near-duplicates within and across datasets using perceptual hashing or image similarity algorithms.
- Validate the consistency of diagnostic labels and metadata.
Any issues identified are documented and resolved before proceeding.

Ingestion into Main Development Database

Ingest the curated, verified, and standardized data into the main AI development database.
The database is version-controlled to ensure traceability and reproducibility.
Assign a unique identifier to each image and associated metadata record.
Tag the data with the source dataset name, version, and ingestion date.
The ingested data is prepared for partitioning into training, validation, and test sets as described in the R-TF-028-002 AI Development Plan.

Documentation and Traceability

Maintain comprehensive records of all data sources, retrieval dates, version numbers, and processing steps.
Document any exclusions, modifications, or quality issues encountered during the curation process.
Ensure all documentation is retained as part of the technical documentation for regulatory purposes.

Data Quality Assessment Criteria

To ensure that only high-quality images suitable for training a medical device AI model are included, the following quality assessment criteria are applied:

Image Quality Metrics:

Resolution: Minimum resolution thresholds are defined based on imaging modality (e.g., ≥200×200 pixels for dermoscopic images, ≥400×400 pixels for clinical images).
Artifacts: Images with significant artifacts (e.g., hair obscuring the lesion, ruler obstructions, significant vignetting, compression artifacts) are excluded unless the diagnostic region is clearly visible and unaffected.

Label Quality:

Specificity: Labels must specify a diagnosis at the disease level (e.g., "melanoma," "basal cell carcinoma") rather than non-specific descriptors (e.g., "lesion," "abnormal").
Verification: Preference is given to diagnoses confirmed by histopathology or consensus expert review.
Consistency: Labels are cross-checked for internal consistency (e.g., the same lesion should not have conflicting diagnoses in different metadata fields).

Metadata Completeness:

Essential metadata (diagnosis, imaging modality) must be present for all included images.
Demographic metadata (age, sex, skin type) is highly desirable but not mandatory; missingness is documented and considered in model evaluation.

Collected Data Specification

Image files (e.g., JPG, PNG, DICOM).
Metadata files (e.g., CSV, JSON) containing the ground truth diagnosis, and where available, patient demographics (age, sex, phototype) and other relevant clinical information.

Other Specifications

No specific conditions are applied to the specific make or model of camera or dermatoscope used in the original acquisition to ensure real-world diversity [5, 6].
No specific conditions are applied regarding the operator who performed the original examination, provided the resulting data meets the quality and inclusion criteria.

References

Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056
Liu Y, Jain A, Eng C, et al. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26(6):900-908. doi:10.1038/s41591-020-0842-3
Daneshjou R, Smith MP, Sun MD, Rotemberg V, Zou J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: a scoping review. JAMA Dermatol. 2021;157(11):1362-1369. doi:10.1001/jamadermatol.2021.3129
Winkler JK, Fink C, Toberer F, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 2019;155(10):1135-1141. doi:10.1001/jamadermatol.2019.1735
Combalia M, Codella NC, Rotemberg V, et al. BCN20000: Dermoscopic lesions in the wild. arXiv:1908.02288 [cs.CV]. 2019.
Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. doi:10.1038/sdata.2018.161
Haenssle HA, Fink C, Schneiderbauer R, et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann Oncol. 2018;29(8):1836-1842. doi:10.1093/annonc/mdy166
Brinker TJ, Hekler A, Enk AH, et al. Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. Eur J Cancer. 2019;113:47-54. doi:10.1016/j.ejca.2019.04.001
Fitzpatrick TB. The validity and practicality of sun-reactive skin types I through VI. Arch Dermatol. 1988;124(6):869-871. doi:10.1001/archderm.124.6.869
Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 2018;154(11):1247-1248. doi:10.1001/jamadermatol.2018.2348
Groh M, Harris C, Soenksen LR, et al. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021:1820-1828.
Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317-1318. doi:10.1001/jama.2017.18391
Argenziano G, Soyer HP, Chimenti S, et al. Dermoscopy of pigmented skin lesions: results of a consensus meeting via the Internet. J Am Acad Dermatol. 2003;48(5):679-693. doi:10.1067/mjd.2003.281
Kittler H, Pehamberger H, Wolff K, Binder M. Diagnostic accuracy of dermoscopy. Lancet Oncol. 2002;3(3):159-165. doi:10.1016/s1470-2045(02)00679-4
Vestergaard ME, Macaskill P, Holt PE, Menzies SW. Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting. Br J Dermatol. 2008;159(3):669-676. doi:10.1111/j.1365-2133.2008.08713.x
Anatomy and physiology of the skin. In: Wolff K, Goldsmith LA, Katz SI, Gilchrest BA, Paller AS, Leffell DJ, eds. Fitzpatrick's Dermatology in General Medicine. 7th ed. McGraw-Hill; 2008:57-73.
Pacheco AG, Lima GR, Salomão AS, et al. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief. 2020;32:106221. doi:10.1016/j.dib.2020.106221
Cassidy B, Kendrick C, Brodzicki A, Jaworek-Korjakowska J, Yap MH. Analysis of the ISIC image datasets: Usage, benchmarks and recommendations. Med Image Anal. 2022;75:102305. doi:10.1016/j.media.2021.102305
Bozorgtabar B, Mahapatra D, Vray G, Thiran JP. SALAD: Self-supervised aggregation learning for anomaly detection on X-Rays. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Lecture Notes in Computer Science. 2020;12261:468-478. doi:10.1007/978-3-030-59710-8_46
Halpern AC, Marghoob AA, Reiter O, et al. Remedying the lack of diversity in dermatology datasets: A framework for addressing performance gaps in machine learning. J Am Acad Dermatol. 2021;84(4):1143-1145. doi:10.1016/j.jaad.2020.09.069
Hosny KM, Kassem MA, Foaud MM. Skin melanoma classification using ROI and data augmentation with deep convolutional neural networks. Multimed Tools Appl. 2020;79(33-34):24029-24055. doi:10.1007/s11042-020-09067-2
Codella NC, Gutman D, Celebi ME, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). 2018:168-172. doi:10.1109/ISBI.2018.8363547
Mendonça T, Ferreira PM, Marques JS, Marcal AR, Rozeira J. PH² - A dermoscopic image database for research and benchmarking. Conf Proc IEEE Eng Med Biol Soc. 2013;2013:5437-5440. doi:10.1109/EMBC.2013.6610779

Signature meaning

The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:

Author: Team members involved
Reviewer: JD-003, JD-004
Approver: JD-001

Purpose and Scope​

Context and Rationale​

Objectives​

Data Population Characteristics​

Data Sources and Recruitment Strategy​

Archive Selection and Documentation​

Justification for Repository Selection​

Ethical and Legal Considerations​

Licensing Compliance​

Data Protection and Privacy (GDPR Compliance)​

Ethical Use of Publicly Available Data​

Institutional and Regulatory Oversight​

Inclusion Criteria​

Exclusion Criteria​

Study Design​

Study Type​

Data Collection Timeline​

Original Image Acquisition Context​

Data Retrieval and Ingestion Procedure​

Source Identification and Evaluation​

Secure Data Retrieval​

Initial Data Assessment​

Curation and Standardization​

De-identification Verification​

Validation and Quality Assurance​

Ingestion into Main Development Database​

Documentation and Traceability​

Data Quality Assessment Criteria​

Collected Data Specification​

Other Specifications​

References​