GP-031 Training Data Governance
Procedure Flowchart
Purpose
This procedure establishes the process for evaluating, approving, and documenting the legal, regulatory, and ethical basis for using datasets to train the AI models that power the device. It ensures that every dataset used for AI training has been assessed for copyright compliance, data protection requirements, and medical device regulatory obligations before it is incorporated into any training pipeline.
Team members regularly identify candidate datasets for AI training, and each dataset must be evaluated against a complex landscape of EU copyright law, GDPR, the Medical Device Regulation (MDR), and the AI Act before it can be used. This procedure provides a systematic, auditable framework for those evaluations.
Scope
This procedure applies to all datasets considered for use in training, validating, or testing the AI models developed by the company, including:
- Publicly available datasets (Kaggle, Hugging Face, RoboFlow, GitHub, academic repositories)
- Datasets obtained through clinical partnerships or data sharing agreements
- Datasets created internally from proprietary data
- Datasets derived from web scraping or automated collection
- Synthetic datasets generated from other data sources
- Pre-trained model weights or foundation models trained on third-party data
Out of scope: Datasets used solely for internal exploratory research that will never be used to train, validate, or test any model deployed in the device. However, if there is any possibility that research artifacts (weights, features, embeddings) may flow into production, this procedure applies.
Reference Documents
| Reference | Description |
|---|---|
| Directive (EU) 2019/790 | Copyright in the Digital Single Market (DSM Directive), Articles 3 and 4 (TDM exceptions) |
| Regulation (EU) 2024/1689 | EU Artificial Intelligence Act, Articles 10, 53 |
| Regulation (EU) 2016/679 | General Data Protection Regulation (GDPR), Articles 6, 9, 35, 89 |
| Regulation (EU) 2017/745 | Medical Device Regulation (MDR), Annex I, Annex XIV |
| Real Decreto-ley 24/2021 | Spain's transposition of the DSM Directive |
| MDCG 2025-6 (AIB 2025-1) | Guidance on AI Act / MDR interplay for AI medical devices |
| GPAI Code of Practice (July 2025) | EU AI Office code of practice for GPAI models |
| GP-028 | AI Development |
| GP-013 | Risk Management |
| GP-050 | Data Protection |
| GP-052 | Data Privacy Impact Assessment (DPIA) |
Terms and Definitions
- Text and Data Mining (TDM): Any automated analytical technique aimed at analysing text and data in digital form in order to generate information, including but not limited to patterns, trends, and correlations (DSM Directive, Art. 2(2)).
- TDM Opt-Out: A machine-readable reservation of rights by a rightsholder under Art. 4(3) of the DSM Directive, expressed through robots.txt, TDMRep protocol, HTML meta tags, or HTTP headers.
- Lawful Access: Access to content obtained without circumventing technical protection measures and without bypassing a paywall; paywalled content counts as lawfully accessed only if the user has paid for or been granted access.
- Machine-Readable Opt-Out: A TDM reservation expressed in a format that can be automatically detected by software, such as robots.txt directives, TDMRep headers, or HTML meta tags.
- Anonymization: Irreversible de-identification of personal data such that no means reasonably likely to be used can re-identify the data subject. Takes data outside GDPR scope.
- Pseudonymization: Replacement of direct identifiers with codes, where re-identification remains possible with the key. Data remains subject to GDPR.
- Dataset Provenance: The complete chain of custody and origin of a dataset, including original source, license history, and any re-uploads or modifications.
- CC-BY-4.0: Creative Commons Attribution 4.0 International license, which permits commercial use with attribution.
- CC-BY-NC: Creative Commons Attribution-NonCommercial license, which does not permit commercial use under the license terms alone.
Responsibilities
JD-009 (AI Team)
- Identify candidate datasets for AI training.
- Initiate dataset evaluation requests by providing dataset URLs, license information, and intended use.
- Implement approved dataset usage conditions (attribution, retention limits, segregation).
- Archive evidence of lawful access at the time of data collection (screenshots, robots.txt copies, license snapshots).
JD-003
- Perform the legal and regulatory analysis for each dataset evaluation request.
- Execute the 4-step evaluation framework (license check, TDM opt-out check, GDPR assessment, MDR/AI Act documentation).
- Author the Dataset Usage Evaluation Report (R-031-001).
- Monitor changes in the legal landscape and update this procedure when necessary.
JD-004
- Review the Dataset Usage Evaluation Report for completeness and compliance with QMS requirements.
JD-001
- Approve or reject the dataset usage based on the evaluation report.
- Accept any residual legal or regulatory risk associated with dataset usage.
Detailed Process
Overview
When any team member identifies a dataset that could be used for AI training, validation, or testing, the following 4-step evaluation framework must be completed before the dataset is downloaded or used. The results are documented in a Dataset Usage Evaluation Report (R-031-001).
Step 1: License Check
For each dataset, identify and document:
- The stated license on the platform where the dataset is found (e.g., Kaggle, RoboFlow, Hugging Face, GitHub).
- The original source license, if the dataset has been re-uploaded to a different platform. Published audits have found licensing errors in more than 50% of datasets on major hosting platforms, so always trace back to the original source.
- The governing license, which is the most restrictive license in the provenance chain.
| License | Commercial Training? | TDM Override? | Action Required |
|---|---|---|---|
| CC-BY-4.0 | Yes | Yes (additional basis) | Provide attribution |
| CC-BY-NC-4.0 | No (under license) | Yes, if no opt-out | Proceed to Step 2 |
| CC-BY-SA-4.0 | Yes | Yes | Attribution; assess SA implications if sharing model |
| CC-BY-NC-SA-4.0 | No (under license) | Yes, if no opt-out | Proceed to Step 2 |
| Apache 2.0 | Yes | N/A | Include license notice |
| MIT | Yes | N/A | Include copyright notice |
| Custom academic | Case by case | Possibly | Legal review required |
| No license stated | Unclear | Yes, if no opt-out | Proceed to Step 2; high risk |
Re-uploaded datasets: When a dataset appears on a platform (Kaggle, RoboFlow, Hugging Face) with a permissive license but originates from a source with a more restrictive license, the re-uploader may have violated the original license. In these cases:
- Always trace back to the original source and check its license.
- If the original is restrictive, treat it as the governing license.
- Rely on Art. 4 TDM as the legal basis (not the incorrectly re-applied license).
- Document the provenance chain thoroughly in the evaluation report.
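The "most restrictive license governs" rule can be sketched as a simple ranking lookup. This is a minimal illustration only: the restrictiveness ordering below is an assumption for the example, and the actual assessment is performed case by case by JD-003, not automated.

```python
# Sketch: pick the governing license from a provenance chain.
# The ranking is a simplified assumption for illustration; actual
# restrictiveness is assessed case by case by JD-003.
RESTRICTIVENESS = {
    "MIT": 1,
    "Apache-2.0": 1,
    "CC-BY-4.0": 2,
    "CC-BY-SA-4.0": 3,
    "CC-BY-NC-4.0": 4,
    "CC-BY-NC-SA-4.0": 5,
    "unknown": 6,  # no stated license: treat as most restrictive
}

def governing_license(provenance_chain: list[str]) -> str:
    """Return the most restrictive license found in the chain
    (original source first, re-uploads after)."""
    return max(
        provenance_chain,
        key=lambda lic: RESTRICTIVENESS.get(lic, RESTRICTIVENESS["unknown"]),
    )

# A re-upload claims CC-BY-4.0, but the original source is NC:
assert governing_license(["CC-BY-NC-4.0", "CC-BY-4.0"]) == "CC-BY-NC-4.0"
```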
Step 2: TDM Opt-Out Check
This step is required when the dataset's license does not permit commercial use (e.g., CC-BY-NC) or when the license is unclear.
Legal basis: Under Article 4 of the DSM Directive, anyone with lawful access may perform TDM for any purpose, including commercial, unless the rightsholder has opted out via machine-readable means. This is confirmed by:
- The Hamburg Higher Regional Court (OLG Hamburg, 5 U 104/24, December 2025), which held that human-readable disclaimers alone are insufficient for a valid opt-out.
- The EU AI Act, Art. 53(1)(c), which explicitly references Art. 4(3) of the DSM Directive.
- Creative Commons' own guidance, which states that statutory exceptions override CC license terms.
For each dataset source URL, check:
- robots.txt: Visit the source domain's `/robots.txt` and check for directives blocking AI/TDM crawlers.
- TDMRep protocol: Check for TDM Reservation Protocol headers or metadata.
- HTML meta tags: Check for meta tags signaling TDM reservation.
- Terms of service: Review ToS for TDM-related clauses (relevant as evidence, though human-readable ToS alone are insufficient for a valid opt-out).
- Archive evidence: Take timestamped screenshots and save copies of robots.txt, HTTP headers, and any relevant metadata at the time of access.
If a machine-readable opt-out is found: The dataset cannot be used under TDM. Either seek a commercial license from the rightsholder or use an alternative dataset.
If no machine-readable opt-out is found: Art. 4 TDM permits the use regardless of the license's non-commercial restriction. Document the absence of opt-out and proceed to Step 3.
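The machine-readable checks above can be partially automated with the Python standard library. The sketch below parses a robots.txt body for directives blocking known AI crawlers and scans HTML for a TDMRep-style reservation meta tag. The crawler tokens listed are an illustrative, non-exhaustive assumption, and rightsholders express reservations in varied ways, so this supplements rather than replaces manual review and evidence archiving.

```python
from html.parser import HTMLParser
from urllib import robotparser

# Illustrative, non-exhaustive list of AI/TDM crawler tokens.
AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def robots_blocks_ai(robots_txt: str, url: str) -> bool:
    """True if any known AI crawler is disallowed from `url`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return any(not rp.can_fetch(ua, url) for ua in AI_USER_AGENTS)

class TDMMetaScanner(HTMLParser):
    """Flags <meta name="tdm-reservation" content="1"> (TDMRep convention)."""
    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name") == "tdm-reservation" and d.get("content") == "1":
            self.reserved = True

robots = "User-agent: GPTBot\nDisallow: /\n"
assert robots_blocks_ai(robots, "https://example.org/dataset/")

scanner = TDMMetaScanner()
scanner.feed('<html><head><meta name="tdm-reservation" content="1"></head></html>')
assert scanner.reserved
```

A positive result from either check means a machine-readable opt-out may exist and the source must be escalated to JD-003 before use.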
Step 3: GDPR Assessment
This step is required for all datasets, regardless of license. It is especially critical for datasets containing images of people.
Evaluate:
- Identifiable features: Do images contain faces, tattoos, distinctive birthmarks, scars, or other features that could identify a person?
- Embedded metadata: Is there EXIF data with patient names, dates, device IDs, or geolocation?
- Anonymization status: Are images truly anonymized (irreversible, to the standard set by the Article 29 Working Party in Opinion 05/2014 on anonymisation techniques) or merely pseudonymized?
- Legal basis for processing: What Art. 6 + Art. 9 bases exist? Options include:
- Art. 9(2)(a): Explicit consent (impractical at scale for retrospective datasets).
- Art. 9(2)(i): Public interest in public health (requires Member State law).
- Art. 9(2)(j): Scientific research (requires appropriate safeguards under Art. 89(1)).
- DPIA requirement: If the dataset contains health data or images of identifiable persons, a Data Protection Impact Assessment must be conducted under GP-052.
Practical rule: Assume dermatology images are personal data (and health data under Art. 9) unless irreversible anonymization can be demonstrated.
Required safeguards when processing health data for AI training (per Art. 89(1)):
- Pseudonymization or anonymization to the extent possible.
- Data minimization: exclude unnecessary personal or health data.
- Access controls: restrict access to the dataset to authorized personnel.
- Memorization testing: assess whether the trained model can reproduce training data.
- Documentation of the research purpose and its public benefit.
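As one concrete pseudonymization safeguard, identifying file names can be replaced with a keyed hash before images enter any pipeline (EXIF stripping itself typically requires an image library and is handled in the pipeline tooling). The sketch below is a minimal stdlib illustration; the key shown is a placeholder and must be generated and stored per GP-050.

```python
import hashlib
import hmac
from pathlib import Path

# Placeholder key for illustration only; manage and rotate per GP-050.
PSEUDONYMIZATION_KEY = b"replace-with-managed-secret"

def pseudonymize_filename(original_name: str) -> str:
    """Replace a possibly identifying file name with a keyed hash.
    A keyed HMAC (rather than a plain hash) prevents recovering names
    by brute-forcing common patterns without the key."""
    digest = hmac.new(
        PSEUDONYMIZATION_KEY, original_name.encode(), hashlib.sha256
    ).hexdigest()
    suffix = Path(original_name).suffix.lower()
    return digest[:16] + suffix

# Deterministic, so the same source file always maps to the same name:
pseudonymize_filename("smith_john_lesion_03.jpg")
```

Note that this is pseudonymization, not anonymization: whoever holds the key and the original files can re-link them, so the data remains in GDPR scope.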
Step 4: MDR / AI Act Documentation
For every approved dataset, document the following for inclusion in the technical file:
- Source identification: Where the data came from (platform, URL, access date).
- Legal basis: License type, TDM exception, consent status, or other applicable basis.
- Population representativeness: Demographics, skin types, conditions, geographic distribution.
- Labelling methodology: Who labelled the data, qualifications, inter-annotator agreement.
- Preprocessing steps: Any anonymization, pseudonymization, augmentation, or normalization applied.
- Bias assessment: Analysis of potential biases and mitigation measures.
- Version control: Dataset version and change tracking.
- Retention and deletion policy: How long data is retained and when it will be deleted.
This documentation is required by:
- MDR Annex I, Section 17 (safety and performance requirements for software).
- MDR Annex XIV (clinical evaluation data requirements).
- AI Act Art. 10 (data governance for high-risk AI systems).
- MDCG 2025-6 (training data documentation for AI medical devices).
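The Step 4 fields can be captured as a structured record so that each technical-file entry is complete and its gaps are machine-checkable. A minimal sketch follows; the field names are illustrative, and the authoritative template remains R-031-001.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class DatasetRecord:
    # Field names are illustrative; the official schema is R-031-001.
    source_url: str
    access_date: str            # ISO 8601, e.g. "2026-03-01"
    legal_basis: str            # license, TDM exception, consent, ...
    license: str
    representativeness: str     # demographics, skin types, geography
    labelling_methodology: str
    preprocessing: list[str] = field(default_factory=list)
    bias_assessment: str = ""
    version: str = "1.0"
    retention_policy: str = ""

    def missing_fields(self) -> list[str]:
        """Names of fields still left empty, for completeness review."""
        return [k for k, v in asdict(self).items() if v in ("", [])]

rec = DatasetRecord(
    source_url="https://example.org/derm-dataset",
    access_date="2026-03-01",
    legal_basis="Art. 4 DSM TDM exception (no opt-out found)",
    license="CC-BY-NC-4.0 (governing)",
    representativeness="",
    labelling_methodology="Two dermatologists; inter-annotator kappa documented",
)
assert "representativeness" in rec.missing_fields()
```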
Verification Before Use in the Device
A dataset evaluation report (R-031-001) approves a dataset for use in principle. However, before any approved dataset is incorporated into a model that will be deployed in the device, the following verification checklist must be completed and documented in the evaluation report:
| # | Verification Item | Responsible | Trigger |
|---|---|---|---|
| 1 | Evidence of lawful access and TDM opt-out checks archived with timestamps | JD-009 | At time of data collection |
| 2 | EXIF metadata stripped from all images | JD-009 | Before storage in any training pipeline |
| 3 | File names anonymized (any personal data in file names removed) | JD-009 | Before storage in any training pipeline |
| 4 | Images with identifiable features (faces, tattoos, etc.) flagged and assessed | JD-009 | Before use in any training pipeline |
| 5 | DPIA completed under GP-052 (if identifiable images are retained after item 4) | JD-003 | Before use in any training pipeline |
| 6 | Dataset provenance recorded in the technical file for notified body review | JD-009 | Before next notified body audit |
| 7 | Population representativeness assessed and documented under GP-028 | JD-009 | Before the trained model is used in the device |
| 8 | Bias assessment conducted and documented under GP-028 | JD-009 | Before the trained model is used in the device |
| 9 | TDM opt-out checks performed for all original sources (not just the first evaluated) | JD-009 | At time of data collection for each source |
Items 1–5 are blocking: the dataset must not be used in any training pipeline until they are verified.
Items 6–8 are required before deployment: the trained model must not be used in the device until they are documented.
Item 9 is ongoing: each new original source identified in the provenance chain must be checked at the time the data is actually downloaded.
The AI team (JD-009) updates the Verification Checklist section in the evaluation report (R-031-001) as each item is completed. JD-003 reviews the checklist before the trained model is deployed in the device.
Integration with GP-028
This procedure is triggered during the Data Collection phase of GP-028 (AI Development). Before any dataset is downloaded or incorporated into the training pipeline described in GP-028, a Dataset Usage Evaluation Report (R-031-001) must be completed and approved.
- GP-028 governs the technical AI development lifecycle (design, training, validation, release).
- GP-031 (this procedure) governs the legal and regulatory evaluation of each dataset.
- GP-013 governs risk management, including risks arising from dataset usage.
- GP-050 and GP-052 govern data protection and DPIA requirements.
Ongoing Monitoring
The legal landscape for AI training data is evolving rapidly. Key developments to monitor:
- CJEU Case C-250/25 (Like Company v. Google Ireland): Expected late 2026 or 2027, will set binding precedent on whether AI training engages reproduction rights and whether the TDM exception applies.
- Copyright Directive Review: Mandated by the DSM Directive for June 2026. The Commission will review the TDM exceptions and their interaction with AI.
- Spain's ECL Proposal: Extended collective licensing for AI training, in consultation since late 2024, final text not yet published as of March 2026.
- European Parliament proposals (February 2026): If enacted, could require itemized training content lists and remuneration obligations.
JD-003 is responsible for monitoring these developments and updating this procedure when the legal landscape changes materially.
License Analysis Reference
CC-BY-4.0 (Attribution)
- Commercial AI training: Permitted.
- Requirements: Attribution to the creator(s).
- TDM interaction: Art. 4 TDM also applies, providing an additional legal basis.
- Risk level: Low.
CC-BY-NC-4.0 (Attribution, Non-Commercial)
- Commercial AI training under the license alone: Not permitted. "NonCommercial" means uses not primarily intended for or directed toward commercial advantage or monetary compensation.
- TDM override: Art. 4 TDM can override the NC restriction if (a) access was lawful, (b) no machine-readable TDM opt-out exists.
- Critical check: Verify whether the dataset source has a machine-readable TDM reservation.
- Risk level: Medium. The TDM override argument is legally sound but has limited court precedent for this specific interaction.
CC-BY-SA-4.0 (Attribution, ShareAlike)
- Commercial AI training: Permitted (no NC restriction).
- ShareAlike implications: If the trained model or its outputs constitute an "adaptation," they must be shared under the same license. However, the prevailing view is that model weights are not an "adaptation" of individual training works. If Art. 4 TDM applies, the SA condition may not trigger.
- Risk level: Medium-low for training; higher if sharing the model publicly.
Apache 2.0
- Commercial AI training: Permitted.
- Requirements: Attribution, inclusion of license notice. Includes express patent grants from contributors.
- Risk level: Low.
MIT License
- Commercial AI training: Permitted.
- Requirements: Include copyright notice and license text.
- Caution: No express patent grant (unlike Apache 2.0).
- Risk level: Low.
Custom Academic / Institutional Licenses
- Must be evaluated case by case.
- Common restrictions include: "research use only," "non-commercial," "no redistribution."
- TDM exception may override "non-commercial" restrictions, but "no redistribution" of the dataset itself must still be respected.
- Risk level: High without individual review.
Records
| Record ID | Name | Description |
|---|---|---|
| R-031-001 | Dataset Usage Evaluation Report | Completed for each dataset or batch of related datasets evaluated for AI training use |
Associated Procedures
- GP-028 AI Development
- GP-013 Risk Management
- GP-050 Data Protection
- GP-052 Data Privacy Impact Assessment (DPIA)
Signature meaning
The signatures for the approval process of this document can be found in the verified commits of the QMS repository. For reference, the team members expected to participate in the approval of this document, and their roles as defined in Annex I (Responsibility Matrix) of GP-001, are:
- Author: JD-003 Design & Development Manager
- Reviewer: JD-004 Quality Manager & PRRC
- Approver: JD-001 General Manager