GP-031 Training Data Governance
Procedure Flowchart
Purpose
This procedure establishes the process for evaluating, approving, and documenting the legal, regulatory, and ethical basis for using datasets to train the AI models that power the device. It ensures that every dataset used for AI training has been assessed for copyright compliance, data protection requirements, and medical device regulatory obligations before it is incorporated into any training pipeline.
Team members regularly identify candidate datasets for AI training, and each dataset must be evaluated against a complex landscape of EU copyright law, GDPR, the Medical Device Regulation (MDR), and the AI Act before it can be used. This procedure provides a systematic, auditable framework for those evaluations.
Scope
This procedure applies to all datasets considered for use in training, validating, or testing the AI models developed by the company, including:
- Publicly available datasets (Kaggle, Hugging Face, RoboFlow, GitHub, academic repositories)
- Datasets obtained through clinical partnerships or data sharing agreements
- Datasets created internally from proprietary data
- Datasets derived from web scraping or automated collection
- Synthetic datasets generated from other data sources
- Pre-trained model weights or foundation models trained on third-party data
Out of scope: Datasets used solely for internal exploratory research that will never be used to train, validate, or test any model deployed in the device. However, if there is any possibility that research artifacts (weights, features, embeddings) may flow into production, this procedure applies.
Reference Documents
| Reference | Description |
|---|---|
| Directive (EU) 2019/790 | Copyright in the Digital Single Market (DSM Directive), Articles 3 and 4 (TDM exceptions) |
| Regulation (EU) 2024/1689 | EU Artificial Intelligence Act, Articles 10, 53 |
| Regulation (EU) 2016/679 | General Data Protection Regulation (GDPR), Articles 6, 9, 35, 89 |
| Regulation (EU) 2017/745 | Medical Device Regulation (MDR), Annex I, Annex XIV |
| Real Decreto-ley 24/2021 | Spain's transposition of the DSM Directive |
| MDCG 2025-6 (AIB 2025-1) | Guidance on AI Act / MDR interplay for AI medical devices |
| GPAI Code of Practice (July 2025) | EU AI Office code of practice for GPAI models |
| GP-028 | AI Development |
| GP-013 | Risk Management |
| GP-050 | Data Protection |
| GP-052 | Data Privacy Impact Assessment (DPIA) |
Terms and Definitions
- Text and Data Mining (TDM): Any automated analytical technique aimed at analysing text and data in digital form in order to generate information, including but not limited to patterns, trends, and correlations (DSM Directive, Art. 2(2)).
- TDM Opt-Out: A machine-readable reservation of rights by a rightsholder under Art. 4(3) of the DSM Directive, expressed through robots.txt, TDMRep protocol, HTML meta tags, or HTTP headers.
- Lawful Access: Access to content obtained without circumventing technical protection measures and without bypassing a paywall; paywalled content counts as lawfully accessed only if the user has paid for or been granted access.
- Machine-Readable Opt-Out: A TDM reservation expressed in a format that can be automatically detected by software, such as robots.txt directives, TDMRep headers, or HTML meta tags.
- Anonymization: Irreversible de-identification of personal data such that no means reasonably likely to be used can re-identify the data subject. Takes data outside GDPR scope.
- Pseudonymization: Replacement of direct identifiers with codes, where re-identification remains possible with the key. Data remains subject to GDPR.
- Dataset Provenance: The complete chain of custody and origin of a dataset, including original source, license history, and any re-uploads or modifications.
- CC-BY-4.0: Creative Commons Attribution 4.0 International license, which permits commercial use with attribution.
- CC-BY-NC: Creative Commons Attribution-NonCommercial license, which does not permit commercial use under the license terms alone.
Responsibilities
JD-009 (AI Team)
- Identify candidate datasets for AI training.
- Initiate dataset evaluation requests by providing dataset URLs, license information, and intended use.
- Implement approved dataset usage conditions (attribution, retention limits, segregation).
- Archive evidence of lawful access at the time of data collection (screenshots, robots.txt copies, license snapshots).
JD-003
- Perform the legal and regulatory analysis for each dataset evaluation request.
- Execute the 4-step evaluation framework (license check, TDM opt-out check, GDPR assessment, MDR/AI Act documentation).
- Author the Dataset Usage Evaluation Report (R-031-001).
- Monitor changes in the legal landscape and update this procedure when necessary.
JD-004
- Review the Dataset Usage Evaluation Report for completeness and compliance with QMS requirements.
JD-001
- Approve or reject the dataset usage based on the evaluation report.
- Accept any residual legal or regulatory risk associated with dataset usage.
Detailed Process
Overview
When any team member identifies a dataset that could be used for AI training, validation, or testing, the following 4-step evaluation framework must be completed before the dataset is downloaded or used. The results are documented in a Dataset Usage Evaluation Report (R-031-001).
Step 1: License Check
For each dataset, identify and document:
- The stated license on the platform where the dataset is found (e.g., Kaggle, RoboFlow, Hugging Face, GitHub).
- The original source license, if the dataset has been re-uploaded to a different platform. Published audits have found licensing errors in more than 50% of datasets on major hosting platforms, so always trace back to the original source.
- The governing license, which is the most restrictive license in the provenance chain.
| License | Commercial Training? | TDM Override? | Action Required |
|---|---|---|---|
| CC-BY-4.0 | Yes | Yes (additional basis) | Provide attribution |
| CC-BY-NC-4.0 | No (under license) | Yes, if no opt-out | Proceed to Step 2 |
| CC-BY-SA-4.0 | Yes | Yes | Attribution; assess SA implications if sharing model |
| CC-BY-NC-SA-4.0 | No (under license) | Yes, if no opt-out | Proceed to Step 2 |
| Apache 2.0 | Yes | N/A | Include license notice |
| MIT | Yes | N/A | Include copyright notice |
| Custom academic | Case by case | Possibly | Legal review required |
| No license stated | Unclear | Yes, if no opt-out | Proceed to Step 2; high risk |
Re-uploaded datasets: When a dataset appears on a platform (Kaggle, RoboFlow, Hugging Face) with a permissive license but originates from a source with a more restrictive license, the re-uploader may have violated the original license. In these cases:
- Always trace back to the original source and check its license.
- If the original is restrictive, treat it as the governing license.
- Rely on Art. 4 TDM as the legal basis (not the incorrectly re-applied license).
- Document the provenance chain thoroughly in the evaluation report.
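The "most restrictive license governs" rule can be sketched as a simple ranking lookup. This is a minimal illustration only: the restrictiveness ordering below is an assumption for the example, and the actual assessment is performed case by case by JD-003, not automated.

```python
# Sketch: pick the governing license from a provenance chain.
# The ranking is a simplified assumption for illustration; actual
# restrictiveness is assessed case by case by JD-003.
RESTRICTIVENESS = {
    "MIT": 1,
    "Apache-2.0": 1,
    "CC-BY-4.0": 2,
    "CC-BY-SA-4.0": 3,
    "CC-BY-NC-4.0": 4,
    "CC-BY-NC-SA-4.0": 5,
    "unknown": 6,  # no stated license: treat as most restrictive
}

def governing_license(provenance_chain: list[str]) -> str:
    """Return the most restrictive license found in the chain
    (original source first, re-uploads after)."""
    return max(
        provenance_chain,
        key=lambda lic: RESTRICTIVENESS.get(lic, RESTRICTIVENESS["unknown"]),
    )

# A re-upload claims CC-BY-4.0, but the original source is NC:
assert governing_license(["CC-BY-NC-4.0", "CC-BY-4.0"]) == "CC-BY-NC-4.0"
```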
Step 2: TDM Opt-Out Check
This step is required when the dataset's license does not permit commercial use (e.g., CC-BY-NC) or when the license is unclear.
Legal basis: Under Article 4 of the DSM Directive, anyone with lawful access may perform TDM for any purpose, including commercial, unless the rightsholder has opted out via machine-readable means. This is confirmed by:
- The Hamburg Higher Regional Court (OLG Hamburg, 5 U 104/24, December 2025), which held that human-readable disclaimers alone are insufficient for a valid opt-out.
- The EU AI Act, Art. 53(1)(c), which explicitly references Art. 4(3) of the DSM Directive.
- Creative Commons' own guidance, which states that statutory exceptions override CC license terms.
For each dataset source URL, check:
- robots.txt: Visit the source domain's `/robots.txt` and check for directives blocking AI/TDM crawlers.
- TDMRep protocol: Check for TDM Reservation Protocol headers or metadata.
- HTML meta tags: Check for meta tags signaling TDM reservation.
- Terms of service: Review ToS for TDM-related clauses (relevant as evidence, though human-readable ToS alone are insufficient for a valid opt-out).
- Archive evidence: Take timestamped screenshots and save copies of robots.txt, HTTP headers, and any relevant metadata at the time of access.
If a machine-readable opt-out is found: The dataset cannot be used under TDM. Either seek a commercial license from the rightsholder or use an alternative dataset.
If no machine-readable opt-out is found: Art. 4 TDM permits the use regardless of the license's non-commercial restriction. Document the absence of opt-out and proceed to Step 3.
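The machine-readable checks above can be partially automated with the Python standard library. The sketch below parses a robots.txt body for directives blocking known AI crawlers and scans HTML for a TDMRep-style reservation meta tag. The crawler tokens listed are an illustrative, non-exhaustive assumption, and rightsholders express reservations in varied ways, so this supplements rather than replaces manual review and evidence archiving.

```python
from html.parser import HTMLParser
from urllib import robotparser

# Illustrative, non-exhaustive list of AI/TDM crawler tokens.
AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def robots_blocks_ai(robots_txt: str, url: str) -> bool:
    """True if any known AI crawler is disallowed from `url`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return any(not rp.can_fetch(ua, url) for ua in AI_USER_AGENTS)

class TDMMetaScanner(HTMLParser):
    """Flags <meta name="tdm-reservation" content="1"> (TDMRep convention)."""
    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name") == "tdm-reservation" and d.get("content") == "1":
            self.reserved = True

robots = "User-agent: GPTBot\nDisallow: /\n"
assert robots_blocks_ai(robots, "https://example.org/dataset/")

scanner = TDMMetaScanner()
scanner.feed('<html><head><meta name="tdm-reservation" content="1"></head></html>')
assert scanner.reserved
```

A positive result from either check means a machine-readable opt-out may exist and the source must be escalated to JD-003 before use.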
Step 3: GDPR Assessment
This step is required for all datasets, regardless of license. It is especially critical for datasets containing images of people.
Evaluate:
- Identifiable features: Do images contain faces, tattoos, distinctive birthmarks, scars, or other features that could identify a person?
- Embedded metadata: Is there EXIF data with patient names, dates, device IDs, or geolocation?
- Anonymization status: Are images truly anonymized (irreversible, to the standard set by the Article 29 Working Party in Opinion 05/2014 on anonymisation techniques) or merely pseudonymized?
- Legal basis for processing: What Art. 6 + Art. 9 bases exist? Options include:
- Art. 9(2)(a): Explicit consent (impractical at scale for retrospective datasets).
- Art. 9(2)(i): Public interest in public health (requires Member State law).
- Art. 9(2)(j): Scientific research (requires appropriate safeguards under Art. 89(1)).
- DPIA requirement: If the dataset contains health data or images of identifiable persons, a Data Protection Impact Assessment must be conducted under GP-052.
Practical rule: Assume dermatology images are personal data (and health data under Art. 9) unless irreversible anonymization can be demonstrated.
Required safeguards when processing health data for AI training (per Art. 89(1)):
- Pseudonymization or anonymization to the extent possible.
- Data minimization: exclude unnecessary personal or health data.
- Access controls: restrict access to the dataset to authorized personnel.
- Memorization testing: assess whether the trained model can reproduce training data.
- Documentation of the research purpose and its public benefit.
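As one concrete pseudonymization safeguard, identifying file names can be replaced with a keyed hash before images enter any pipeline (EXIF stripping itself typically requires an image library and is handled in the pipeline tooling). The sketch below is a minimal stdlib illustration; the key shown is a placeholder and must be generated and stored per GP-050.

```python
import hashlib
import hmac
from pathlib import Path

# Placeholder key for illustration only; manage and rotate per GP-050.
PSEUDONYMIZATION_KEY = b"replace-with-managed-secret"

def pseudonymize_filename(original_name: str) -> str:
    """Replace a possibly identifying file name with a keyed hash.
    A keyed HMAC (rather than a plain hash) prevents recovering names
    by brute-forcing common patterns without the key."""
    digest = hmac.new(
        PSEUDONYMIZATION_KEY, original_name.encode(), hashlib.sha256
    ).hexdigest()
    suffix = Path(original_name).suffix.lower()
    return digest[:16] + suffix

# Deterministic, so the same source file always maps to the same name:
pseudonymize_filename("smith_john_lesion_03.jpg")
```

Note that this is pseudonymization, not anonymization: whoever holds the key and the original files can re-link them, so the data remains in GDPR scope.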
Step 4: MDR / AI Act Documentation
For every approved dataset, document the following for inclusion in the technical file:
- Source identification: Where the data came from (platform, URL, access date).
- Legal basis: License type, TDM exception, consent status, or other applicable basis.
- Population representativeness: Demographics, skin types, conditions, geographic distribution.
- Labelling methodology: Who labelled the data, qualifications, inter-annotator agreement.
- Preprocessing steps: Any anonymization, pseudonymization, augmentation, or normalization applied.
- Bias assessment: Analysis of potential biases and mitigation measures.
- Version control: Dataset version and change tracking.
- Retention and deletion policy: How long data is retained and when it will be deleted.
This documentation is required by:
- MDR Annex I, Section 17 (safety and performance requirements for software).
- MDR Annex XIV (clinical evaluation data requirements).
- AI Act Art. 10 (data governance for high-risk AI systems).
- MDCG 2025-6 (training data documentation for AI medical devices).
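The Step 4 fields can be captured as a structured record so that each technical-file entry is complete and its gaps are machine-checkable. A minimal sketch follows; the field names are illustrative, and the authoritative template remains R-031-001.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class DatasetRecord:
    # Field names are illustrative; the official schema is R-031-001.
    source_url: str
    access_date: str            # ISO 8601, e.g. "2026-03-01"
    legal_basis: str            # license, TDM exception, consent, ...
    license: str
    representativeness: str     # demographics, skin types, geography
    labelling_methodology: str
    preprocessing: list[str] = field(default_factory=list)
    bias_assessment: str = ""
    version: str = "1.0"
    retention_policy: str = ""

    def missing_fields(self) -> list[str]:
        """Names of fields still left empty, for completeness review."""
        return [k for k, v in asdict(self).items() if v in ("", [])]

rec = DatasetRecord(
    source_url="https://example.org/derm-dataset",
    access_date="2026-03-01",
    legal_basis="Art. 4 DSM TDM exception (no opt-out found)",
    license="CC-BY-NC-4.0 (governing)",
    representativeness="",
    labelling_methodology="Two dermatologists; inter-annotator kappa documented",
)
assert "representativeness" in rec.missing_fields()
```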
Verification Before Use in the Device
A dataset evaluation report (R-031-001) approves a dataset for use in principle. However, before any approved dataset is incorporated into a model that will be deployed in the device, the following verification checklist must be completed and documented in the evaluation report:
| # | Verification Item | Responsible | Trigger |
|---|---|---|---|
| 1 | Evidence of lawful access and TDM opt-out checks archived with timestamps | JD-009 | At time of data collection |
| 2 | EXIF metadata stripped from all images | JD-009 | Before storage in any training pipeline |
| 3 | File names anonymized (any personal data in file names removed) | JD-009 | Before storage in any training pipeline |
| 4 | Images with identifiable features (faces, tattoos, etc.) flagged and assessed | JD-009 | Before use in any training pipeline |
| 5 | DPIA completed under GP-052 (if identifiable images are retained after item 4) | JD-003 | Before use in any training pipeline |
| 6 | Dataset provenance recorded in the technical file for notified body review | JD-009 | Before next notified body audit |
| 7 | Population representativeness assessed and documented under GP-028 | JD-009 | Before the trained model is used in the device |
| 8 | Bias assessment conducted and documented under GP-028 | JD-009 | Before the trained model is used in the device |
| 9 | TDM opt-out checks performed for all original sources (not just the first evaluated) | JD-009 | At time of data collection for each source |
Items 1–5 are blocking: the dataset must not be used in any training pipeline until they are verified.
Items 6–8 are required before deployment: the trained model must not be used in the device until they are documented.
Item 9 is ongoing: each new original source identified in the provenance chain must be checked at the time the data is actually downloaded.
The AI team (JD-009) updates the Verification Checklist section in the evaluation report (R-031-001) as each item is completed. JD-003 reviews the checklist before the trained model is deployed in the device.
Integration with GP-028
This procedure is triggered during the Data Collection phase of GP-028 (AI Development). Before any dataset is downloaded or incorporated into the training pipeline described in GP-028, a Dataset Usage Evaluation Report (R-031-001) must be completed and approved.
- GP-028 governs the technical AI development lifecycle (design, training, validation, release).
- GP-031 (this procedure) governs the legal and regulatory evaluation of each dataset.
- GP-013 governs risk management, including risks arising from dataset usage.
- GP-050 and GP-052 govern data protection and DPIA requirements.
Ongoing Monitoring
The legal landscape for AI training data is evolving rapidly. Key developments to monitor:
- CJEU Case C-250/25 (Like Company v. Google Ireland): Expected late 2026 or 2027, will set binding precedent on whether AI training engages reproduction rights and whether the TDM exception applies.
- Copyright Directive Review: Mandated by the DSM Directive for June 2026. The Commission will review the TDM exceptions and their interaction with AI.
- Spain's ECL Proposal: Extended collective licensing for AI training, in consultation since late 2024, final text not yet published as of March 2026.
- European Parliament proposals (February 2026): If enacted, could require itemized training content lists and remuneration obligations.
JD-003 is responsible for monitoring these developments and updating this procedure when the legal landscape changes materially.
License Analysis Reference
CC-BY-4.0 (Attribution)
- Commercial AI training: Permitted.
- Requirements: Attribution to the creator(s).
- TDM interaction: Art. 4 TDM also applies, providing an additional legal basis.
- Risk level: Low.
CC-BY-NC-4.0 (Attribution, Non-Commercial)
- Commercial AI training under the license alone: Not permitted. "NonCommercial" means uses not primarily intended for or directed toward commercial advantage or monetary compensation.
- TDM override: Art. 4 TDM can override the NC restriction if (a) access was lawful, (b) no machine-readable TDM opt-out exists.
- Critical check: Verify whether the dataset source has a machine-readable TDM reservation.
- Risk level: Medium. The TDM override argument is legally sound but has limited court precedent for this specific interaction.
CC-BY-SA-4.0 (Attribution, ShareAlike)
- Commercial AI training: Permitted (no NC restriction).
- ShareAlike implications: If the trained model or its outputs constitute an "adaptation," they must be shared under the same license. However, the prevailing view is that model weights are not an "adaptation" of individual training works. If Art. 4 TDM applies, the SA condition may not trigger.
- Risk level: Medium-low for training; higher if sharing the model publicly.
Apache 2.0
- Commercial AI training: Permitted.
- Requirements: Attribution, inclusion of license notice. Includes express patent grants from contributors.
- Risk level: Low.
MIT License
- Commercial AI training: Permitted.
- Requirements: Include copyright notice and license text.
- Caution: No express patent grant (unlike Apache 2.0).
- Risk level: Low.
Custom Academic / Institutional Licenses
- Must be evaluated case by case.
- Common restrictions include: "research use only," "non-commercial," "no redistribution."
- TDM exception may override "non-commercial" restrictions, but "no redistribution" of the dataset itself must still be respected.
- Risk level: High without individual review.
Records
| Record ID | Name | Description |
|---|---|---|
| R-031-001 | Dataset Usage Evaluation Report | Completed for each dataset or batch of related datasets evaluated for AI training use |
Associated Procedures
- GP-028 AI Development
- GP-013 Risk Management
- GP-050 Data Protection
- GP-052 Data Privacy Impact Assessment (DPIA)
Signature meaning
The signatures for the approval process of this document can be found in the verified commits of the QMS repository. For reference, the team members expected to participate in the approval of this document, and their roles as defined in Annex I (Responsibility Matrix) of GP-001, are:
- Author: JD-003 Design & Development Manager
- Reviewer: JD-004 Quality Manager & PRRC
- Approver: JD-001 General Manager