R-TF-028-004 Data Annotation Instructions - ICD-11 Mapping

Table of contents

Context
- Data Sources Summary
Objectives
Annotation Personnel
- Primary Annotation Role: JD-009 Medical Data Scientist
  - Qualifications
  - Responsibilities
- Supporting Clinical Role: JD-022 Medical Manager
  - Qualifications
  - Responsibilities
Annotation Protocol
Quality Control and Review
Version Control and Traceability
Dataset Processing Workflow

Context

The Legit.Health Plus device development dataset, known as LegitHealth-DX, is compiled from multiple heterogeneous sources and prepared first for the development of the ICD Category Distribution model, which requires the correct labelling of the diagnoses according to the ICD-11 classification system. This document focuses strictly on that.

The sources of the data include:

Archive Data: Images of skin lesions originating from repositories with diagnostic confirmations.
Custom Gathered Data: Clinical studies and prospectively collected datasets.

Each source provides diagnostic labels in various formats and nomenclatures. Some archive sources may use abbreviated terms (e.g., "BCC", "SCC"), common names, alternative spellings (e.g., "Hemangioma" vs "Haemangioma"), or legacy coding systems, while custom gathered data may use more structured diagnoses or other standardized terminologies. Each source may provide a single label for each image or several (a differential diagnosis).

To ensure consistency, clinical validity, and regulatory compliance across all data sources, all diagnostic labels must be mapped to a single, standardized classification system: ICD-11 (International Classification of Diseases, 11th Revision). In the case of differential diagnoses, each item of the diagnosis is mapped to its corresponding ICD-11 category.

This document describes the formal, multi-stage process for standardizing and mapping all diagnosis labels from these heterogeneous data sources to their corresponding ICD-11 categories. The mapping will be performed by the Medical Data Science (MDS) team who will review each unique diagnosis string present in the merged dataset and assign the appropriate ICD-11 code and description based on established medical literature, clinical guidelines, and the official ICD-11 classification system and Browser. Dermatologists will also be involved in the review and validation of the mappings to ensure clinical accuracy.

Data Sources Summary

The following table summarizes all data sources gathered and described in the R-TF-028-003 collection documents:

ID	Dataset Name	Type	Description	ICD-11 Mapping	Crops	Diff. Dx	Sex	Age
1	Torrejon-HCP-diverse-conditions	Multiple	Dataset of skin images by physicians with good photographic skills	✓ Yes	Varies	✓	✓	✓
2	Abdominal-skin	Archive	Small dataset of abdominal pictures with segmentation masks for `Non-specific lesion` class	✗ No	Yes (programmatic)	—	—	—
3	Basurto-Cruces-Melanoma	Custom gathered	Clinical validation study dataset (`MC EVCDAO 2019`)	✓ Yes	Yes (in-house crops)	—	✓	✓
4	BI-GPP (batch 1)	Archive	Small set of GPP images from Boehringer Ingelheim (first batch)	✓ Yes	No	—	—	—
5	BI-GPP (batch 2)	Archive	Large dataset of GPP images from Boehringer Ingelheim (second batch)	✓ Yes	Yes (programmatic)	—	✓	✓
6	Chiesa-dataset	Archive	Sample of head and neck lesions (Medela et al., 2024)	✓ Yes	Yes (in-house crops)	—	◐	◐
7	Figaro 1K	Archive	Hair style classification and segmentation dataset, repurposed for `Non-specific finding`	✗ No	Yes (in-house crops)	—	—	—
8	Hand Gesture Recognition (HGR)	Archive	Small dataset of hands repurposed for non-specific images	✗ No	Yes (programmatic)	—	—	—
9	IDEI 2024 (pigmented)	Archive	Prospective and retrospective studies at IDEI (DERMATIA project), pigmented lesions only	✓ Yes	Yes (programmatic)	—	✓	◐
10	Manises-HS	Archive	Large collection of hidradenitis suppurativa images	✗ No	Not yet	—	✓	✓
11	Nails segmentation	Archive	Small nail segmentation dataset repurposed for `non-specific lesion`	✗ No	Yes (programmatic)	—	—	—
12	Non-specific lesion V2	Archive	Small representative collection repurposed for `non-specific lesion`	✗ No	Yes (programmatic)	—	—	—
13	Osakidetza-derivation	Archive	Clinical validation study dataset (`DAO Derivación O 2022`)	✓ Yes	Yes (in-house crops)	◐	✓	✓
14	Ribera ulcers	Archive	Collection of ulcer images from Ribera Salud	✗ No	Yes (from wound masks, not all)	—	—	—
15	Transient Biometrics Nails V1	Archive	Biometric dataset of nail images	✗ No	Yes (programmatic)	—	—	—
16	Transient Biometrics Nails V2	Archive	Biometric dataset of nail images	✗ No	No (close-ups)	—	—	—
17	WoundsDB	Archive	Small chronic wounds database	✓ Yes	No	—	✓	◐
18	Clinica Dermatologica Internacional - Acne	Custom gathered	Compilation of images from CDI's acne patients with IGA labels	✓ Yes	No	—	—	—

Total datasets: 52 | With ICD-11 mapping: 38

Legend: ✓ = Yes | ◐ = Partial/Pending | — = No

Objectives

The primary objectives of this annotation procedure are:

To create a definitive, standardized "Visible ICD-11" mapping table that formally links every unique diagnostic label string from retrospective and prospective datasets to visually-determined diagnostic categories. Each "Visible ICD-11" category may correspond to a single ICD-11 code or to an array of multiple ICD-11 codes that share indistinguishable or highly similar visual features. In other words, one or more diagnostic items can be mapped to the same Visible ICD-11 category.
To ensure this category mapping is clinically accurate, consistent, and justifiable based on current medical knowledge and ICD-11 classification guidelines, while recognizing the limitations of visual assessment alone.
To resolve ambiguities and variations in diagnostic nomenclature (e.g., "Hemangioma" → "Haemangioma" → standardized Target Name → ICD-11 code) for a unified diagnostic vocabulary.
To identify and consolidate diagnostically distinct conditions that cannot be reliably distinguished based on visual features alone, preventing the model from learning spurious artifacts. For example, contact dermatitis and atopic dermatitis have different ICD-11 codes but share overlapping visual presentations; these are consolidated into a single "Visible ICD-11" target category (e.g., "Eczematous dermatitis") that encompasses both conditions. The final differentiation between such conditions is the responsibility of the healthcare professional, who has access to additional clinical information (patient history, symptoms, triggering factors, etc.) beyond what is visible in the image.
To produce a version-controlled artifact that serves as the ground truth diagnostic classification for all images in the development dataset.

Annotation Personnel

Primary Annotation Role: JD-009 Medical Data Scientist

Qualifications

Required: Position JD-009 as defined in the organizational structure, with expertise in medical data processing and standardization.
Recommended: Experience with medical terminologies, classification systems (ICD-10, ICD-11, SNOMED CT), and dermatological datasets.
Required Knowledge: Understanding of dermatological conditions and their visual manifestations sufficient to perform initial mapping decisions.

Responsibilities

The JD-009 Medical Data Scientist performs the following processing work:

To review the complete list of unique diagnosis strings extracted from all dataset sources.
To assign the appropriate "Visible ICD-11" category name and ICD-11 code(s) to each unique diagnosis string, leveraging the existing diagnostic labels already present in the source datasets.
To identify synonyms, abbreviations, and spelling variations and map them to standardized categories.
To use the official ICD-11 browser, medical literature, and clinical resources to perform initial mappings.
To identify cases requiring clinical consultation (e.g., decisions about merging visually indistinguishable conditions, ambiguous diagnoses, or complex differential diagnoses).
To document all mapping decisions and maintain the master mapping table.
To coordinate with the dermatologist for validation of clinically complex or ambiguous mappings.

Supporting Clinical Role: JD-022 Medical Manager

Qualifications

Required: Board-certified dermatologist.
Recommended: Extensive clinical experience (>10 years) in diagnosing a comprehensive range of dermatological diseases, including neoplastic, inflammatory, and infectious conditions.

Responsibilities

The dermatologist provides clinical expertise for specific decisions, including:

To provide clinical consultation on ambiguous or complex mapping decisions identified by the data scientist.
To validate decisions regarding the consolidation of multiple ICD-11 codes into single "Visible ICD-11" categories when conditions cannot be reliably distinguished based on visual features alone.
To resolve dermatology-related doubts about differential diagnoses, overlapping conditions, or borderline cases.
To review and approve category mergers and exclusions proposed by the data scientist.
To provide written justification referencing medical literature for clinically complex mappings.
To conduct periodic quality control reviews of completed mappings to ensure clinical accuracy.

Annotation Protocol

The creation of the ICD-11 mapping follows a structured, multi-step process that integrates data from multiple sources into a unified, standardized dataset.

Data Source Processing and Label Extraction

For each new data source X added to LegitHealth-DX, the MDS Team will:

Create a source-specific processing script that handles the content of dataset X, including: diagnosis labels, patient sex and age, image type, and so on. This results in a dataset CSV (X_dataset.csv) with standardized metadata including image paths, diagnostic labels as they appear in the original source, plus some basic clinical metadata (sex, age).
Extract all unique diagnosis strings from the dataset CSV and create a source-specific renaming file (X_renaming.csv) containing all unique diagnostic labels from that source.
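
The two steps above can be sketched as follows. This is a minimal illustration, not the MDS team's actual script: the column names (`diagnosis`, `source_label`, `target_name`, `icd11_codes`, `notes`) and the `;` separator for differential diagnoses are assumptions made for the example.

```python
import csv
import io


def extract_unique_labels(dataset_csv, label_column="diagnosis", separator=";"):
    """Return the sorted unique diagnosis strings found in a dataset CSV.

    `label_column` and `separator` are hypothetical. A cell may hold a single
    label or several (differential diagnosis) joined by `separator`.
    Labels are kept as written in the source; only outer whitespace is removed.
    """
    labels = set()
    for row in csv.DictReader(dataset_csv):
        for item in row[label_column].split(separator):
            item = item.strip()
            if item:
                labels.add(item)
    return sorted(labels)


def write_renaming_csv(labels, out_file):
    """Write one row per unique source label, leaving the mapping columns empty."""
    writer = csv.writer(out_file)
    writer.writerow(["source_label", "target_name", "icd11_codes", "notes"])
    for label in labels:
        writer.writerow([label, "", "", ""])


# Tiny in-memory stand-in for X_dataset.csv (illustrative rows only).
example_dataset = io.StringIO(
    "image_path,diagnosis,sex,age\n"
    "img/001.jpg,BCC,F,71\n"
    "img/002.jpg,Hemangioma; Haemangioma,M,4\n"
    "img/003.jpg,BCC,M,65\n"
)
unique_labels = extract_unique_labels(example_dataset)

# Stand-in for X_renaming.csv.
renaming_file = io.StringIO()
write_renaming_csv(unique_labels, renaming_file)
```

Spelling variants such as "Hemangioma" and "Haemangioma" are deliberately kept as separate rows at this stage, since unifying them is a mapping decision made later in the master spreadsheet.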

Master Mapping Document Preparation

The MDS team will then merge all source-specific renaming files (X_renaming.csv) into a single master mapping Google spreadsheet ("LegitHealth-DX ICD category management"), with a dedicated tab for each data source containing:

Source Label: The exact diagnostic string as it appears in the source dataset.
Target Name: The standardized "Visible ICD-11" category name.
ICD-11 Code(s): The official ICD-11 code(s) given to this visible category, e.g. "2C30" for "Basal cell carcinoma of skin".
Notes/justification: Comments, rationale, or literature references explaining the mapping and any consolidation decisions.

Target names and ICD-11 codes will be pre-filled by the MDS team based on visual features that can be reliably determined from images, using clinical knowledge, medical literature, and the official ICD-11 Browser. If some diagnosis strings refer to categories that are visually indistinguishable from each other and require additional clinical context to differentiate, they will all be mapped to the same Target Name.

This master spreadsheet serves as the central repository for all diagnostic mappings across all data sources.

Medical Review of "Visible ICD-11" Assignments

The designated Medical Expert(s) will review each tab of the master spreadsheet. For each unique diagnosis string from every source, the expert will:

Review and confirm the assigned Target Name (standardized "Visible ICD-11" category name), and modify it if needed.
Review and confirm the assigned ICD-11 Code(s), and modify them if needed.
Document justification for any ambiguous cases, multiple possible mappings, consolidation of multiple ICD-11 codes into a single visible category, or when clinical judgment was required.

Mapping Guidelines

Abbreviations and Acronyms: Map common abbreviations to their full clinical equivalents (e.g., "BCC" → "Basal cell carcinoma of skin" → ICD-11 code 2C30).
Synonyms and Variants: Multiple diagnosis strings that refer to the same condition should be mapped to the same Target Name and ICD-11 code(s), e.g. both "BCC" and "Basalioma" are mapped to "Basal cell carcinoma of skin".
Legacy Coding Systems: For labels using older classification systems (ICD-10, SNOMED, etc.), translate to the corresponding ICD-11 code(s) using official crosswalk tables when available, verified by clinical expertise.
Spelling Variations: Handle alternative spellings consistently (e.g., "Hemangioma" vs "Haemangioma" → same Target Name).
Visually Indistinguishable Conditions: When multiple distinct ICD-11 diagnoses share the same or highly similar visual presentations and cannot be reliably differentiated from images alone, they should be consolidated into a single "Visible ICD-11" category by assigning the same Target Name, which should reflect the broader category (e.g., "Eczematous dermatitis" for both contact and atopic dermatitis).
- Document the clinical rationale for consolidation and specify what additional information healthcare professionals would need to make the final differentiation (e.g., patient history, allergen exposure, chronicity).
Ambiguous Labels: When a diagnosis string is ambiguous or could map to multiple ICD-11 codes that are not visually similar, the expert should:
- Select the most clinically appropriate and specific code based on available context.
- Document the rationale in the justification column.
Non-specific or Incomplete Labels: If a diagnosis string is too vague to map to a specific ICD-11 code, map to the most appropriate parent category and document the limitation.
Exclusions: Assign "-" as the Target Name for images that should be excluded from the dataset (e.g., poor quality, non-dermatological content, or images that cannot be reliably diagnosed).
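
As a small illustration of how these guidelines shape the mapping table, the sketch below groups source labels by Target Name. The label pairs reuse the examples given in this document. "Blurry photo" is a made-up label for the exclusion case, and the function name is hypothetical.

```python
from collections import defaultdict

EXCLUDED = "-"  # Target Name reserved for exclusions

# (source label, Target Name) pairs, as they could appear in a spreadsheet tab.
ROWS = [
    ("BCC", "Basal cell carcinoma of skin"),           # abbreviation
    ("Basalioma", "Basal cell carcinoma of skin"),     # synonym
    ("Hemangioma", "Haemangioma"),                     # spelling variant
    ("Haemangioma", "Haemangioma"),
    ("Contact dermatitis", "Eczematous dermatitis"),   # visually indistinguishable
    ("Atopic dermatitis", "Eczematous dermatitis"),
    ("Blurry photo", EXCLUDED),                        # hypothetical excluded label
]


def group_by_target(rows):
    """Return {Target Name: sorted source labels}, skipping excluded rows."""
    groups = defaultdict(list)
    for source_label, target_name in rows:
        if target_name != EXCLUDED:
            groups[target_name].append(source_label)
    return {target: sorted(labels) for target, labels in groups.items()}


groups = group_by_target(ROWS)
```

Listing the source labels behind each Target Name in this way makes it easier for a reviewer to spot a label that was grouped into the wrong category.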

Category Management and Refinement

After initial mapping, an iterative review process is conducted by the MDS team and the Medical Expert(s) to manage the complete ICD-11 category set:

Category Consolidation: MDS team and Medical Expert(s) review the complete list of mapped categories to identify:
- Redundant categories that should be merged (e.g., closely related diagnostic terms).
- Categories that should be excluded due to insufficient clinical relevance or data quality.
Updates to the master spreadsheet: All category-level decisions (mergers, exclusions, corrections) must be documented.

Important: Any changes to Target Names must be manually applied across all relevant tabs in the master spreadsheet to ensure consistency.
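
Because renames are applied by hand, a simple cross-tab check can help catch a tab that was missed. The sketch below is one possible approach, not an existing project script: it flags any ICD-11 code string that appears under more than one Target Name. The tab names, row keys, and function name are assumptions.

```python
from collections import defaultdict


def find_diverging_names(tabs):
    """Return {codes: {Target Name: [tab names]}} for code strings that
    appear under more than one Target Name across the given tabs.

    `tabs` maps a tab name to a list of row dicts with the hypothetical keys
    "source_label", "target_name" and "icd11_codes". Rows without codes
    (for example exclusions) are ignored.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for tab_name, rows in tabs.items():
        for row in rows:
            codes = row["icd11_codes"].strip()
            if codes:
                seen[codes][row["target_name"].strip()].add(tab_name)
    return {
        codes: {name: sorted(tab_names) for name, tab_names in by_name.items()}
        for codes, by_name in seen.items()
        if len(by_name) > 1
    }


# Illustrative tabs: the rename to "... of skin" was applied in one tab only.
example_tabs = {
    "source_a": [
        {"source_label": "BCC",
         "target_name": "Basal cell carcinoma of skin",
         "icd11_codes": "2C30"},
    ],
    "source_b": [
        {"source_label": "Basalioma",
         "target_name": "Basal cell carcinoma",
         "icd11_codes": "2C30"},
        {"source_label": "Blurry photo", "target_name": "-", "icd11_codes": ""},
    ],
}
diverging = find_diverging_names(example_tabs)
```

A check like this does not replace the manual update. It only reports where the tabs disagree, so that the rename can be completed.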

ICD-11 Code Validation

To ensure all categories have valid ICD-11 codes:

The MDS Team runs automated scripts to verify that every Target Name in the master spreadsheet has an assigned ICD-11 code.
Any strings with missing codes are flagged and reviewed for completion.

The ICD-11 API is used to validate code accuracy and retrieve official descriptions for all codes in the mapping, whether single codes or arrays of multiple codes.
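
The completeness check described above could look like the sketch below. It covers only the offline part (flagging strings with no code). The call to the ICD-11 API is not shown, because its endpoints and credentials are not described in this document. The row keys, the `,` separator for multi-code categories, and the function name are assumptions.

```python
def find_missing_codes(rows, excluded="-", separator=","):
    """Return the source labels whose Target Name has no ICD-11 code assigned.

    Rows marked with the exclusion Target Name are skipped, since excluded
    images do not need a code. `rows` is a list of dicts with the hypothetical
    keys "source_label", "target_name" and "icd11_codes".
    """
    missing = []
    for row in rows:
        if row["target_name"].strip() == excluded:
            continue
        codes = [c.strip() for c in row["icd11_codes"].split(separator) if c.strip()]
        if not codes:
            missing.append(row["source_label"])
    return missing


# Illustrative rows; only "2C30" is a code taken from this document.
example_rows = [
    {"source_label": "BCC",
     "target_name": "Basal cell carcinoma of skin",
     "icd11_codes": "2C30"},
    {"source_label": "Hemangioma", "target_name": "Haemangioma", "icd11_codes": ""},
    {"source_label": "Blurry photo", "target_name": "-", "icd11_codes": ""},
]
flagged = find_missing_codes(example_rows)
```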

Dataset Generation and Finalization

Once all mappings are complete and validated, the MDS team runs a script to:

Combine all dataset CSVs (X_dataset.csv) into a single file.
Download the latest version of the master spreadsheet.
Apply all mappings to convert source-specific labels to standardized "Visible ICD-11" target categories.
Generate the unified LegitHealth-DX dataset with standardized labels: Images are organized into folders by their "Visible ICD-11" category name.
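
A minimal sketch of the combine-and-map step is shown below, assuming one label per image and the hypothetical column names `image_path` and `diagnosis`. It computes the destination folder for each image without copying any files. Downloading the spreadsheet is out of scope here, so the mapping is passed in as a plain dictionary.

```python
import csv
import io
from pathlib import PurePosixPath


def apply_mapping(dataset_csvs, mapping, excluded="-"):
    """Combine dataset CSVs and convert source labels to Target Names.

    Returns (unified_rows, unmapped_labels). Excluded rows are dropped, and
    labels absent from `mapping` are reported instead of silently skipped.
    """
    unified, unmapped = [], set()
    for dataset in dataset_csvs:
        for row in csv.DictReader(dataset):
            label = row["diagnosis"].strip()
            target = mapping.get(label)
            if target is None:
                unmapped.add(label)
            elif target != excluded:
                file_name = PurePosixPath(row["image_path"]).name
                unified.append({
                    "image_path": row["image_path"],
                    "target_name": target,
                    # Images are organized into one folder per Target Name.
                    "destination": str(PurePosixPath(target) / file_name),
                })
    return unified, sorted(unmapped)


# Illustrative inputs standing in for two X_dataset.csv files.
dataset_a = io.StringIO(
    "image_path,diagnosis\nimg/001.jpg,BCC\nimg/002.jpg,Blurry photo\n"
)
dataset_b = io.StringIO(
    "image_path,diagnosis\nb/7.png,Basalioma\nb/8.png,Unknown thing\n"
)
example_mapping = {
    "BCC": "Basal cell carcinoma of skin",
    "Basalioma": "Basal cell carcinoma of skin",
    "Blurry photo": "-",
}
unified, unmapped = apply_mapping([dataset_a, dataset_b], example_mapping)
```

Reporting unmapped labels separately is a design choice made for this example: it turns a label missing from the spreadsheet into something visible that can be fixed, rather than a silently dropped image.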

The complete mapping (including both single-code and multi-code categories) is version-controlled and documented by taking a snapshot of the master spreadsheet and saving it among the LegitHealth-DX dataset metadata files.

Quality Control and Review

To ensure the highest level of clinical accuracy and robustness, the following quality control steps are implemented:

Primary Review: The completed mapping is reviewed by another JD-009 Medical Data Scientist who has not taken part in the creation process, to ensure completeness and internal consistency.
Secondary Review: The completed and justified mapping is independently reviewed by a board-certified dermatologist (JD-022 Medical Manager) who was not involved in the initial annotation.
Consensus Resolution: Any discrepancies between the primary annotation and the secondary review are resolved in favor of the secondary review.
Automated Validation: The ICD-11 API is used to programmatically validate all assigned codes and ensure they correspond to valid ICD-11 categories.
Final Approval: The consensus-driven mapping is formally approved and version-controlled. This finalized mapping serves as the definitive ground truth diagnostic classification for all images in the LegitHealth-DX dataset.

Version Control and Traceability

Each iteration of the LegitHealth-DX dataset is assigned a version number (e.g., DXv27.1). For each version:

The complete ICD-11 mapping master spreadsheet is downloaded and archived with the corresponding version identifier.
All processing scripts, renaming files, and dataset metadata are version-controlled in the project repository.
The mapping between source labels and standardized ICD-11 categories is fully traceable through the master spreadsheet and its source-specific tabs.
Changes to category names, mergers, or exclusions are documented in the master spreadsheet.

Dataset Processing Workflow

Signature meaning

The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:

Author: Team members involved
Reviewer: JD-003, JD-004
Approver: JD-001
