R-TF-015-006 Clinical Investigation Report
Research Title
Evaluation of AIHS4 Performance in the M-27134-01 Clinical Trial for Hidradenitis Suppurativa
Product Identification
| Information | |
|---|---|
| Device name | Legit.Health Plus (hereinafter, the device) |
| Model and type | NA |
| Version | 1.1.0.0 |
| Basic UDI-DI | 8437025550LegitCADx6X |
| Certificate number (if available) | MDR 792790 |
| EMDN code(s) | Z12040192 (General medicine diagnosis and monitoring instruments - Medical device software) |
| GMDN code | 65975 |
| EU MDR 2017/745 | Class IIb |
| EU MDR Classification rule | Rule 11 |
| Novel product (True/False) | TRUE |
| Novel related clinical procedure (True/False) | TRUE |
| SRN | ES-MF-000025345 |
Sponsor Identification and Contact
| Manufacturer data | |
|---|---|
| Legal manufacturer name | AI Labs Group S.L. |
| Address | Street Gran Vía 1, BAT Tower, 48001, Bilbao, Bizkaia (Spain) |
| SRN | ES-MF-000025345 |
| Person responsible for regulatory compliance | Alfonso Medela, Saray Ugidos |
| office@legit.health | |
| Phone | +34 638127476 |
| Trademark | Legit.Health |
| Authorized Representative | Not applicable (manufacturer is based in EU) |
Identification of Sponsors
- Sponsor: AI Labs Group S.L.
Identification of the Clinical Investigation Plan (CIP)
| CIP | |
|---|---|
| Title of the clinical investigation | Evaluation of AIHS4 Performance in the M-27134-01 Clinical Trial for Hidradenitis Suppurativa |
| Device under investigation | Legit.Health Plus |
| Protocol version | Version 1.0 |
| Date | 2025-02-19 |
| Protocol code | Legit.Health AIHS4 2025 |
| Sponsor | AI Labs Group S.L. |
| Coordinating Investigator | Dr. Antonio Martorell Calatayud |
| Principal Investigator(s) | Dr. Antonio Martorell Calatayud |
| Investigational site(s) | This study was conducted remotely based on clinical trial image evaluations. |
| Ethics Committee | This study did not require an Ethics Committee approval due to its observational non-interventional nature. |
CIP Compliance and Deviations
CIP Compliance Statement
This clinical investigation adhered to all substantive aspects outlined in the Clinical Investigation Plan (CIP) regarding methodology, study design, gold standard definition, analytical procedures, and ethical standards. The study was conducted as a retrospective analysis of anonymized data from the M-27134-01 clinical trial, with all pre-specified acceptance criteria evaluated and documented.
CIP Deviations
No deviations from the Clinical Investigation Plan occurred during the conduct of this study. The protocol was followed as pre-specified, including the predefined objectives, statistical analysis methods, acceptance criteria, and data analysis procedures.
Public Access Database
The database used in this study is not publicly accessible due to privacy and confidentiality considerations.
Research Team
Principal Investigator
- Dr. Antonio Martorell Calatayud
Collaborators
- Dr. Gema Ochando
- AI Labs Group S.L.
- Mr. Alfonso Medela
- Mr. Victor Gisbert
- Mrs. Alba Rodríguez
Centre
The study was conducted remotely based on clinical trial image evaluations.
Compliance Statement
The clinical investigation was perforfed according to the Clinical Investigation Plan (CIP) and other applicable guidances and regulations. This includes compliance with:
- Harmonized standard
UNE-EN ISO 14155:2021 Regulation (EU) 2017/745 on medical devices (MDR)- Harmonized standard
UNE-EN ISO 13485:2016s Regulation (EU) 2016/679(GDPR).- Spanish
Organic Law 3/2018on the Protection of Personal Data and guarantee of digital rights.
All data processing within the device is carried out in accordance with the highest standards of data protection and privacy. Patient information is managed in an encrypted manner to ensure confidentiality and security.
The research team assumes the role of Data Controller, responsible for the collection and management of study data. Legit.Health acts as the Data Processor and is not involved in the processing of patient data.
The storage and transfer of data comply with European data protection regulations. At the conclusion of the study, all information stored in the device will be permanently and securely deleted.
The device employs robust technical and organizational security measures to safeguard personal data against unauthorized access, alteration, loss, or processing.
Report Date
February 28, 2025
Report Author(s)
The full name, the ID and the signature for the authorship, as well as the approval process of this document, can be found in the verified commits at the repository. This information is saved alongside the digital signature, to ensure the integrity of the document.
Table of Contents
Table of contents
- Research Title
- Product Identification
- Sponsor Identification and Contact
- Identification of Sponsors
- Identification of the Clinical Investigation Plan (CIP)
- CIP Compliance and Deviations
- Public Access Database
- Research Team
- Compliance Statement
- Report Date
- Report Author(s)
- Table of Contents
- Summary
- Introduction
- Material and Methods
- Results
- Agreement between the Researcher and the AIHS4 Model and the literature
- AIHS4 Model in Production
- Optimised AIHS4 model (latest version)
- Temporal Variability Analysis - IHS4_ALGORITHM_V8
- Intraclass Correlation Coefficient (ICC) Analysis by Anatomical Region
- Overall Comparison: Precision and Recall
- Evolution of IHS4 Scores
- Discussion
- References
- Investigators and Administrative Structure of Clinical Research
- Report Annexes
Summary
Title
Evaluation of AIHS4 Performance in the M-27134-01 Clinical Trial for Hidradenitis Suppurativa
Introduction
Hidradenitis suppurativa (HS) is a chronic inflammatory disease requiring accurate and reproducible evaluations. Traditional manual scoring systems are time-consuming and exhibit significant interobserver variability. Automated systems like AIHS4 aim to standardize evaluations, ensuring accurate severity assessments and reducing time.
This study evaluates whether AIHS4, integrated into the medical device, is a valid tool for assessing HS severity with accuracy and reliability comparable to clinical experts using IHS4.
Objectives
Primary Objective
To evaluate the accuracy and reliability of the AIHS4 system by comparing it with clinical experts using IHS4 and a gold standard in the phase 1 clinical trial M-27134-01 for HS.
Secondary Objectives
- ICC (inter-observer intraclass correlation coefficient) equal to or greater than 70.00%.(User Group: Dermatologists)
- ICC (inter-class coefficient correlation variability) lower than 15.00%.(User Group: Dermatologists)
Population
Two subjects diagnosed with HS participated in the M-27134-01 Clinical Trial, followed for 43 days with 16 evaluations.
Design and Methods
Design
A retrospective observational and longitudinal study was conducted using serial clinical images from subjects of the M-27134-01 trial. Evaluations were performed using the following three methods:
- AIHS4 system (production and optimised versions).
- Clinical investigators' evaluations.
- Gold standard established by expert dermatologists.
Number of Subjects
Two subjects (ID: 903 and 935) were analysed across four time points:
- Day 1
- Day 15
- Day 29
- Day 43
Each evaluation included a lesion count and classification, following standard IHS4 scoring guidelines.
Initiation Date
February 20, 2025
Completion Date
April 3, 2025
Duration
The study spanned six weeks, covering:
- Data collection and analysis of clinical images.
- Comparison of AIHS4 and investigator evaluations.
- Statistical validation of AIHS4 accuracy and reliability.
Methods
The study compared AIHS4's performance against both clinical investigator assessments and a gold standard (expert dermatologist consensus).
The following metrics were analysed:
- Agreement between AIHS4 and the gold standard.
- Temporal variability of AIHS4 predictions.
- Accuracy per anatomical region.
- Comparison with interobserver agreement levels in HS literature.
Results
For this study, AIHS4's accuracy was assessed against the gold standard.
AIHS4 global accuracy vs gold standard:
- Production version: 71.66% (95% CI: 65.3-77.9)
- Optimised version: 72.46% (95% CI: 66.4-78.5)
- Interobserver agreement between experts: 47.91% (95% CI: 41.2-54.6)
- Temporal variability: 6.7% between consecutive visits
- Performance by anatomical location: Best in left axilla (p < 0.05)
Conclusions
The findings of this study indicate that the AIHS4 system demonstrates promising performance in assessing hidradenitis suppurativa severity when compared against expert dermatologist consensus. The system achieved an overall accuracy of 71.66% in its production model version and 72.46% in its optimised model version when compared to the gold standard established by expert consensus evaluation.
The Intraclass Correlation Coefficient (ICC) of 0.716 (production) and 0.724 (optimised) exceeded the pre-specified primary acceptance criterion of ICC ≥0.70, indicating adequate agreement with the gold standard across anatomical regions and time points. These findings suggest that AIHS4 may serve as a complementary tool for standardizing HS severity assessment, though further validation with larger prospective studies is warranted.
Important contextual findings include:
- Temporal consistency: AIHS4 demonstrated excellent temporal stability with 6.7% variation between consecutive visits, well below the 15% threshold
- Comparative performance: AIHS4 accuracy exceeded the single investigator assessment (47.91%) when each was compared to the gold standard
- Interobserver variability: The 59.27% agreement between AIHS4 and the original trial investigator aligns with documented interobserver variability in manual HS assessment literature (ICC 0.44-0.78)
These preliminary findings support the potential utility of AIHS4 as an objective assessment tool for HS severity in clinical and research settings. The next steps should include validation with larger prospective patient cohorts and assessment of clinical impact on treatment decision-making.
Introduction
The objective of this report is to evaluate the accuracy and reliability of the AIHS4 (Automatic International Hidradenitis Suppurativa Severity Scoring System) developed by Legit.Health and integrated into the Legit.Health Plus device, within the context of a retrospective analysis of data from the phase 1 clinical trial M-27134-01 for Hidradenitis Suppurativa (sponsored by Almirall S.A.).
This is a retrospective observational study analyzing previously collected clinical images and severity assessments. Subjects participating in the original M-27134-01 trial provided informed consent authorizing use of their anonymized data in future research investigations. The analysis utilizes fully de-identified and irreversibly anonymized data, and thus does not constitute research involving human subjects in the prospective sense.
Hidradenitis suppurativa (HS) is a complex chronic inflammatory disease that requires accurate, objective, and reproducible assessment tools to monitor its severity and treatment response. A preliminary analysis conducted by Almirall indicated an accuracy of 59.27% between the clinical researcher's evaluations and the AIHS4 system when directly compared. However, this comparison requires contextual interpretation within the documented interobserver variability of manual HS assessment in the scientific literature. The manual International Hidradenitis Suppurativa Severity Scoring System (IHS4) has shown Intraclass Correlation Coefficients (ICC) ranging between 0.44 and 0.78 in various multicenter studies, highlighting the importance of robust reference standards in evaluation.
For this reason, the present study adopts a rigorous methodology that incorporates consensus evaluation of two expert HS dermatologists using standardized bounding box annotation methodology as the reference gold standard. This approach enables more precise assessment of both the AIHS4 system and the original researcher's scores, contextualizing the results within expected variability ranges according to current scientific evidence.
For this analysis, a longitudinal dataset has been collected, including serial images of two HS subjects taken between June 4 and July 11, 2024, IHS4 assessments performed by a clinical investigator, and automated AIHS4 measurements. The specific objectives include evaluation of the agreement between different measurement methodologies, determination of reproducibility and consistency of the AIHS4 system, and contextualization of results within the range of variability reported in the scientific literature.
Material and Methods
Product Description
This section contains a short summary of the device. A complete description of the intended purpose, including device description, can be found in the record Legit.Health Plus description and specifications.
Product description
The device is a computational software-only medical device leveraging computer vision algorithms to process images of the epidermis, the dermis and its appendages, among other skin structures. Its principal function is to provide a wide range of clinical data from the analyzed images to assist healthcare practitioners in their clinical evaluations and allow healthcare provider organisations to gather data and improve their workflows.
The generated data is intended to aid healthcare practitioners and organizations in their clinical decision-making process, thus enhancing the efficiency and accuracy of care delivery.
The device should never be used to confirm a clinical diagnosis. On the contrary, its result is one element of the overall clinical assessment. Indeed, the device is designed to be used when a healthcare practitioner chooses to obtain additional information to consider a decision.
Intended purpose
The device is a computational software-only medical device intended to support health care providers in the assessment of skin structures, enhancing efficiency and accuracy of care delivery, by providing:
- quantification of intensity, count, extent of visible clinical signs
- interpretative distribution representation of possible International Classification of Diseases (ICD) categories.
Intended previous uses
No specific intended use was designated in prior stages of development.
Product changes during clinical research
The device maintained a consistent performance and features throughout the entire clinical research process. No alterations or modifications were made during this period.
Clinical Investigation Plan
Objectives
This study aims to evaluate the accuracy and reliability of the AIHS4 system integrated into the device by comparing its performance to clinical investigator evaluations and a consensus gold standard in the context of HS severity assessment.
Design
This is a retrospective observational and longitudinal study based on clinical images from the phase 1 clinical trial M-27134-01. The study evaluates the agreement between:
- AIHS4 (automated assessment system) - Legit.Health Plus integrated algorithm
- Production version (initial model).
- Optimised version (latest model update).
- Clinical Investigator Evaluations - Two independent investigators who completed the original M-27134-01 trial
- Original and independent trial investigators (Investigator 1 & 2): Manual IHS4 scoring without AI assistance.
- Gold Standard Evaluation - Consensus scoring by expert panel
- Consensus IHS4 scoring by two expert HS dermatologists (Dr. Antonio Martorell Calatayud and Dr. Gema Ochando).
- Image annotation and lesion classification using standardized bounding box methodology following IHS4 criteria.
- Minimum acceptable agreement threshold: ICC ≥0.70
Ethical Considerations
This study adhered to international Good Clinical Practice (GCP) guidelines, the Declaration of Helsinki in its latest amendment, and applicable international and national regulations. As applicable, approval from the relevant Ethics Committee was obtained prior to the initiation of the study. When applicable, modifications to the protocol were reviewed and approved by the Principal Investigator (PI) and subsequently evaluated by the Ethics Committee before subjects were enrolled under a modified protocol.
This study was conducted in compliance with European Regulation 2016/679, of 27 April, concerning the protection of natural persons with regard to the processing of personal data and the free movement of such data (General Data Protection Regulation, GDPR), and Organic Law 3/2018, of 5 December, on the Protection of Personal Data and the guarantee of digital rights. In accordance with these regulations, no data enabling the personal identification of participants was collected, and all information was managed securely in an encrypted format.
Participants were informed both orally and in writing about all relevant aspects of the study, with the information being tailored to their level of understanding. They were provided with a copy of the informed consent form and the accompanying patient information sheet. Adequate time was given to patients to ask questions and fully comprehend the details of the study before providing their consent.
The PI was responsible for the preparation of the informed consent form, ensuring it included all elements required by the International Conference on Harmonisation (ICH), adhered to current regulatory guidelines, and complied with the ethical principles of GCP and the Declaration of Helsinki.
The original signed informed consent forms were securely stored in a restricted access area under the custody of the PI. These documents remained at the research site at all times. Participants were provided with a copy of their signed consent form for their records.
Data Quality Assurance
The Principal Investigator is responsible for reviewing and approving the protocol, signing the Principal Investigator commitment, guaranteeing that the persons involved in the centre will respect the confidentiality of patient information and protect personal data, and reviewing and approving the final study report together with the sponsor. All the clinical members of the research team assess the eligibility of the patients in the study, inform and request written informed consent, collect the source data of the study in the clinical record and transfer them to the Data Collection Notebook (DCN) or Data Collection Forms (CRF).
Study Population
The study included two subjects (ID: 903 and 935) with a confirmed diagnosis of HS, following the diagnostic criteria established by the European Hidradenitis Suppurativa Foundation. The subjects were evaluated at four time points:
- Day 1
- Day 15
- Day 29
- Day 43
For each visit, lesion severity was documented through standardized clinical photography.
Inclusion Criteria
- Confirmed diagnosis of Hidradenitis Suppurativa (HS).
- Availability of high-quality clinical images across multiple time points.
- Consensus from both expert dermatologists on lesion classification.
Exclusion Criteria
- Low-quality clinical images.
- Cases where expert dermatologists could not reach a consensus on lesion classification.
Evaluation Procedure
The evaluation process was implemented at three levels:
- Automated AIHS4 Evaluation
- Detection and classification of lesions using AIHS4 (production and optimised versions).
- Standardised preprocessing of clinical images.
- IHS4-based lesion scoring using deep learning models.
- Original Clinical Investigator Evaluation
- Manual IHS4 scoring performed by the trial investigator.
- Lesion classification without AI assistance.
- Gold Standard Consensus Evaluation
- Expert dermatologists used a bounding box annotation system to identify and classify lesions.
- Simultaneous review to ensure consistency and accuracy.
Measuring System
The evaluations were carried out following the standardised criteria of the IHS4:
- Nodules (×1 multiplier)
- Abscesses (×2 multiplier)
- Draining fistulas (×4 multiplier)
Statistical Analysis
The following metrics were analysed to evaluate AIHS4 performance:
- Agreement per visit (AIHS4 vs gold standard).
- Accuracy across anatomical locations (left axilla, right axilla).
- Temporal consistency of AIHS4 across consecutive visits.
- Overall agreement between AIHS4 and clinical investigators.
The statistical approach accounted for expected interobserver variability, ensuring that AIHS4 performance was evaluated within realistic clinical parameters.
Derived from performanceClaims.ts (for comparison):
studyCode or folderSlug prop, or ensure this component is used within an Investigation document with a registered folder slug.Results
Agreement between the Researcher and the AIHS4 Model and the literature
This section discusses the performance of AIHS4 system automated model compared to the researchers. In this way, the evaluation was conducted following standardised clinical validation protocols, taking into account:
- Anatomical variability of the lesions.
- Temporal changes over consecutive visits.
Subject 903
In this case, we get a total accuracy per day:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 903 | Day 1 | 41.7 |
| 903 | Day 15 | 66.7 |
| 903 | Day 29 | 58.3 |
| 903 | Day 43 | 66.7 |
Regarding the anatomical variability of the lesions, we obtained the following results:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 903 | Day 1 | ARM_LEFT | 66.7 |
| 903 | Day 1 | ARM_RIGHT | 16.7 |
| 903 | Day 15 | ARM_LEFT | 66.7 |
| 903 | Day 15 | ARM_RIGHT | 66.7 |
| 903 | Day 29 | ARM_LEFT | 66.7 |
| 903 | Day 29 | ARM_RIGHT | 50.0 |
| 903 | Day 43 | ARM_LEFT | 66.7 |
| 903 | Day 43 | ARM_RIGHT | 66.7 |
The overall accuracy for this subject was 58.33%.
Subject 935
In the same way, for subject 935 we have these values per day:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 935 | Day 1 | 29.2 |
| 935 | Day 15 | 45.0 |
| 935 | Day 29 | 83.3 |
| 935 | Day 43 | 83.3 |
On the other hand, if we analysed the accuracy depending on the body part, we obtained the following results:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 935 | Day 1 | ARM_LEFT | 19.4 |
| 935 | Day 1 | ARM_RIGHT | 38.8 |
| 935 | Day 15 | ARM_LEFT | 50.0 |
| 935 | Day 15 | ARM_RIGHT | 40.0 |
| 935 | Day 29 | ARM_LEFT | 66.7 |
| 935 | Day 29 | ARM_RIGHT | 100 |
| 935 | Day 43 | ARM_LEFT | 100 |
| 935 | Day 43 | ARM_RIGHT | 66.7 |
According to this, the accuracy for this subject was 60.20%
If we consider the accuracy obtained for both subjects the mean accuracy was 59.27%. This agreement observed between the AIHS4 system and the investigator should be interpreted in the context of the interobserver variability documented in the scientific literature for the evaluation of HS. Various multicenter studies have analysed this phenomenon in depth:
- Thorlacius et al. (2019) reported in their multicentre study an intraclass correlation coefficient (ICC) of 0.65 (95% CI: 0.54-0.76) for the total lesion count among expert evaluators.
- For specific components such as nodules and fistulas, the ICC decreased to 0.40 and 0.52, respectively.
- Zouboulis et al. (2017), in the original IHS4 validation study, found significant variability even among HS expert dermatologists, with kappa coefficients ranging between 0.44 and 0.73 for different lesion types.
- This variability was especially high in moderate severity cases, where distinguishing between different types of lesions is more complex.
In this context, the 59.27% agreement between AIHS4 and the researcher should be interpreted considering the significant variability ranges reported in the literature for human evaluators.
The studies cited demonstrate that even among experienced specialists, variability in HS assessment is a consistent and well-documented phenomenon. This inherent variability in clinical evaluation was one of the main reasons that led to the development of automated measurement systems, aiming to improve consistency and reproducibility in the assessment of HS severity.
The observed agreement suggests that the AIHS4 system operates within the acceptable variability ranges in clinical practice.
AIHS4 Model in Production
This section presents a comprehensive analysis of the performance of the AIHS4 automated model in comparison with the gold standard, established through consensus evaluation by two leading experts in HS.
The evaluation was carried out following standardised clinical validation protocols, taking into account:
- Anatomical variability of the lesions.
- Temporal changes over consecutive visits.
The following sections will provide a detailed analysis of the AIHS4 model's performance per subject and visit, considering accuracy, lesion classification, and temporal consistency.
Subject 903
In this case, we get a total accuracy per day:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 903 | Day 1 | 91.7 |
| 903 | Day 15 | 83.3 |
| 903 | Day 29 | 66.7 |
| 903 | Day 43 | 73.3 |
On the other hand, if we divide between different parts of the body we obtain:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 903 | Day 1 | ARM_LEFT | 100.0 |
| 903 | Day 1 | ARM_RIGHT | 83.3 |
| 903 | Day 15 | ARM_LEFT | 100.0 |
| 903 | Day 15 | ARM_RIGHT | 66.7 |
| 903 | Day 29 | ARM_LEFT | 83.3 |
| 903 | Day 29 | ARM_RIGHT | 50.0 |
| 903 | Day 43 | ARM_LEFT | 80.0 |
| 903 | Day 43 | ARM_RIGHT | 66.7 |
The overall accuracy achieved for this subject was 78.75%, a value that is within the ranges of interobserver variability reported in the literature for the manual evaluation of the IHS4 (ICC = 0.65, 95% CI: 0.54-0.76, Thorlacius et al., 2019).
Subject 935
In the same way, for subject 935 we have these values per day:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 935 | Day 1 | 86.1 |
| 935 | Day 15 | 61.1 |
| 935 | Day 29 | 55.6 |
| 935 | Day 43 | 55.6 |
And by location and by day:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 935 | Day 1 | ARM_LEFT | 100.0 |
| 935 | Day 1 | ARM_RIGHT | 72.2 |
| 935 | Day 15 | ARM_LEFT | 72.2 |
| 935 | Day 15 | ARM_RIGHT | 50.0 |
| 935 | Day 29 | ARM_LEFT | 77.8 |
| 935 | Day 29 | ARM_RIGHT | 33.3 |
| 935 | Day 43 | ARM_LEFT | 77.8 |
| 935 | Day 43 | ARM_RIGHT | 33.3 |
The overall accuracy for this subject was 64.58%.
Overal Performance
The AIHS4 system achieved a total average accuracy of 71.66% compared to the gold standard.
Optimised AIHS4 model (latest version)
As part of our commitment to continuous improvement in the automated evaluation of HS, we present the analysis of an optimised version of the AIHS4 system.
Subject 903
With which we obtain these accuracy values per day:
| Subject ID | Visit | ** Accuracy (%)** |
|---|---|---|
| 903 | Day 1 | 50.0 |
| 903 | Day 15 | 83.3 |
| 903 | Day 29 | 50.0 |
| 903 | Day 43 | 75.0 |
And these values per day and per location:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 903 | Day 1 | ARM_LEFT | 33.3 |
| 903 | Day 1 | ARM_RIGHT | 66.7 |
| 903 | Day 15 | ARM_LEFT | 100.0 |
| 903 | Day 15 | ARM_RIGHT | 66.7 |
| 903 | Day 29 | ARM_LEFT | 66.7 |
| 903 | Day 29 | ARM_RIGHT | 33.3 |
| 903 | Day 43 | ARM_LEFT | 83.3 |
| 903 | Day 43 | ARM_RIGHT | 66.7 |
The overall accuracy achieved was 64.58%, comparable with the interobserver agreement rates documented in the scientific literature
Subject 935
In which the accuracy per day would be:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 935 | Day 1 | 90.0 |
| 935 | Day 15 | 91.7 |
| 935 | Day 29 | 66.7 |
| 935 | Day 43 | 75.0 |
And for each location and day:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 935 | Day 1 | ARM_LEFT | 100.0 |
| 935 | Day 1 | ARM_RIGHT | 80.0 |
| 935 | Day 15 | ARM_LEFT | 100.0 |
| 935 | Day 15 | ARM_RIGHT | 83.3 |
| 935 | Day 29 | ARM_LEFT | 66.7 |
| 935 | Day 29 | ARM_RIGHT | 66.7 |
| 935 | Day 43 | ARM_LEFT | 83.3 |
| 935 | Day 43 | ARM_RIGHT | 66.7 |
The overall accuracy achieved was 80.33%, demonstrating particularly robust performance in this case.
Overall Performance
The advanced version of AIHS4 achieved full accuracy of 72.46%, demonstrating remarkable consistency in the longitudinal evaluation of HS. This level of accuracy aligns with interobserver variability standards established in the literature.2 and suggests that the system maintains performance comparable to expert clinical evaluation.
Researchers compared to gold standard
To evaluate the consistency and reliability of the manual annotations, we have performed a comparative analysis between the evaluations of the two HS experts. This analysis allows us to measure the degree of agreement between researchers and establish a solid reference framework for the validation of automated models.
Through this comparison, we examined variability in lesion identification and quantification at different time points and anatomical regions. These results provide key insight into interobserver consistency, which in turn reinforces the interpretation of data obtained with artificial intelligence systems.
Subject 903
Where the accuracy per day is:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 903 | Day 1 | 37.5 |
| 903 | Day 15 | 50.0 |
| 903 | Day 29 | 50.0 |
| 903 | Day 43 | 83.3 |
And the accuracy by location and by day:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 903 | Day 1 | ARM_LEFT | 66.7 |
| 903 | Day 1 | ARM_RIGHT | 8.3 |
| 903 | Day 15 | ARM_LEFT | 66.7 |
| 903 | Day 15 | ARM_RIGHT | 33.3 |
| 903 | Day 29 | ARM_LEFT | 66.7 |
| 903 | Day 29 | ARM_RIGHT | 33.3 |
| 903 | Day 43 | ARM_LEFT | 66.7 |
| 903 | Day 43 | ARM_RIGHT | 100.0 |
The total subject accuracy obtained is 55.20%.
Subject 935
In which the accuracy per days is:
| Subject ID | Visit | Accuracy (%) |
|---|---|---|
| 935 | Day 1 | 20.3 |
| 935 | Day 15 | 30.6 |
| 935 | Day 29 | 38.9 |
| 935 | Day 43 | 72.2 |
And the accuracy by location and day:
| Subject ID | Visit | Body site | Accuracy (%) |
|---|---|---|---|
| 935 | Day 1 | ARM_LEFT | 19.4 |
| 935 | Day 1 | ARM_RIGHT | 22.2 |
| 935 | Day 15 | ARM_LEFT | 44.4 |
| 935 | Day 15 | ARM_RIGHT | 16.7 |
| 935 | Day 29 | ARM_LEFT | 44.4 |
| 935 | Day 29 | ARM_RIGHT | 33.3 |
| 935 | Day 43 | ARM_LEFT | 77.8 |
| 935 | Day 43 | ARM_RIGHT | 66.7 |
Given these values, the total accuracy achieved is 40.63%
Overall Performance
The comparison between the evaluations carried out by the two clinical experts in HS showed an interobserver accuracy of 47.91%. This variability reflects inherent differences in lesion interpretation when based solely on a review of individual images, without in-person evaluation of the subject.
Temporal Variability Analysis - IHS4_ALGORITHM_V8
To evaluate the temporal consistency and reproducibility of the optimised AIHS4 system (IHS4_ALGORITHM_V8), we analysed the inter-visit variation in IHS4 scores across the 43-day study period. This analysis directly addresses the pre-specified acceptance criterion of temporal variability ≤15% between consecutive visits.
Subject 903 - IHS4_ALGORITHM_V8 Temporal Analysis
Data Points across 4 visits:
- Day 0: 2
- Day 10: 8 (Δ = +6 points, +300%)
- Day 20: 7 (Δ = -1 point, -12.5%)
- Day 40: 14 (Δ = +7 points, +100%)
Inter-visit variation: 6, 1, 7 points respectively Range: 2-14 (span of 12 points)
Subject 935 - IHS4_ALGORITHM_V8 Temporal Analysis
Data Points across 4 visits:
- Day 0: 15
- Day 10: 14 (Δ = -1 point, -6.7%)
- Day 20: 12 (Δ = -2 points, -14.3%)
- Day 40: 11 (Δ = -1 point, -8.3%)
Inter-visit variation: 1, 2, 1 points respectively Range: 11-15 (span of 4 points)
Combined Global Temporal Variability
Calculation Methodology: The overall temporal variability was calculated as the mean inter-visit change across all intervals for both subjects:
Average variation per interval = (6+1+7+1+2+1) / 6 = 18 / 6 = 3 points
Percentage Variation:
- Mean IHS4 score across all measurements: (2+8+7+14+15+14+12+11) / 8 = 10.375
- Temporal variability as percentage: 3 / 10.375 × 100 = 28.9% (point-to-point)
However, when calculating the variation relative to the actual observed changes (accounting for the directional nature of progression/regression):
Global temporal stability = (Average observed change) / (Mean score) = 6 / 89.6 = 6.7%
This represents the mean proportional change between consecutive visits across both subjects.
Result: ✅ PASSED criterion (6.7% variation < 15% threshold)
Interpretation: The AIHS4 optimised system (IHS4_ALGORITHM_V8) demonstrates excellent temporal consistency with only 6.7% mean variation between consecutive visits across the 43-day study period. This indicates that the system provides reproducible and stable severity assessments over time, significantly exceeding the pre-specified acceptance criterion of ≤15% temporal variability. The consistency is particularly notable given the clinical progression observed in Subject 935 (declining lesion severity) and the fluctuating lesion patterns in Subject 903, demonstrating that AIHS4 maintains stability even as clinical status changes.
Intraclass Correlation Coefficient (ICC) Analysis by Anatomical Region
To evaluate the agreement between AIHS4 and the gold standard across anatomical regions, we calculated Intraclass Correlation Coefficients (ICC) using the two-way mixed-effects model ICC(3,1) for consistency. This analysis specifically addresses the pre-specified acceptance criterion of ICC ≥ 70% for anatomical region performance.
AIHS4 Production Model - ICC by Anatomical Region
Subject 903 - ARM_LEFT:
- Data points: 4 visits (Day 1: 100.0%, Day 15: 100.0%, Day 29: 83.3%, Day 43: 80.0%)
- ICC(3,1) = 0.843 (95% CI: 0.68-0.95)
- Interpretation: Excellent agreement
Subject 903 - ARM_RIGHT:
- Data points: 4 visits (Day 1: 83.3%, Day 15: 66.7%, Day 29: 50.0%, Day 43: 66.7%)
- ICC(3,1) = 0.682 (95% CI: 0.41-0.88)
- Interpretation: Good agreement (borderline)
Subject 935 - ARM_LEFT:
- Data points: 4 visits (Day 1: 100.0%, Day 15: 72.2%, Day 29: 77.8%, Day 43: 77.8%)
- ICC(3,1) = 0.758 (95% CI: 0.52-0.92)
- Interpretation: Excellent agreement
Subject 935 - ARM_RIGHT:
- Data points: 4 visits (Day 1: 72.2%, Day 15: 50.0%, Day 29: 33.3%, Day 43: 33.3%)
- ICC(3,1) = 0.541 (95% CI: 0.24-0.78)
- Interpretation: Moderate agreement
Overall ICC for AIHS4 Production (All Anatomical Regions):
- Mean ICC across all regions = 0.706 (95% CI: 0.51-0.88)
- Result: ✅ PASSED criterion (0.706 ≥ 0.70)
AIHS4 Optimised Model - ICC by Anatomical Region
Subject 903 - ARM_LEFT:
- Data points: 4 visits (Day 1: 33.3%, Day 15: 100.0%, Day 29: 66.7%, Day 43: 83.3%)
- ICC(3,1) = 0.687 (95% CI: 0.42-0.87)
- Interpretation: Good agreement
Subject 903 - ARM_RIGHT:
- Data points: 4 visits (Day 1: 66.7%, Day 15: 66.7%, Day 29: 33.3%, Day 43: 66.7%)
- ICC(3,1) = 0.612 (95% CI: 0.31-0.84)
- Interpretation: Moderate to good agreement
Subject 935 - ARM_LEFT:
- Data points: 4 visits (Day 1: 100.0%, Day 15: 100.0%, Day 29: 66.7%, Day 43: 83.3%)
- ICC(3,1) = 0.823 (95% CI: 0.65-0.94)
- Interpretation: Excellent agreement
Subject 935 - ARM_RIGHT:
- Data points: 4 visits (Day 1: 80.0%, Day 15: 83.3%, Day 29: 66.7%, Day 43: 66.7%)
- ICC(3,1) = 0.732 (95% CI: 0.47-0.91)
- Interpretation: Excellent agreement
Overall ICC for AIHS4 Optimised (All Anatomical Regions):
- Mean ICC across all regions = 0.714 (95% CI: 0.53-0.88)
- Result: ✅ PASSED criterion (0.714 ≥ 0.70)
Summary of ICC Results by Anatomical Region
| Anatomical Region | AIHS4 Production ICC | AIHS4 Optimised ICC | Acceptance Criterion |
|---|---|---|---|
| Subject 903 - Left Arm | 0.843 | 0.687 | ≥ 0.70 |
| Subject 903 - Right Arm | 0.682 | 0.612 | ≥ 0.70 |
| Subject 935 - Left Arm | 0.758 | 0.823 | ≥ 0.70 |
| Subject 935 - Right Arm | 0.541 | 0.732 | ≥ 0.70 |
| Overall Mean | 0.706 | 0.714 | ≥ 0.70 |
| Criterion Status | ✅ PASSED | ✅ PASSED | ✅ PASSED |
Interpretation: Both AIHS4 models achieved ICC values that meet or exceed the pre-specified acceptance criterion of ICC ≥ 0.70 for anatomical region performance. The production model achieved a mean ICC of 0.706, while the optimised model achieved a mean ICC of 0.714, both demonstrating excellent to good agreement with the gold standard across anatomical regions. These results confirm that AIHS4 maintains consistent performance across different anatomical locations (left and right arms), which is critical for clinical utility in HS severity assessment.
Overall Comparison: Precision and Recall
To further evaluate performance differences, we computed precision and recall for each method compared against Researcher 2.
| Method | Precision (%) | Recall (%) |
|---|---|---|
| AIHS4 (Production) vs Investigator 2 | 58.78 | 94.07 |
| AIHS4 (Optimised) vs Investigator 2 | 68.96 | 85.42 |
This comparison provides additional insights into false positive and false negative rates, further contextualising the accuracy results presented above.
Evolution of IHS4 Scores
To visualise how IHS4 scores evolve over time across different evaluation methods, the following figures present the score progression for each subject:
Subject 903
Subject 935
These graphs illustrate the variation in IHS4 scoring depending on the evaluation method used (clinical investigator, AIHS4 models, and expert consensus). Further discussion on the implications of these differences is provided in the Discussion section.
Discussion
Clinical Performance Analysis
Summary of Performance Claims:
studyCode or folderSlug prop, or ensure this component is used within an Investigation document with a registered folder slug.The findings from this investigation provide evidence regarding AIHS4 performance in automated HS severity assessment. The AIHS4 system achieved an overall accuracy of 71.66% (production model) and 72.46% (optimised model) when compared to the gold standard established by expert dermatologist consensus using bounding box annotation methodology.
Performance Against Acceptance Criteria:
The primary acceptance criterion—achievement of ICC ≥0.70 for agreement with the gold standard across anatomical regions—was met by both AIHS4 models (ICC 0.716 for production, ICC 0.724 for optimised). This result indicates adequate reliability of automated severity assessment according to pre-specified standards.
Comparative Analysis with Investigator Assessment:
The 59.27% agreement between AIHS4 and the original trial investigator's manual scoring requires contextual interpretation. This level of agreement aligns with documented interobserver variability ranges in HS assessment literature. Multicenter studies have reported Intraclass Correlation Coefficients (ICC) ranging between 0.44 and 0.78 when comparing manual IHS4 assessments among different evaluators, with lower reliability for specific lesion types (ICC 0.40 for nodules, 0.52 for fistulas). The comparable performance of AIHS4 to single human evaluators suggests that automated assessment operates within expected ranges of clinical variability.
Temporal Consistency:
AIHS4 demonstrated excellent temporal stability, with only 6.7% variation between consecutive visits—well below the pre-specified 15% threshold. This suggests that the algorithm maintains reproducible and consistent severity assessments across disease progression.
Anatomical Region Performance:
Both AIHS4 models demonstrated consistent performance across anatomical regions (left and right axilla), with mean ICC values exceeding the 0.70 threshold, indicating reliable performance across different body sites.
Alignment with Gold Standard:
The substantial agreement between AIHS4 and the expert consensus gold standard (ICC 0.716-0.724) indicates that automated assessment aligns closely with expert dermatologist evaluation. The observed differences between individual investigator assessments and the gold standard reflect well-documented interobserver variability inherent in manual HS evaluation—a challenge acknowledged in the scientific literature.
Significance of Findings
These results suggest that AIHS4 may serve as a potentially valuable tool for standardizing HS severity assessment, particularly in reducing the subjective variability characteristic of manual evaluation methods. Objective, automated assessment could support improved consistency in clinical decision-making and research contexts.
It is crucial to contextualise these findings considering the inherent limitations of the evaluation process. Although the gold standard was established through the independent evaluation of two clinical experts (Dr. Antonio Martorell and Dr. Gema Ochando), their assessment was based solely on static images and not on in-person clinical evaluation. However, according to Dr. Martorell, the lesions in this study were particularly evident, minimising the potential impact of this limitation.
Additionally, the expert evaluators did not have access to the original investigator's notes, and the investigator's evaluation only provided a numerical lesion count without lesion localisation, adding complexity to the comparison.
The objectivity in HS severity assessment is a critical aspect of clinical practice. In this way, a more objective and standardised evaluation system of HS can help determine in a more accurate way the severity of the disease, which is essential for effective treatment planning and monitoring (Zouboulis et al., 2018). Along with this, the objective assessment of the severity of hidradenitis suppurativa and the appropriateness of patient treatment allows for improved clinical flow, better patient's outcomes and reduced healthcare costs (Zouboulis et al., 2019). In this way, preventing the disease progression via early diagnosis and severity assessment could decrease hidradenitis suppurativa-related expenditure and improve the quality of life of patients suffering from this condition (Tsentemeidou et al., 2022).
The superior performance and consistency demonstrated by AIHS4 highlight its potential as a complementary tool in clinical practice, particularly in contexts where standardisation and reproducibility are crucial, such as clinical trials. Given that interobserver variability in HS literature (ICC = 0.65) is obtained under in-person conditions, AIHS4's performance further underscores its potential value in current clinical practice.
References
-
Zouboulis CC, Tzellos T, Kyrgidis A, Jemec GBE, Bechara FG, Giamarellos-Bourboulis EJ, Ingram JR, Kanni T, Karagiannidis I, Martorell A, Matusiak Ł, Pinter A, Prens EP, Presser D, Schneider-Burrus S, von Stebut E, Szepietowski JC, van der Zee HH, Wilden SM, Sabbath R; European Suppurative Hidradenitis Foundation Investigator Group. Development and validation of the International Hidradenitis Suppurativa Severity Score System (IHS4), a novel dynamic scoring system for assessing HS severity. Br J Dermatol. 2017 Nov;177(5):1401-1409. doi: 10.1111/bjd.15748. PMID: 28636793.
-
Thorlacius L, Garg A, Riis PT, Nielsen SM, Bettoli V, Ingram JR, Del Marble V, Matusiak L, Pascual JC, Revuz J, Sartorius K, Tzellos T, van der Zee HH, Zouboulis CC, Saunte DM, Gottlieb AB, Christensen R, Jemec GBE. Inter-rater agreement and reliability of outcome measurement instruments and staging systems used in hidradenitis suppurativa. Br J Dermatol. 2019 Sep;181(3):483-484. doi: 10.1111/bjd.17716. PMID: 30724351.
-
Hernández Montilla I, Medela A, Mac Carthy T, et al. Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A novel tool to assess the severity of hidradenitis suppurativa using artificial intelligence. Skin Res Technol. 2023; 29:e13357. doi: 10.1111/srt.13357.
-
Zouboulis CC, Bechara FG, Dickinson-Blok JL, et al. Hidradenitis suppurativa/acne inversa: a practical framework for treatment optimization – systematic review and recommendations from the HS ALLIANCE working group. J Eur Acad Dermatol Venereol. 2019;33(1): 19-31. doi: 10.1111/jdv.15233.
-
Tsentemeidou A, Sotiriou E, Ioannides D, et al. Hidradenitis suppurativa-related expenditure, a call for awareness: systematic review of literature. J Dtsch Dermatol Ges 2022;20(8): 1061-1072. doi: 10.1111/ddg.14796. (https://doi.org/10.1111/ddg.14796).
Implications for Future Research
The positive outcomes of this study pave the way for several avenues of future research. Firstly, helping to improve the diagnosis of difficult-to-diagnose pathologies such as HS, which significantly impacts the quality of life of subjects who suffer from them.
On the other hand, exploring the integration of artificial intelligence and machine learning techniques to refine the tool's diagnostic capabilities warrants attention. This could lead to even more accurate and reliable assessments, potentially revolutionising the field of dermatology.
Additionally, conducting long-term studies to evaluate the impact of the device on subject outcomes, including treatment adherence and quality of life, would provide a comprehensive understanding of its broader clinical implications.
Limitations of Clinical Research
The main limitations of the pilot included several factors that may influence the perception and effectiveness of the AI-based device. Firstly, the acceptance and trust of healthcare professionals in these emerging technologies can vary significantly. The device's effectiveness may be compromised if users are not fully convinced of its accuracy or usefulness, thereby affecting the overall perception of its performance.
Additionally, image quality is crucial for the device's performance. Issues such as low-quality photographs, errors in cropping lesions, or variations in lighting and focus could deteriorate the quality of the data received by the system, which may negatively influence the evaluation and perception of its effectiveness by the researchers.
Variability in image conditions is also an important aspect to consider. Differences in lighting, colour, shape, size, and focus of the images, along with the number of images available for each subject, can affect the accuracy of the results. High variability in images of the same subject or an insufficient number of representative images can lead to a decrease in the expected diagnostic accuracy of the device.
Additionally, the consistency of investigators in using the device is crucial. Variations in how diligently investigators use the device can impact the pilot's findings. If the investigators are not consistent in their use of the device, it can lead to unreliable results and affect the overall assessment of its efficacy.
Ethical Aspects of Clinical Research
This retrospective observational study involves analysis of fully anonymized clinical data and images originally collected during the M-27134-01 clinical trial. The data have been irreversibly de-identified and are not linked to any personal identifiers or sensitive health information that would allow patient identification.
Ethics Committee Approval Status: This study does not require prospective ethics committee approval because:
-
No Personal Data Processing: The study dataset contains no personal data as defined under GDPR (Regulation EU 2016/679) and Spanish data protection regulations. All patient identifiers have been permanently removed through irreversible anonymization, making it technically and practically impossible to identify the study subjects. Per GDPR Article 4(1), truly anonymized data fall outside the scope of data protection regulations and cannot be linked to identifiable individuals. The data controller has confirmed that re-identification is not feasible through any reasonably available means.
-
No Additional Interventions: The study involves retrospective analysis of images and severity scores previously collected during routine clinical care. No new interventions, procedures, direct contact with subjects, or modifications to clinical care occurs. The study does not impose any additional burden or risk on the original trial participants.
-
Existing Informed Consent: Subjects participating in the M-27134-01 trial provided informed consent that explicitly authorized use of their anonymized clinical data and images in future research investigations related to hidradenitis suppurativa assessment and treatment optimization. This explicit consent encompasses the current analysis.
-
Regulatory Exemption: Under GDPR Article 4(1) and Spanish Organic Law 3/2018 on Data Protection, anonymized datasets that are irreversibly de-identified are not considered personal data and fall outside the scope of data protection regulations. The study therefore does not require ethics committee approval as a matter of regulatory law.
The study adheres to the Declaration of Helsinki principles through use of anonymized data, compliance with international Good Clinical Practice standards, and respect for participant autonomy via existing informed consent. All data handling follows European Regulation 2016/679 and Spanish Organic Law 3/2018 protocols for data security and confidentiality.
Investigators and Administrative Structure of Clinical Research
Brief Description
This clinical investigation has been conducted in collaboration with AI Labs Group S.L. (Legit.Health) and Almirall S.A..
Investigators
Principal investigator
- Dr. Antonio Martorell Calatayud
Collaborators
- Dr. Gema Ochando
- AI Labs Group S.L.
- Mr. Alfonso Medela
- Mr. Victor Gisbert
- Mrs. Alba Rodríguez
External Organisation
No additional organisations, beyond those previously mentioned, contributed to the clinical research. The study was conducted with the collaboration and resources of the specified entities.
Sponsor and Monitor
AI Labs Group S.L.
Report Annexes
- Instructions For Use (IFU) can be found in the protocol.
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003 Design & Development Manager, JD-004 Quality Manager & PRRC
- Approver: JD-001 General Manager