GP-100 Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP)
Purpose
The purpose of this document is to compile the specific set of steps and procedures to be followed when any kind of disaster occurs. The overarching goals are:
- To maintain an orderly process for business resumption and systems recovery.
- To provide operational continuity and quick recovery for all critical systems impacted by a technology related disaster event.
- To ensure that the disaster recovery plan is properly communicated to all staff, clearly identifying all essential roles and responsibilities.
- To ensure that disaster recovery activities and strategies are continually tested and revised as needed.
Scope
Disasters can be classified as either environmental (natural) or man-made. Natural disasters can occur at any time and without prior warning. Man-made disasters, on the other hand, can be either intentional, such as an act of terrorism, or unintentional, such as an accident caused by a person or by the failure of a man-made structure.
This procedure covers both kinds of disasters as they may affect our installations, computer and communications equipment, personnel, documentation and information.
Definitions and abbreviations
- BCP: Business continuity plan
- Disaster is an event that prevents a workload or system from fulfilling its business objectives in its primary deployed location.
- Disaster recovery is the process of preparing for and recovering from a disaster.
- DRP: Disaster recovery plan
- RTO: Recovery Time Objective
- RPO: Recovery Point Objective
Introduction
Background information on the company
We are manufacturers of software as a medical device that helps healthcare professionals care for their patients through AI applied to dermatology. The medical device currently developed is a clinical decision support tool intended to be used in clinical practice to care for patients with visible skin conditions.
We founded our company in 2020. The founders are Andy Aguilar, Alfonso Medela, Gerardo Fernández and Taig Mac Carthy. The headquarters are in Bilbao, Spain.
The project was created in the context of an increasing need for digital health, offering a solution to the need for fast, efficient and convenient clinical diagnostic support and for constant, objective monitoring of the severity of skin conditions.
Our company is composed of a varying number of employees. The number and roles of employees are defined in the Annex 3 Organisational chart.
Rationale for the BCP and DRP
It is important to have in place and maintain a BCP and DRP to protect our business operations from unexpected disruptions, such as natural disasters, cyberattacks, or other types of emergencies. These plans help minimize downtime, reduce financial losses, and ensure that critical business functions can continue to operate. Moreover, we gain the confidence of our customers, partners, and other stakeholders by being prepared to handle unexpected events.
We are committed to preparedness and to ensuring that business operations can continue even in the face of unexpected events, as we demonstrate with the regular testing and maintenance of the BCP and DRP to ensure their effectiveness.
Risk Assessment
We follow the GP-013 Risk Management procedure to perform the risk identification and management following the ISO 14971 standard.
Due to our remote way of operating, we have identified that our main risks are:
- Natural disasters, such as earthquakes or floods
- Technical failures, such as power outages or loss of network connectivity
- Human actions, such as inadvertent misconfiguration, unauthorized access or modification by outside parties, or cyberattacks
We have established the following mitigation measures:
- The deployment of the medical device uses an elastic-demand design, the device makes constant backups, and we apply state-of-the-art security and software availability techniques. Because the device is accessed through a REST API, when a user sends a request while the device is down, the API returns a specific status code informing of the state of the API, including downtime. This means that the user is automatically aware of downtime, as well as of any other state (see the client-side sketch after this list).
- Disaster recovery in the AWS Cloud, which includes the following advantages:
- Recover quickly from a disaster with reduced complexity
- Simple and repeatable testing to test more easily and more frequently
- Lower management overhead decreases operational burden
- Opportunities to automate decrease chances of error and improve recovery time
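As an illustration of the downtime-awareness behaviour described above, the following minimal client-side sketch (in Python, with a hypothetical endpoint URL) shows how an integrating system can detect downtime from the returned status code or from a failed connection; it is not a normative part of the device.

```python
import requests

API_URL = "https://api.example.com/v1/analyze"  # hypothetical endpoint, for illustration only

def call_device_api(payload: dict):
    """Send a request to the device API and report downtime explicitly."""
    try:
        response = requests.post(API_URL, json=payload, timeout=10)
    except requests.exceptions.ConnectionError:
        # The API could not be reached at all (e.g. network failure or total outage).
        print("API unreachable: treat as downtime and retry later.")
        return None

    if response.status_code == 503:
        # The API is reachable but reports it cannot serve requests (downtime / maintenance).
        print("API reports service unavailable (HTTP 503).")
        return None

    response.raise_for_status()  # surface any other error status explicitly
    return response.json()
```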
AWS Global Cloud Infrastructure
The AWS Global Cloud Infrastructure is designed to enable customers to build highly resilient workload architectures. Each AWS Region is fully isolated and consists of multiple Availability Zones, which are physically isolated partitions of infrastructure. Availability Zones isolate faults that could impact workload resilience, preventing them from impacting other zones in the Region. At the same time, all zones in an AWS Region are interconnected with high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber. All traffic between zones is encrypted, and network performance is sufficient to accomplish synchronous replication between zones. When an application is partitioned across Availability Zones, it is better isolated and protected from issues such as power outages, lightning strikes, tornadoes, hurricanes, and more.
Continuity requirements (maximum interruption times).
It has been identified that, in the event of an incident that may partially or totally affect the provision of our services, the maximum time the services can remain unavailable, or Recovery Time Objective (RTO), and the admissible loss of data, or Recovery Point Objective (RPO), are as follows:
- RTO (Recovery Time Objective): RTO is the maximum acceptable delay between the interruption of service and restoration of service.
- RTO < 60 minutes.
- RPO (Recovery Point Objective): is the maximum acceptable amount of time since the last data recovery point. This objective determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
- RPO - Files (Documents) < 60 minutes.
- RPO - Information in database < 60 minutes.
Additionally, we will provide an uptime service availability of 99.9%.
The continuity process, and specifically the technical procedures for recovery of the IT infrastructure, is tested to verify our ability to recover IT services in a timely manner in the event of an incident in the infrastructure in which they reside.
The objective of verifying the functionality of the technical recovery procedures is to ensure that the personnel who will be responsible for executing them in a contingency situation are familiar with them.
Our personnel will receive specific training regarding their role in the continuity procedure, both in the test plans described in this procedure and in the measures to be taken in case of interruption of the services provided. This training will be extended to service providers, in case their participation in the business continuity plans is required.
Plan Development
Prior to the elaboration of the test plans, a series of measures must be carried out to minimize the impact on the organization and to help minimize incidents when they occur.
All these measures and this procedure will be verified at least annually and preventively, first by the management and subsequently by the parties and departments essential for their optimal implementation.
In any case, the management will plan and communicate to the parties involved their share of responsibility in the procedure and the tasks to be carried out:
- Periodic simulation of data restoration on a temporary server, through the recovery of management and administrative applications, databases and, in general, data essential for the smooth running of our activities.
- Study of the suitability of each worker to replace a colleague according to each role performed.
- Communication to each member of staff of their hypothetical responsibilities in the event of temporary replacement of a colleague. We have this information compiled in the Annex 3 Organisational chart.
According to the risks detected, we focus our tests on the information and operation systems listed in the record R-018-001, called Infrastructure list and control plan, which belongs to the procedure GP-018 Infrastructure and facilities. From this list, we exclude irrelevant components such as laptops and Docker images, and focus on servers and cloud storage, as these are the components that contain the information relevant to the continuity of the business.
DRP activation
- Disaster Identification & Responsibilities: First of all, the disaster is communicated to the JD-001 and JD-003, who are responsible for activating the plan.
- Identify the incident and its impact and investigate to find the triggering event.
- Identify the team members needed for recovery according to the table in Annex 1.
- Communicate the specific recovery roles and determine which recovery strategy will be pursued.
- When necessary, communicate with the interested parties that could be affected.
- Document and track the timelines and next decisions to be made.
Procedures for rapid restoration of IT systems and data
In the event of unexpected IT disruptions or system failures, maintaining the continuity of our operations and safeguarding our data is paramount. Leveraging the suite of services from Amazon Web Services (AWS), we've developed a robust set of procedures to swiftly and efficiently restore our systems. Here's our step-by-step strategy for rapid restoration:
Backup and Storage
- Amazon S3: Ensure all critical data is routinely backed up on Amazon S3, taking advantage of its 99.999999999% (11 nines) durability. Harness versioning to maintain, retrieve, and restore every version of every object stored in an Amazon S3 bucket (a configuration sketch follows this list).
- AWS Backup: Utilize AWS Backup to centralize and automate the backup processes across AWS services. Enforce retention policies, monitor backup activity, and ensure regular backups are performed.
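The following minimal sketch illustrates the S3 versioning setup referred to above; the bucket name and object key are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "legit-health-backups"  # hypothetical bucket name

# Enable versioning so every object revision can be retrieved or restored.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# List the stored versions of a given object to pick one to restore.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix="database/dump.sql")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])
```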
Database Restoration
- Amazon RDS (Relational Database Service):
- Activate automated backups for daily backup creation of the entire DB instance.
- Use database snapshots to manually backup your DB instance.
- In case of failure, restore the RDS instance from a chosen snapshot or to a specific point in time before the disruption (see the restore sketch after this list).
- Amazon DynamoDB:
- Use DynamoDB's continuous backups and point-in-time recovery options to restore your table to any second within the past 35 days.
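The sketch below shows the point-in-time restore operations described in this section, using boto3; the instance identifiers, table names and timestamp are hypothetical examples, not production values.

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")
dynamodb = boto3.client("dynamodb")

# Moment just before the disruption (hypothetical timestamp).
restore_time = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)

# Restore the RDS instance to a new instance at a specific point in time.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="legit-health-db",          # hypothetical identifier
    TargetDBInstanceIdentifier="legit-health-db-restored",
    RestoreTime=restore_time,
)

# Restore a DynamoDB table to the same point in time (within the 35-day window).
dynamodb.restore_table_to_point_in_time(
    SourceTableName="device-results",                      # hypothetical table name
    TargetTableName="device-results-restored",
    RestoreDateTime=restore_time,
)
```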
Server and Application Restoration
- Amazon EC2 (Elastic Compute Cloud):
- Routinely create and store Amazon Machine Images (AMIs) of pivotal EC2 instances.
- During a disaster, launch replacement instances using the preserved AMIs to hasten recovery (see the sketch after this list).
- Amazon Elastic Beanstalk: For applications hosted on Elastic Beanstalk, ensure the latest versions of your application are readily available for quick redeployment.
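As an illustrative sketch of the AMI-based recovery path above, the snippet below captures an image of an instance and launches a replacement from it; the instance ID, AMI name and instance type are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2")

# Routine step: capture an AMI of a pivotal instance (hypothetical instance ID).
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="api-server-weekly-ami",
    NoReboot=True,  # avoid stopping the instance while imaging
)

# Disaster step: launch a replacement instance from a preserved AMI.
ec2.run_instances(
    ImageId=image["ImageId"],   # or the ID of a previously stored AMI
    InstanceType="t3.medium",   # hypothetical instance type
    MinCount=1,
    MaxCount=1,
)
```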
Monitoring and Alerting
- Amazon CloudWatch: Activate monitoring and alerts for essential metrics to detect issues promptly (an alarm-creation sketch follows this list).
- AWS Health Dashboard: Stay updated regarding AWS service events and understand their potential impact on AWS resources.
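A minimal alarm-creation sketch for the CloudWatch monitoring described above is shown below; the alarm name, instance ID and SNS topic ARN are hypothetical placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when an EC2 status check fails, notifying an SNS topic (ARN is hypothetical).
cloudwatch.put_metric_alarm(
    AlarmName="api-server-status-check-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
)
```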
By adhering to these procedures and optimizing the capabilities offered by AWS, we can minimize potential downtimes and ensure that our IT systems and data are rapidly restored in the face of any disruptions. Regular training sessions and updates to this plan will bolster our preparedness and resilience.
Levels of Failures or Disruptions:
1. Minimal Disruption (e.g., Brief Service Outages)
Steps to Remedy:
- Monitor: Continuously monitor the services using Amazon CloudWatch.
- Alerts: Set up CloudWatch Alarms to get instant notifications about the service disruptions.
- Assessment: Quickly assess the impact and scale of the disruption.
- Communicate: Notify users and stakeholders about the temporary glitch and expected recovery time.
- Initial Recovery: Try a soft reboot of affected instances or services (see the sketch after this list).
- Documentation: Document the incident and the steps taken for future references.
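The following sketch illustrates the soft-reboot step for EC2-hosted services; the instance ID is a hypothetical example.

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_IDS = ["i-0123456789abcdef0"]  # hypothetical affected instances

# Soft (OS-level) reboot of the affected instances.
ec2.reboot_instances(InstanceIds=INSTANCE_IDS)

# Wait until both EC2 status checks pass again before closing the incident.
ec2.get_waiter("instance_status_ok").wait(InstanceIds=INSTANCE_IDS)
```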
2. Moderate Disruption (e.g., Application Bugs, Slow Performance)
Steps to Remedy:
- Identification: Use CloudWatch and AWS X-Ray to pinpoint the source of the issue.
- Communicate: Inform users that troubleshooting is underway and provide an estimated time for resolution.
- Rollback: If a recent deployment or change is identified as the culprit, roll back to the previous stable version using Amazon Elastic Beanstalk or EC2 AMIs (a rollback sketch follows this list).
- Optimize: Address any identified performance bottlenecks, possibly utilizing AWS Auto Scaling to manage varying loads.
- Testing: Before redeploying, test the changes in a staging environment.
- Documentation: Log the issue, solution, and steps taken.
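A minimal rollback sketch for the Elastic Beanstalk case is shown below; the application name, environment name and version label are hypothetical and must be replaced with the actual values of the affected environment.

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# List recent application versions to pick the last known-good one.
versions = eb.describe_application_versions(ApplicationName="legit-health-api")
for v in versions["ApplicationVersions"][:5]:
    print(v["VersionLabel"], v["DateCreated"])

# Roll the environment back to the previous stable version.
eb.update_environment(
    EnvironmentName="legit-health-api-prod",
    VersionLabel="v1.4.2",  # the last version known to be stable
)
```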
3. Major Disruption (e.g., Data Loss, Security Breaches)
Steps to Remedy:
- Immediate Isolation: Disconnect affected systems to prevent further damage or loss.
- Notify: Inform stakeholders, users, and, if necessary, legal authorities (especially in the case of data breaches).
- Assessment: Use AWS CloudTrail and other logging systems to assess the root cause (a log-query sketch follows this list).
- Data Restoration: Restore data from backups using services like Amazon S3 or Amazon RDS snapshots.
- Security Measures: If a breach has occurred, employ AWS Shield and AWS WAF for enhanced security and engage a security specialist team for a thorough review.
- Lessons Learned: Conduct a thorough review to understand vulnerabilities and to implement improved preventive measures.
- Documentation: Create a detailed incident report for reference.
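As an illustrative sketch of the CloudTrail-based assessment, the query below looks up recent object-deletion events; the chosen event name and time window are example assumptions for this scenario.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

# Look for object deletions in the last 24 hours as part of the root-cause assessment.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "DeleteObject"}],
    StartTime=start,
    EndTime=end,
    MaxResults=50,
)
for e in events["Events"]:
    print(e["EventTime"], e.get("Username"), e["EventName"])
```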
4. Catastrophic Disruption (e.g., Complete Service Shutdown, Natural Disasters)
Steps to Remedy:
- Emergency Communication: Immediately inform users and stakeholders about the extent of the disruption.
- Invoke DR Plan: Activate the Disaster Recovery plan, redirecting traffic and workloads to a failover site using services like Amazon Route 53 (a DNS failover sketch follows this list).
- Resource Allocation: Mobilize dedicated teams to work on restoring services, data, and infrastructure.
- Service Restoration: Use stored AMIs, AWS Elastic Beanstalk versions, and other snapshots to quickly launch replacement instances and restore applications.
- Data Restoration: Prioritize restoration of critical data from backups in Amazon S3, RDS, or DynamoDB.
- Continuous Updates: Keep users and stakeholders continuously informed about the restoration progress.
- Review: Once resolved, conduct a thorough analysis to determine the cause and prevent future occurrences.
- Documentation: Draft a comprehensive incident report detailing the event, impact, response, and future preventive measures.
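The sketch below illustrates the Route 53 traffic redirection step; the hosted zone ID, record name and failover endpoint are hypothetical placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Repoint the public API record to the failover endpoint (zone ID and names are hypothetical).
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",
    ChangeBatch={
        "Comment": "DRP activation: redirect traffic to the failover site",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "failover.eu-central-1.example.com"}],
                },
            }
        ],
    },
)
```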
Alternative process
We have considered, as an alternative process, a Cold Site Recovery. However, for the reasons listed below, we decided not to implement this and stick with the Hot Site process.
In contrast to the previously described process, a Cold Site Recovery process relies on offline resources that are brought online when a disaster strikes. This method is less expensive than maintaining a hot site. However, it would result in a much longer Recovery Time Objective (RTO), and it relies on resources that are not connected to the internet.
Due to the software-only and cloud-only nature of our device, the Cold Site Recovery process does not make much sense. Firstly, because the longer downtime would not be acceptable. And secondly, because if there is no internet connection, the device cannot be used at all; so recovering it offline would make no difference to the safety of the device.
Disaster recovery documentation
Depending on the disaster and the speed needed to solve the problem, the documentation of the incident or disaster and its recovery will be performed at the end of the process, so as to ensure that the service is properly operating and to cause the minimum impact on service provision.
The report will contain all the aspects considered during the DRP activation, the actions performed, the satisfactory restoration evidence and a root cause analysis of the disaster to ensure there is no additional effect of the disaster on the service provided.
Data back ups
API REST back ups
As we specify in the GP-012 Design, redesign and development procedure, the API data is meticulously updated at regular intervals to ensure optimal performance and up-to-date information. With a precise update frequency of 12 hours, our robust system guarantees that the most current data is consistently available to users.
In order to safeguard the integrity of our data, we have implemented a comprehensive backup strategy. Our carefully devised plan utilizes an incremental backup approach, which efficiently captures and stores any modifications made to the data since the last backup. This method not only reduces storage requirements but also minimizes the time and resources needed for the backup process.
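As a non-normative sketch of how such an incremental, scheduled backup can be declared, the snippet below creates an AWS Backup plan running every 12 hours; the plan name, vault name and retention period are assumptions for illustration (AWS Backup stores incremental snapshots after the first full backup for supported resources).

```python
import boto3

backup = boto3.client("backup")

# Backup plan running every 12 hours; names and retention are hypothetical.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "api-data-12h",
        "Rules": [
            {
                "RuleName": "every-12-hours",
                "TargetBackupVaultName": "Default",
                "ScheduleExpression": "cron(0 0/12 * * ? *)",  # at minute 0, every 12 hours
                "StartWindowMinutes": 60,
                "Lifecycle": {"DeleteAfterDays": 35},
            }
        ],
    }
)
```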
By employing this best-in-class data update and backup system, we maintain a high standard of reliability, efficiency, and security, providing our clients with the utmost confidence in the quality and accuracy of the information provided by our API.
QMS documentation backups
As we detail in GP-001 Control of documents, by leveraging the inherent features of Git, our QMS benefits from multiple local backups of the documentation. Each collaborator maintains a complete copy of the project, enabling redundancy, availability, and ensuring the integrity of the information.
The commit history and collaboration capabilities of the Git-based QMS allow us to check and validate the local backups every time a collaborator performs a new commit or opens a pull request, as it is confirmed that the backup and the original QMS match completely without deviations.
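A minimal verification sketch, assuming it is run inside a local clone of the QMS repository with a remote named origin, compares the local copy against the reference repository:

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command in the local QMS repository and return its output."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

# Hash of the local copy (the collaborator's backup).
local_head = git("rev-parse", "HEAD")

# Hash of the reference QMS repository on the remote.
remote_head = git("ls-remote", "origin", "HEAD").split()[0]

if local_head == remote_head:
    print("Local backup matches the reference QMS repository.")
else:
    print("Local copy differs from the reference repository; synchronise with git pull/push.")
```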
Continuity tests
Maintaining confidence in our disaster recovery (DR) and business continuity (BC) strategy requires regular testing and validation. By conducting continuity tests, we not only ensure that our procedures for rapid restoration are effective but also guarantee that our Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets are consistently met.
We use one testing method:
Automated Failover Testing using AWS Fault Injection Simulator
The purpose is to automatically test our AWS workloads' resilience to failures without human intervention. We use AWS Fault Injection Simulator to conduct controlled fault injection experiments like server failures or network disruptions.
Thanks to this, we identify weak points in our architecture and ensure the automation tools respond appropriately to failures.
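The sketch below illustrates how such an experiment can be started and its outcome checked with boto3; the experiment template ID is a hypothetical placeholder that must exist in the account.

```python
import boto3

fis = boto3.client("fis")

# Start a controlled fault-injection experiment from a pre-defined template
# (the template ID is hypothetical).
experiment = fis.start_experiment(experimentTemplateId="EXT123456789abcdef")

# Poll the experiment state to confirm it ran to completion.
state = fis.get_experiment(id=experiment["experiment"]["id"])["experiment"]["state"]
print(state["status"], state.get("reason", ""))
```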
Test report
The records of the tests are stored in AWS Fault Injection Simulator. The tests are automatically run every 12 months and the records contain all the information, including:
- Entity who has executed the tests.
- Date, time and duration of the tests.
- Environment over which the test was conducted.
- Resources used.
- Incidents detected, documenting any problems encountered.
- Comparison with previous test times.
- Record any deviations from the test plan.
Test 02: Environment Replication Using CloudFormation Templates
The purpose of this test is to duplicate the production environment to create a controlled testing ground without risking production systems. The way it works is: we use CloudFormation templates to define our entire AWS environment, including networks, servers, databases, and other resources. Then, we deploy these templates in a separate AWS account or region, which mimics the production environment. This isolated sandbox lets us test various scenarios without affecting the live environment. Here, we conduct simulated failures and data recovery tests.
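As an illustrative sketch of this replication step, the snippet below deploys a stack from a template in a separate region and waits for completion; the region, stack name and template URL are hypothetical assumptions.

```python
import boto3

# A separate region is used as the isolated sandbox (hypothetical choice).
cloudformation = boto3.client("cloudformation", region_name="eu-central-1")

# Deploy the replicated environment from the same template used for production.
cloudformation.create_stack(
    StackName="dr-test-environment",
    TemplateURL="https://s3.amazonaws.com/legit-health-templates/environment.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Wait until the sandbox environment is fully created before running recovery tests.
cloudformation.get_waiter("stack_create_complete").wait(StackName="dr-test-environment")
```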
If several tests are performed, each test is numbered consecutively and named. For each test, we indicate what is done, what results are expected, and how the evidence of execution and acceptance is generated and stored.
Test XX. Test name
- Development of detailed procedures for implementing recovery strategies
- Identification of required resources, such as personnel, equipment, and facilities
- Documentation of roles and responsibilities for key personnel
- Development of procedures for rapid restoration of IT systems and data
For planned maintenance actions or in case of failure, the following actions will be performed:
- Notification to all company personnel via email with the affected services and utilities, the expected time for the maintenance and the defined actions that they must do if required.
- Notification to the company General Manager and the person responsible for IT, explaining the problem or the purpose of the maintenance and the required actions.
- Migrate critical functionalities and services to the other data center, including changing the DNS configuration if it is a web service.
- Check that the migration is completed successfully.
- Ensure a proper, secure and controlled shutdown of the affected physical server; this includes shutting down virtual machines in the required order if needed.
- Perform maintenance actions depending on the intervention.
- Switch the server back on and restart all the virtual machines and services hosted on it.
- Check integrity and intervention success
- Migrate back the services to the recovered data center
- Check that the migration is completed successfully.
- Notification via email to the team about the completion of the maintenance and the actions that must be taken if required.
Test report
Below is an example of what the test report should contain; the evidence may also take the form of a screenshot of the executed test or a PDF.
The last activity to be performed is to generate a report describing the test performed and the results obtained. At least it will contain:
- Person who has executed the tests.
- Date, time and duration of the tests.
- Test environment.
- Resources used.
- Incidents detected, documenting any problems encountered.
- Comparison with previous test times.
- Record any deviations from the test plan.
Post-test review:
- Were the parameters correct?
- Were the objectives met?
- Were the measurement criteria included in the report correct?
- Identification of problem areas, strengths and deviations from procedures.
- Recommendations for improving the plan.
Maintenance of records in the case of bankruptcy
In the case of bankruptcy, in order to ensure the continued availability and integrity of critical records, as required by the European Medical Device Regulation (MDR), we use cold storage services.
This applies to all records and documentation related to the Quality Management System (QMS) as per ISO 13485 standards, including but not limited to, design and development files, manufacturing records, quality control data, and post-market surveillance documentation.
Cold storage services refer to a type of data storage that is ideal for archiving data which is not frequently accessed but needs to be retained for long periods.
Utilizing cold storage ensures that our QMS records are securely stored and accessible for at least 10 years post-bankruptcy, fulfilling the requirements of the MDR.
To fulfill this requirement, we allocate a fixed budget of 500 euros specifically for this purpose. Then, we pay in advance for a service period of at least 10 years. This prepayment ensures service continuity even in the event of bankruptcy.
The supplier will be the same cloud provider that we use at the time of the bankruptcy for the QMS and the device storage. Thus, the cloud provider will be the one specified in the GP-010 Purchases and suppliers evaluation procedure. At the time of first writing this, the supplier is AWS, which offers the Amazon S3 Glacier cold storage service. All major cloud suppliers offer equivalent services.
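As a non-normative sketch of how records can be routed to cold storage, the lifecycle rule below transitions all objects in an archive bucket to S3 Glacier; the bucket name and rule ID are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Archive all QMS records in the bucket to S3 Glacier shortly after upload
# (bucket name is hypothetical).
s3.put_bucket_lifecycle_configuration(
    Bucket="legit-health-qms-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-qms-records",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```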
Maintenance of IFU
There is a specific record that we will maintain not only in cold storage, but also deployed at a live URL that anyone can access. This record is the Instructions For Use (IFU) of our device or devices. Unlike most records, which must be kept 10 years post-bankruptcy, the IFU must be kept 15 years post-bankruptcy. Except for these two differences, the IFU is kept following the same method as other records of our QMS.
Training and awareness
We provide regular training to all relevant employees on the importance of record preservation and the specific procedures outlined in this policy. This is part of our fundamental QMS and GDPR training. This ensures that the awareness of this policy is a part of the onboarding process for new employees.
BCP & DRP maintenance
Modifications to this plan are based on both scheduled and unscheduled events:
- Periodic maintenance
- The information covered by the plan can change over time. Changes to the facilities, personnel and responsibilities must be periodically reviewed to ensure accuracy. For this reason, this document is revised annually according to the R-002-005 Quality Calendar, along with the data recovery and restoration tests described in the previous section.
- Experience registration (unscheduled events)
- Lessons learned from an actual disaster experience should be documented thoroughly to improve recovery readiness and to avoid potential future system downtime.
If, at any given time, we detect a non-conformity during the disaster recovery tests, the tests will be repeated. In case of non-compliance, the finding will be treated in accordance with the procedure GP-006 Non-conformity. Corrective and preventive actions.
Annexes
Staff contact list
Name | Position | Email | Phone
---|---|---|---
Andy Aguilar | General Manager | andy@legit.health | 
Taig Mac Carthy | Design and Development Manager | taig@legit.health | 
Alfonso Medela | Technical Manager and PRRC | alfonso@legit.health | 
Gerardo Fernández | Technology Manager | gerardo@legit.health | 
Key suppliers contact information
Amazon Web Services EMEA SARL
- NIF B 186284
- Email: eu-privacy@amazon.co.uk
- Address: 38, Avenue John F. Kennedy, L-1855 Luxembourg.
Key customers contact
We have all customer contacts available within our customer management tool HubSpot. We also connect to HubSpot via API to create backups in our AWS servers.
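A minimal sketch of this API-based backup is shown below, assuming a HubSpot private app token and a hypothetical S3 bucket; it copies a snapshot of the first page of CRM contacts to S3 and is illustrative only.

```python
import json
from datetime import datetime, timezone

import boto3
import requests

HUBSPOT_TOKEN = "..."  # private app token, kept in a secrets manager in practice
BUCKET = "legit-health-crm-backups"  # hypothetical bucket name

# Fetch a page of contacts from the HubSpot CRM API.
response = requests.get(
    "https://api.hubapi.com/crm/v3/objects/contacts",
    headers={"Authorization": f"Bearer {HUBSPOT_TOKEN}"},
    params={"limit": 100},
    timeout=30,
)
response.raise_for_status()

# Store the snapshot in S3, keyed by date, as the backup copy.
key = f"hubspot/contacts-{datetime.now(timezone.utc):%Y-%m-%d}.json"
boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=json.dumps(response.json()).encode())
```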
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I Responsibility Matrix of the GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001