Building Incident Response and Disaster Recovery Capabilities in the Cloud

It is impossible to completely eliminate the risk of critical data being compromised by incidents or disasters. How, then, can organizations implement a reliable incident response strategy for their cloud environment? They must weigh several factors to prepare for disruptions and choose an appropriate level of protection. No organization is 100% safe from disasters; however, a disaster recovery strategy reduces downtime and lost business opportunities.

Why is it essential for organizations to have a disaster recovery strategy in place? The recent AWS outage is a prime example: it caused nearly 7 hours of downtime for several customers. Without solid systems that automatically alert on outages, organizations are left vulnerable to devastating incidents. Solutions such as cloud backup systems and automated monitoring help protect organizations in situations like these.

Connectivity – A Key Challenge

Network configurations are by far the most critical area to consider when recovering from any type of outage, and connectivity is a major concern for on-site users as well as for those working remotely. Aspects that can create challenges include remote access, repointing internal and external Domain Name System (DNS) settings, and reconnecting email. Reconfiguration can be difficult, but it is possible if organizations have up-to-date documentation and the necessary credentials and authority. They must also make sure that this documentation and these credentials are not stored only on systems that an outage could make inaccessible.

Cloud service providers are usually responsible for the availability of systems and the protection of the underlying infrastructure, whereas managing the data is the customer’s responsibility. Reconfiguring DNS can introduce security vulnerabilities if network details are left unsecured during the failover and restore phases.
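Repointing DNS is far easier when the records involved are documented ahead of time, together with both their production and DR targets. The sketch below is a minimal, hypothetical illustration of that idea. It assumes a simple inventory of records (the `DnsRecord` class and the helper functions are invented for this example) and computes which records must change when some primary sites fail, which also covers partial failovers. It does not call any real DNS provider API; applying the changes would go through whatever DNS tooling the organization already uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DnsRecord:
    """Hypothetical inventory entry: a DNS name with its production and DR targets."""
    name: str
    primary: str
    dr: str


def desired_targets(records: list[DnsRecord], failed_names: set[str]) -> dict[str, str]:
    """Map each record name to the target it should resolve to.

    Records listed in `failed_names` are repointed to their DR target; all
    others stay on production, so a partial failover is just a smaller set.
    """
    return {r.name: (r.dr if r.name in failed_names else r.primary) for r in records}


def changes_needed(records: list[DnsRecord], current: dict[str, str],
                   failed_names: set[str]) -> list[tuple[str, str | None, str]]:
    """Return (name, current_target, new_target) for every record that must change."""
    desired = desired_targets(records, failed_names)
    return [(name, current.get(name), target)
            for name, target in desired.items()
            if current.get(name) != target]


# Hypothetical inventory using documentation-only addresses.
records = [
    DnsRecord("vpn.example.com", "203.0.113.10", "198.51.100.10"),
    DnsRecord("mail.example.com", "203.0.113.20", "198.51.100.20"),
    DnsRecord("intranet.corp.example", "10.0.0.5", "10.1.0.5"),
]
```

Calling `changes_needed(records, current, {"vpn.example.com", "mail.example.com"})` with every record currently on production lists only the two external records to repoint. Calling it again with an empty set after failover computes the changes for the restore phase. Keeping this inventory outside the systems it describes follows the advice above about not losing documentation to the outage itself.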

How Do You Plan for the Unknown?

Failover and restore scenarios vary significantly, and organizations sometimes experience partial failovers and restores while reconnecting cloud, on-premises, or remote sites. When planning for disaster recovery, organizations should brainstorm every plausible outcome of an outage and its impact on resources. This helps them address the plan at a general level before working through the details.

It is critical to classify the resulting security, connectivity, and access issues that must be addressed. In many cases, disaster preparedness can then be tested through simulations, and parts of the response can be automated to prevent data loss during downtime.

Protocols to Consider for Disaster Recovery/Incident Response

RTO and RPO

Recovery time objective (RTO) and recovery point objective (RPO) are key metrics to define, based on specific business requirements, before planning for disaster recovery. RPO is the maximum acceptable amount of data loss, measured as the time between the last recoverable copy of the data and the moment of the outage. Where data changes infrequently, a higher RPO is acceptable because little would be lost between backups; where data changes constantly, the RPO must be kept low. RTO, by contrast, is the maximum acceptable time for cloud services to be offline during an outage.
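A short worked example helps show how these targets constrain a backup and recovery plan. The sketch below assumes periodic backups that capture data as of the moment each backup starts and become usable only once they finish. Under that assumption, an outage just before a backup completes falls back to the previous backup, so the worst-case data loss is the backup interval plus the backup duration. The function names are illustrative, not taken from any particular tool.

```python
from __future__ import annotations

from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst-case data loss for periodic backups.

    Assumes each backup captures data as of its start time and is usable only
    after it completes, so the worst case is interval + duration.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, backup_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


def max_backup_interval(rpo: timedelta, backup_duration: timedelta) -> timedelta:
    """Longest backup interval that still satisfies the RPO."""
    if backup_duration >= rpo:
        raise ValueError("a single backup takes longer than the RPO allows")
    return rpo - backup_duration


def meets_rto(recovery_steps: list[timedelta], rto: timedelta) -> bool:
    """True if the estimated recovery steps (detect, restore, validate, ...) fit within the RTO."""
    return sum(recovery_steps, timedelta(0)) <= rto
```

For instance, hourly backups that take 10 minutes give a worst-case loss of 70 minutes. That misses a 1-hour RPO but satisfies a 2-hour one. A workload with constantly changing data and a 15-minute RPO would need backups at least every 10 minutes if each backup takes 5 minutes.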

Security and Compliance

Besides ensuring that backups of applications and data stay aligned with the production environment, organizations must maintain the same security and compliance controls in the DR environment. Replicating user access controls, for instance, helps keep permissions consistent across the DR cloud and the production environment.
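One way to confirm that access controls were actually replicated is to compare the two environments directly. The sketch below assumes each environment's permissions have already been exported as a mapping from user to a set of permission names. The export step, the permission strings, and the `permission_drift` helper are all hypothetical. The function reports, per user, which permissions are missing from DR and which exist only in DR.

```python
from __future__ import annotations

from collections.abc import Mapping


def permission_drift(production: Mapping[str, set[str]],
                     dr: Mapping[str, set[str]]) -> dict[str, tuple[set[str], set[str]]]:
    """Compare user -> permissions mappings from production and DR.

    Returns {user: (missing_in_dr, extra_in_dr)} for every user whose
    permissions differ; users with identical permissions are omitted.
    """
    drift: dict[str, tuple[set[str], set[str]]] = {}
    for user in sorted(production.keys() | dr.keys()):
        prod_perms = set(production.get(user, set()))
        dr_perms = set(dr.get(user, set()))
        missing = prod_perms - dr_perms  # users could be locked out during recovery
        extra = dr_perms - prod_perms    # broader access than production allows
        if missing or extra:
            drift[user] = (missing, extra)
    return drift


# Hypothetical exports from each environment.
production = {"alice": {"backup:read", "dr:login"}, "bob": {"dr:login"}}
dr = {"alice": {"backup:read"}, "bob": {"dr:login"}, "carol": {"admin"}}
```

Here `permission_drift(production, dr)` flags that alice cannot log in to DR and that carol holds an admin permission in DR with no counterpart in production. Reporting both directions matters: missing permissions block recovery work, while extra ones weaken the compliance posture of the DR copy.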

Testing for Recovery

Some important areas to test include:

  • Whether user permissions for accessing the cloud backups have been replicated, and whether users can log in to the DR environment and perform their tasks.
  • Whether the security controls safeguarding the environment can pass a penetration test.
  • Whether the recovery meets the RTO and RPO.
  • Whether the system can handle the high load when users regain access.
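The checks in this list can be recorded and evaluated consistently after each DR drill. The sketch below assumes the team records a handful of observations during the test. The `DrDrillResult` fields and the `evaluate_drill` helper are hypothetical and not part of any standard. It returns one human-readable failure per unmet check, covering access, security, RTO, RPO, and load.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrDrillResult:
    """Hypothetical observations recorded during a disaster recovery test."""
    users_tested: int
    users_able_to_log_in: int
    pen_test_passed: bool
    recovery_time: timedelta     # outage start -> service restored
    data_loss_window: timedelta  # last recovered write -> outage start
    peak_load_rps: float         # requests/second when users reconnected
    capacity_rps: float          # requests/second the DR environment sustained


def evaluate_drill(result: DrDrillResult, rto: timedelta, rpo: timedelta) -> list[str]:
    """Return a list of failed checks; an empty list means the drill passed."""
    failures = []
    if result.users_able_to_log_in < result.users_tested:
        locked_out = result.users_tested - result.users_able_to_log_in
        failures.append(f"access: {locked_out} of {result.users_tested} users could not log in")
    if not result.pen_test_passed:
        failures.append("security: penetration test failed")
    if result.recovery_time > rto:
        failures.append(f"RTO: recovered in {result.recovery_time}, target {rto}")
    if result.data_loss_window > rpo:
        failures.append(f"RPO: lost {result.data_loss_window} of data, target {rpo}")
    if result.capacity_rps < result.peak_load_rps:
        failures.append(f"load: peak {result.peak_load_rps} rps exceeds capacity {result.capacity_rps} rps")
    return failures
```

Keeping every failure rather than stopping at the first one gives a complete picture from a single drill, which is useful because DR tests are usually expensive to repeat.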

Cloud Monitoring

A high degree of visibility is critical for identifying issues early in the cloud environment, applications, and services. It lets organizations take the necessary action and mitigate issues before they escalate. Achieving a low RTO and RPO depends heavily on whether automated cloud monitoring is in place.
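Automated monitoring usually comes down to regular health checks plus a rule for when to alert. The sketch below is one common approach, assumed here rather than drawn from any specific monitoring product. It raises an alert only after a configurable number of consecutive failed checks, so a single dropped probe does not page anyone. It fires once per outage and reports recovery when checks succeed again. The `http_probe` helper uses only the Python standard library; the URL to probe and the alert delivery mechanism are left to the reader.

```python
from __future__ import annotations

import http.client
import urllib.request
from typing import Optional


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx), timeouts and connection resets land here.
        return False


class OutageDetector:
    """Alert after `threshold` consecutive failed health checks."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, healthy: bool) -> Optional[str]:
        """Feed one check result; return "alert", "recovered", or None."""
        if healthy:
            was_alerting = self.alerting
            self.consecutive_failures = 0
            self.alerting = False
            return "recovered" if was_alerting else None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True  # fire once per outage, not on every failed check
            return "alert"
        return None
```

In practice, a scheduler would call `detector.record(http_probe(url))` every minute or so and forward any non-`None` result to the on-call channel. The threshold trades detection speed against false alarms. Since detection time counts toward the RTO, it should be chosen with that target in mind.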

Prioritize What Matters

According to Gartner, over 70% of organizations are not prepared when it comes to disaster recovery or incident response capabilities. Some options may offer cost benefits but may not be worth implementing because of other underlying factors. A disaster recovery plan designed to offset outages should follow a risk-based approach that prioritizes the most critical systems and data.
