
AWS outage: the lessons that must be learned

Aras Nazarovas explains what business leaders can do to protect operations in the future

The biggest lesson from the AWS US-East-1 region outage on October 20th is that most of the affected companies lacked the resources to ensure business continuity. It shows that even with some of the most trusted and reliable cloud solutions, it is unreasonable to expect 100% uptime.

Therefore, to ensure that your business continues to operate even when a major cloud provider is experiencing issues, you need backups of your data, as well as computational resources in other places, either with other cloud providers or on site.

When a single cloud region fails, the most cost-effective solution is to also host your services in other cloud regions. This can still fail, however, if multiple regions go down or if something breaks at the provider level, upstream of individual regions. It is therefore best to have usable fallback infrastructure, either on site or with other cloud providers such as Google Cloud, Microsoft Azure or DigitalOcean.
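The fallback order described above can be sketched in a few lines of Python. This is a minimal illustration, not a production failover mechanism: the target names and the health-check callables are hypothetical, and a real deployment would more likely rely on DNS failover or a load balancer than on application code.

```python
from typing import Callable, Optional, Sequence, Tuple

# A deployment target is a (name, health check) pair. Both are assumptions
# made for illustration; the check should return True when the target is usable.
Target = Tuple[str, Callable[[], bool]]


def pick_healthy_target(targets: Sequence[Target]) -> Optional[str]:
    """Return the name of the first target whose health check passes.

    Targets are tried in the order given, e.g. primary region, then a
    second region, then another provider or an on-site fallback.
    Returns None if every target is unavailable.
    """
    for name, is_healthy in targets:
        try:
            if is_healthy():
                return name
        except Exception:
            # A check that raises (timeout, DNS failure, etc.) is treated
            # the same as an unhealthy target.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical ordering: cheapest option first, most independent last.
    targets = [
        ("primary-region", lambda: False),   # simulated regional outage
        ("secondary-region", lambda: False), # simulated provider-level fault
        ("other-provider", lambda: True),
    ]
    print(pick_healthy_target(targets))
```

Listing targets from cheapest to most independent mirrors the trade-off in the text: a second region is tried first, and a separate provider or on-site system is used only when the regions themselves cannot be reached.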

The outage exposed several operational blind spots. For years, businesses have tended to host most of their infrastructure with a single provider to reduce cost and complexity. This, however, comes with risks, as seen on October 20th. If your single provider fails, your business is left stranded, with very limited control over the situation.

Such potential failures should be taken into account when designing services that must remain operational at all times. In these cases, infrastructure mirrored across providers, or on-site fallback solutions, may be worth investing in, even when the cost increases significantly.

Following this outage, business leaders should take several steps.

First, create an inventory of the hosts and services that you operate and offer to your customers. Prioritise which services and hosts must remain operational at any cost, and which can be temporarily disrupted without significant impact. For a messaging app, for example, you might prioritise the ability to send and receive text messages over the features that enable sending images, videos, stickers, read receipts and so on. Then create a plan, budget and timeline to implement this additional redundancy.
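As a rough illustration of such an inventory, the sketch below records each service with a priority tier and flags critical services that do not yet have a fallback. The tier names, fields and messaging-app entries are hypothetical, chosen only to match the example in the text.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Tier(IntEnum):
    CRITICAL = 1    # must remain operational at any cost
    DEFERRABLE = 2  # can be temporarily disrupted


@dataclass(frozen=True)
class Service:
    name: str
    tier: Tier
    has_fallback: bool  # True if it can run outside the primary provider


def redundancy_gaps(inventory: Iterable[Service]) -> List[str]:
    """Return the names of critical services with no fallback, sorted by name."""
    return sorted(
        s.name
        for s in inventory
        if s.tier == Tier.CRITICAL and not s.has_fallback
    )


# Hypothetical inventory for the messaging-app example.
MESSAGING_APP = [
    Service("text_messaging", Tier.CRITICAL, has_fallback=False),
    Service("image_sharing", Tier.DEFERRABLE, has_fallback=False),
    Service("read_receipts", Tier.DEFERRABLE, has_fallback=False),
]

if __name__ == "__main__":
    print(redundancy_gaps(MESSAGING_APP))
```

The resulting list is a natural starting point for the plan, budget and timeline: it names the services where added redundancy matters most.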

It’s also important to create a disaster recovery plan, or update the existing one, by reconsidering the risks and issues that may arise from similar failures and preparing solutions for them. This saves valuable time when a similar issue occurs, allowing teams to restore critical systems as quickly as possible, without hesitation or second-guessing.

One of the most effective ways to ensure a swift recovery from third-party cloud failures is to have a predetermined plan for what to do when such an issue occurs. This means teams do not need to come up with solutions or make difficult decisions when time is of the essence, so they can promptly implement any required changes. Having multiple alternative lines of communication is also critical for coordinating recovery efforts efficiently.

Aras Nazarovas is a Senior Information Security Researcher at Cybernews, a research-driven online publication. Aras specialises in cybersecurity and threat analysis. He investigates online services, malicious campaigns, and hardware security while compiling data on the most prevalent cybersecurity threats.
