How to Prepare for the Next Cloud Outage

By: Morpheus Data

A big part of an IT manager’s job is asking, “What if…?” As cloud services become more popular, disaster plans are updated to include contingencies for the moment when the company’s cloud-hosted data resources become unavailable.

Two recent service failures highlight the damage that can result when a cloud platform experiences an outage:

  • On February 28, 2017, an Amazon employee working on the company’s Simple Storage Service (S3) executed a command that the person thought would remove a small number of servers from an S3 subsystem. As HyTrust’s Ashwin Krishnan writes in a May 4, 2017, article on Light Reading, the command was entered incorrectly, which resulted in the removal of many more servers than intended.
  • On March 21, 2017, many customers of Microsoft’s Azure, Outlook, Hotmail, OneDrive, Skype, and Xbox Live services were either unable to access their accounts, or they experienced severely degraded service. The company traced the outage to a “deployment task” gone awry; rolling back the task remedied the problem.

The proactive approach to mitigating the effects of a cloud failure

If you think the threat of an outage would deter companies from taking advantage of cloud benefits, you obviously haven’t been working in IT very long. There’s nothing new about ensuring the high availability of crucial apps and data when trouble strikes. On the contrary, keeping systems running through glitches large and small is the IT department’s claim to fame.

NS1’s Alex Vayl explains in an article on IT Pro Portal that the damage resulting from an outage can be greatly reduced when application developers follow best practices. Vayl identifies six keys for designing and deploying high availability applications:

  1. Prioritize the components comprising the app stack based on the project’s cost/complexity, the impact of a failure on users, and the likelihood of each component failing.
  2. Minimize the risk that could be introduced by third parties, primarily cloud service providers, by ensuring they follow best practices themselves.
  3. Consider the app components that need to be maintained in private or hybrid clouds due to compliance and regulatory considerations.
  4. When technically feasible, configure critical app components in hybrid clouds that are backed up via a separate cloud service; alternatively, replicate data across at least two zones and automatically reroute traffic if one zone fails.
  5. Implement an intelligent DNS setup that shifts traffic dynamically based on real-time user, network, application, and infrastructure telemetry.
  6. Make sure your DNS isn’t a single point of failure by contracting with a managed DNS provider, as well as with a secondary DNS provider.
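
To make the rerouting idea in items 4 and 5 a little more concrete, here is a minimal Python sketch of priority-ordered failover between two zones. It is an illustration only, not Vayl's method or any DNS provider's API: the `Endpoint` type, the zone names, the addresses, and the simulated health check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    zone: str      # hypothetical zone label, e.g. "zone-a"
    address: str   # hypothetical address that traffic would be sent to


def pick_endpoint(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Two replicas in separate zones, listed in priority order (made-up values).
REPLICAS = [
    Endpoint(zone="zone-a", address="10.0.1.10"),
    Endpoint(zone="zone-b", address="10.0.2.10"),
]

# Simulated health state; a real check would probe each endpoint.
down_zones = {"zone-a"}
chosen = pick_endpoint(REPLICAS, lambda e: e.zone not in down_zones)
print(chosen.zone if chosen else "no healthy zone")  # prints "zone-b"
```

An intelligent DNS service would make a similar decision using live telemetry rather than a fixed priority list, but the principle is the same: traffic goes only to a zone that is currently passing its health check.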

The cloud lets you implement abstraction at the infrastructure, platform, or application level, but improving availability via multi-zone, multi-region, or multi-cloud increases management complexity. Source: Cloudify

Redundancy nearly always adds to the cost and complexity of cloud networks. However, the expenses incurred as a result of a prolonged network outage could dwarf your investment in redundant data systems. The best way to minimize the risk of a catastrophic system failure is to eliminate single points of failure, and the way to eliminate them is to invest in redundancy.

In other words, you can pay now for redundancy, or pay later for massive data losses.
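
A quick back-of-the-envelope calculation shows why redundancy pays off. The Python sketch below assumes that replicas fail independently and that any single healthy replica can serve traffic; both are simplifying assumptions, and correlated failures (a shared region, a shared deployment tool) will shrink the benefit. The 99 percent per-replica figure is purely illustrative and does not describe any real provider.

```python
def combined_availability(per_replica: float, replicas: int) -> float:
    """Availability of N independent replicas when any one can serve traffic."""
    if not 0.0 <= per_replica <= 1.0:
        raise ValueError("per_replica must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every replica is down at the same time.
    return 1.0 - (1.0 - per_replica) ** replicas


HOURS_PER_YEAR = 24 * 365

for n in (1, 2, 3):
    availability = combined_availability(0.99, n)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{n} replica(s): {availability:.6f} available, "
          f"about {downtime_hours:.2f} hours down per year")
```

Under these assumptions, each added replica cuts expected downtime by a factor of 100, which is the number to weigh against the added cost and complexity.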

Getting a grip on the ‘three vectors of control’

Preparing for a cloud service outage isn’t much different from getting ready for any system failure, according to HyTrust’s Krishnan. No matter the nature of the network, there will always be three pinch points, or “vectors of control,” that managers need to master.

The first is scope, which is the number of objects each admin or script is authorized to act upon at a particular time. Using the Microsoft outage as an example, a deployment task’s scope would limit the number of containers it could operate on at one time.

The second control vector is privilege, which controls what type of action an admin or script (task) can take on an object. An example of a privilege restriction would be a task that is allowed to launch a container but not to destroy one.

The third control point is the governance model, which applies best practices and your own policy to enforce the scope and privileges described above in a “self-driven” manner. For example, a governance policy would limit the number of containers an admin or script can act on at one time to no more than 100 (scope) while also providing a predefined approval process for exceptions to the policy.
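
The three vectors can be sketched as a small policy check. This is a hypothetical illustration in Python, not HyTrust's product or Krishnan's implementation: the `Policy` and `Request` types, the action names, and the `approved_exception` flag are invented for the example, and only the limit of 100 objects comes from the text above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    max_objects: int = 100                                   # scope
    allowed_actions: FrozenSet[str] = frozenset({"launch"})  # privilege


@dataclass(frozen=True)
class Request:
    actor: str                        # admin or script making the request
    action: str                       # e.g. "launch" or "destroy"
    object_count: int                 # how many containers it touches
    approved_exception: bool = False  # set by the approval process


def evaluate(policy: Policy, request: Request) -> str:
    """Return 'allow' or a 'deny: ...' reason for the request."""
    # Privilege: the action itself must be permitted.
    if request.action not in policy.allowed_actions:
        return f"deny: {request.actor} may not {request.action}"
    # Scope: large requests need an approved exception (governance).
    if request.object_count > policy.max_objects and not request.approved_exception:
        return (f"deny: {request.object_count} objects exceeds "
                f"limit of {policy.max_objects}")
    return "allow"


policy = Policy()
print(evaluate(policy, Request("deploy-task", "launch", 50)))
print(evaluate(policy, Request("deploy-task", "launch", 500)))
print(evaluate(policy, Request("deploy-task", "destroy", 1)))
```

In this sketch an approved exception lifts only the scope limit, never the privilege restriction. That is a design choice made for the example; your own governance model may treat exceptions differently.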

Despite the highly publicized outage of February 28, 2017, AWS has experienced much less downtime than cloud-platform competitors Microsoft and Google. Source: The Information

Multi-cloud’s biggest selling points: Backup and redundancy

The major cloud platforms — Amazon, Microsoft, and Google — have different attitudes about transferring data between their respective systems. Google’s new Cloud Endpoints API is designed to integrate with AWS’s Lambda function-as-a-service product, as NetworkWorld’s Brandon Butler writes in an April 27, 2017, article. This allows you to use Endpoints to manage API calls associated with a Lambda application.

Among AWS’s cloud connectivity products are Direct Connect, Amazon Virtual Private Cloud, and AWS Storage Gateway; coming soon is an easier way to run VMware workloads in AWS clouds. Yet AWS executives continue to downplay the use of multiple clouds, even for backup and redundancy. They insist it’s easier, less expensive, and just as effective to split your workloads among various regions of the AWS cloud.

Many analysts counter this argument by pointing out that the future belongs to multi-cloud, primarily because of the approach’s built-in redundancy and resiliency in the event of outages, whether caused by human error or a natural disaster. Butler suggests that one way to hedge your bets as the multi-cloud era begins is to adopt a third-party cloud integration tool rather than rely on your cloud provider’s native tools.

(For a great primer on everything multi-cloud, check out TechRepublic’s “Multi-cloud: The Smart Person’s Guide,” published on May 4, 2017.)