With Andrew Buss, originally published on The Register
Cloud computing is all about providing services as reliably and transparently as possible to the end-user. One of the pivotal underpinnings of cloud is that applications are not set up on dedicated infrastructure, but are instead accessed as a service via communications networks such as the Internet.
When it comes to Private Cloud, the workloads and connectivity are typically architected and managed together, with insight into both current needs and planned future growth strategy. This means there is a reasonable degree of predictability in setting up and managing services, although our research shows that getting it all to run like clockwork remains challenging for many.
Public Cloud, on the other hand, is another matter entirely. Customers may come and go unpredictably, as may the workloads they wish to run. This leads to three main issues that need to be addressed.
The first challenge is coping with growth through the addition of new customers. Adding incremental capacity needs to be as predictable and seamless as possible for networking – as well as storage and compute – and must not require more people to manage unless there is a step change in what is being put in place. The end result, though, is that scalability for a Cloud provider works in only one direction: up. It is difficult to cope with an overall drop in demand without removing capacity that has already been paid for, so it is vital to make effective use of the existing investment in infrastructure before adding more.
The second challenge is dealing with peaks of demand. Workloads are unpredictable, and can require significant additional capacity to cope if all workloads are to be catered for. Having a mix of customers so that individual peaks are spread out can alleviate some of this. It is also often easier for large-scale service providers to absorb the impact of peak loads due to their overall size and capacity.
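The effect of mixing customers can be seen with a little arithmetic. The sketch below is a hypothetical illustration – the customer counts, load units and busy hours are all invented – but it shows why provisioning a shared pool for the peak of the combined load needs far less capacity than provisioning every customer for their own individual peak:

```python
# Hypothetical illustration of spreading customer peaks across a shared
# pool. All figures are invented: ten customers, each with a baseline
# load of 10 units and a 50-unit spike at its own busy hour (two of
# them happen to share hour 3).
HOURS = 24
peak_hours = [0, 3, 3, 7, 9, 12, 15, 18, 21, 23]

def demand(peak_hour, hour):
    """Load for one customer at a given hour of the day."""
    return 10 + (50 if hour == peak_hour else 0)

# Dedicated infrastructure: each customer must be provisioned for its
# own peak, so the peaks simply add up.
dedicated = sum(max(demand(p, h) for h in range(HOURS)) for p in peak_hours)

# Shared pool: provision only for the busiest hour of the combined load,
# which is much lower because the peaks rarely align.
shared = max(sum(demand(p, h) for p in peak_hours) for h in range(HOURS))

print(f"dedicated capacity: {dedicated}")  # 10 customers x 60 = 600
print(f"shared capacity:    {shared}")     # hour 3: 2 x 60 + 8 x 10 = 200
```

In this toy model the shared pool needs only a third of the capacity, and the gap widens as more customers with uncorrelated peaks are added.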
In order to increase utilisation of the network and deal with peak demand while using networking equipment efficiently, it will generally be necessary to identify important workloads and allow them to run in preference. Other, less critical traffic or applications can be slowed or delayed. This will usually be accomplished through either service level agreements or tiering of services, where those paying for the privilege have priority of access.
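In practice this prioritisation is done by the networking gear itself, but the underlying idea can be sketched in a few lines. The tier names and capacity figure below are invented for illustration – this is a minimal strict-priority scheduler, not any particular vendor's implementation:

```python
# Minimal sketch of strict-priority scheduling across service tiers.
# Tier names and the capacity figure are invented; lower number = served first.
PRIORITY = {"premium": 0, "standard": 1, "best-effort": 2}

def schedule(requests, capacity):
    """Serve up to `capacity` requests, highest tier first; defer the rest."""
    queue = sorted(requests, key=lambda r: PRIORITY[r["tier"]])
    return queue[:capacity], queue[capacity:]

requests = [
    {"id": 1, "tier": "best-effort"},
    {"id": 2, "tier": "premium"},
    {"id": 3, "tier": "standard"},
    {"id": 4, "tier": "premium"},
]

served, deferred = schedule(requests, capacity=2)
print([r["id"] for r in served])    # premium customers go first: [2, 4]
print([r["id"] for r in deferred])  # standard and best-effort wait: [3, 1]
```

A real network would use weighted fair queuing or similar rather than pure strict priority, so that best-effort traffic is slowed rather than starved entirely – but the tiering principle is the same.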
The third, and arguably most important, aspect of Public Cloud networking is not so much to do with the physical equipment as with the change management involving people, policies and processes. In recent major incidents at providers such as Amazon AWS and Microsoft Office 365, it was not equipment failure that led to widespread and unexpected service outages. Instead, it was changes made to the established setup that caused the problems.
It is vital to understand service dependencies and test changes – even going so far as to simulate or model them where possible – before putting them into production. And it is just as important to be able to roll back changes when service quality is affected. For this to be effective, there needs to be a measurement system in place that proactively monitors end-to-end service quality and can quickly flag when things start to veer outside normal limits, before the problem snowballs.
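The core of such a measurement system can be sketched simply: keep a rolling baseline of a service metric and raise a flag when a new reading drifts well outside it. The class name, window size and threshold below are all invented for illustration; a production system would track many metrics and far more history:

```python
from collections import deque
from statistics import mean, stdev

# Hedged sketch of proactive service-quality monitoring: flag any
# measurement that sits more than a few standard deviations away from
# a rolling baseline. All names and thresholds are invented.
class ServiceMonitor:
    def __init__(self, window=20, sigmas=3.0):
        self.samples = deque(maxlen=window)  # rolling baseline window
        self.sigmas = sigmas                 # alert threshold

    def record(self, latency_ms):
        """Record a sample; return True if it deviates from the baseline."""
        anomalous = False
        if len(self.samples) >= 5:  # need enough history to judge
            baseline, spread = mean(self.samples), stdev(self.samples)
            if abs(latency_ms - baseline) > self.sigmas * max(spread, 1e-9):
                anomalous = True
        self.samples.append(latency_ms)
        return anomalous

monitor = ServiceMonitor()
normal = [100, 102, 98, 101, 99, 100, 103, 97]   # steady-state latency
alerts = [monitor.record(x) for x in normal]      # no alerts expected
spike = monitor.record(250)                       # sudden spike is flagged
print(any(alerts), spike)
```

Flagging a change in this way as soon as it pushes a metric out of its normal band is what allows a rollback to happen before customers notice, rather than after.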