The long and winding road to application service availability

Strange as it may seem, the days of hyper-resilient systems have yet to arrive. Applications still suffer service disruptions on a reasonably frequent basis. Our research shows that some 84 percent of organisations experience some form of disruption due to application failure at least once a quarter, with almost one organisation in four encountering such incidents at least every week.

A quick dive into the figures shows that the most frequent cause of application disruption is “software component failure”, with “network failure or poor performance” identified as the second most likely source of problems. “Physical component failure” trails some way behind in third place, with “power outages or brownouts” lagging far at the back as a major factor in application service disruption. Clearly, power is not yet a major problem for most organisations when it comes to unscheduled interruptions to service.

In fact, these results are pretty much in line with experience. Twenty years ago the proportion of interruptions caused by hardware component failure would have been much higher, but systems have got better over time. So what’s causing the software problems? The figure below gives some hint as to where most issues lie, and the answer is, essentially, to be found in the people and process side of IT operations and in the way organisations handle testing and service scoping.

As can be seen, inadequacies in configuration and change management are identified as the most frequent cause of application failure, with system sizing and capacity planning challenges coming in close behind. More general “IT staff error” is given as the third most common cause of service interruption, with “patch management issues” having a similar interruption profile. “Security breaches” are far and away the least problematic challenge, at least as far as people are willing to admit in an anonymous survey.

The Freeform Dynamics report “Risk and Resilience – The application availability gamble” clearly highlights that, despite the increasing use of sophisticated management tools, virtualisation systems and even especially resilient server and storage platforms, there is still considerable room for improvement in most areas of service provision. Further, it is very apparent that the area with most scope to increase resilience is to be found in the processes that organisations employ to support their applications and the underlying IT infrastructure.

Hardware is getting better and more highly available, and many software and management developments, including but not limited to virtualisation, are also making service resilience more affordable. But unless the support process side of things changes, availability will improve only gradually and we will not see significant gains. Alas, with organisations putting greater pressure on IT staff day by day, this is perhaps the hardest area to address in the current work and economic climate, but it is one where major benefits are obtainable.