She told iTWire that as the availability bar rose, the ability of IT organisations to meet it was diminishing.
"They are expected to do more with less. To make matters worse, application and IT infrastructure are becoming increasingly complex. Even when budget permits, finding skilled personnel is difficult," Dhand said.
Dhand has been at Nimble since 2012. Prior to that she was senior product manager, cloud infrastructure and management group at VMware.
The rest of the article is in her own words.
The uptime of critical business systems is particularly important in high-stakes industries such as finance and healthcare, and for service providers whose success relies on quickly fulfilling customer needs.
As the availability bar rises, IT organisations’ ability to meet it is diminishing. They are expected to do more with less. To make matters worse, application and IT infrastructure are becoming increasingly complex. Even when budget permits, finding skilled personnel is difficult.
The challenge of being reactive
The traditional reactive and therefore inefficient approach to availability is inadequate to meet these new expectations. Once there is an availability disruption it can take significant time, effort and skills to restore the system – all very precious commodities. It can even lead to a non-productive blame game between owners of the different tiers of the application stack.
Ensuring availability involves designing redundancy across every part of the application stack – typically the hardware and software infrastructure, database tier, middleware tier and the top tier. The theory is that if a piece of software or hardware fails, other parts will take over. Some systems such as storage employ RAID and similar approaches. Here the data on the faulty component is reconstructed using information stored on other similar components. Each layer of the stack is monitored using raw health and performance data, often using separate monitoring systems. When an issue is detected, an alert is sent with no context of impact on the application and its availability.
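To make the RAID idea concrete, here is a minimal sketch of single-parity (RAID 5-style) reconstruction: the parity block is the XOR of the data blocks, so a lost block can be rebuilt by XORing the parity with every surviving block. This is illustrative only – real arrays work on fixed-size stripes with rotating parity and far more sophisticated layouts.

```python
# Illustrative single-parity reconstruction (RAID 5-style).
# Real arrays stripe fixed-size blocks and rotate parity across disks.
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR all data blocks together to produce the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

def reconstruct(surviving_blocks: list[bytes], parity_block: bytes) -> bytes:
    """Rebuild the lost block: XOR of the parity with every surviving block."""
    return parity(surviving_blocks + [parity_block])

data = [b"disk0data", b"disk1data", b"disk2data"]
p = parity(data)
# Disk 1 fails; its contents are recovered from the other disks plus parity.
lost = reconstruct([data[0], data[2]], p)
assert lost == data[1]
```

The same XOR relationship holds whichever single block is lost, which is why one parity disk protects against any one drive failure.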
The IT admin now begins the fire drill – the arduous and time-consuming task of troubleshooting. Typical steps are:
- Determine if the alert was a false alarm or if there really is a problem.
- Determine the severity of the issue – e.g. whether it disrupted application availability or is local to a given layer and absorbed by the redundancy design.
- Determine the root cause of the problem. Often this requires collaboration across experts for each stack who may belong to different teams.
- Often the vendor gets involved and the process of system data collection begins – sending countless log files, outputs of diagnostic commands and even config files to their Tech Support.
All this needs to happen in a very compressed period.
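The manual triage flow above can be sketched in code. This is a hypothetical simplification – every name here is an illustrative placeholder, not a real monitoring API – but it shows how the first two steps (false-alarm check, severity assessment) gate the expensive cross-team escalation.

```python
# Hypothetical sketch of the manual triage flow described above.
# All names are illustrative placeholders, not a real monitoring API.
from dataclasses import dataclass

@dataclass
class Alert:
    layer: str        # e.g. "storage", "database", "middleware"
    metric: str
    value: float
    threshold: float

def triage(alert: Alert, redundancy_ok: bool) -> str:
    # Step 1: rule out a false alarm.
    if alert.value <= alert.threshold:
        return "false alarm"
    # Step 2: assess severity - absorbed by redundancy, or app-impacting?
    if redundancy_ok:
        return f"degraded: {alert.layer} issue absorbed by redundancy"
    # Steps 3-4: escalate for cross-team root-cause analysis
    # and vendor log collection.
    return f"critical: escalate {alert.layer} alert for root-cause analysis"
```

In practice each of these branches can consume hours of skilled staff time, which is precisely the cost the predictive approach below aims to remove.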
The answer is a predictive approach
The status quo isn’t enough to achieve the new standards of availability. IT admins need the ability to predict and fix problems before they occur. The system should be able to fix itself, preventing application downtime. If the system cannot fix the problem itself, it should give the admin a precise recommendation so the fix can be applied proactively. For harder-to-predict issues, the system should make troubleshooting instant and painless. False alarms have no place in a world that demands unprecedented data availability.
Machine learning and good data science make all this possible. Modern data centre products are instrumented with millions of sensors. Vast amounts of real-time telemetry are collected from each layer of the stack, across many deployments. This telemetry contains data about performance, health, configuration, events, resource utilisation and various system states.
This is done across a large installed base, building knowledge of diverse environments and real-world configurations. The data is then processed in powerful analytics engines, and a deep understanding of the entire stack is developed. The system learns complex patterns in each layer of the stack, and how these patterns interact across the layers, over time. Models are created, then refined continuously using data from the installed base as well as new information fed in by the product vendor. A clear, high-confidence understanding of normal versus abnormal behaviour is established.
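One simple way to picture "normal versus abnormal" is a statistical baseline learned from historical telemetry, with new readings flagged when they fall far outside it. The sketch below uses a z-score against a learned mean and standard deviation for a single metric; production systems use far richer models across many metrics and many deployments, but the principle is the same.

```python
# Minimal illustration of learning "normal" from telemetry and flagging
# abnormal readings. Real analytics engines model many metrics jointly.
import statistics

def fit_baseline(history: list[float]) -> tuple[float, float]:
    """Learn the normal operating range from historical telemetry."""
    return statistics.mean(history), statistics.stdev(history)

def is_abnormal(value: float, mean: float, std: float, z: float = 3.0) -> bool:
    """Flag readings more than z standard deviations from the baseline."""
    return abs(value - mean) > z * std

# Hypothetical latency samples (ms) from a healthy system.
history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
mean, std = fit_baseline(history)
is_abnormal(10.2, mean, std)   # typical reading -> False
is_abnormal(45.0, mean, std)   # far outside the baseline -> True
```

Because the baseline is learned rather than hand-set, alert thresholds adapt to each environment, which is what drives down false alarms.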
The result is the ability to predict, with high confidence, issues that could potentially cause application downtime. A predictive engine determines how to prevent the issue. Systems are designed to take these preventative steps automatically or defer to the IT admin. There is a new breed of data centre products that have this capability built in – from instrumentation and collection of telemetry to analysis, prediction and prevention. There is even a class of products where the sensors are part of the fundamental product design, built in from the first line of code.
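The act-automatically-or-defer decision can be sketched as a simple confidence-gated policy. This is an assumed illustration, not any vendor's actual logic: a fix is applied automatically only when prediction confidence is very high and an automated remedy exists; otherwise the system recommends the fix to the admin or keeps refining its model.

```python
# Hypothetical prevent-or-defer policy, gated on prediction confidence.
# Thresholds and return strings are illustrative assumptions.
def handle_prediction(confidence: float, auto_fix_available: bool,
                      auto_threshold: float = 0.95) -> str:
    """Decide whether the system self-fixes or defers to the IT admin."""
    if auto_fix_available and confidence >= auto_threshold:
        return "apply fix automatically"
    if confidence >= 0.5:
        return "recommend fix to IT admin"
    return "monitor and keep refining the model"
```

Keeping the admin in the loop for lower-confidence predictions is what separates this approach from blind automation.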
This predictive approach combined with machine learning techniques has made it possible to achieve availability levels once deemed impossible. IT admins receive fewer and more meaningful alerts. They can determine the root cause instantly, across the application stack, all by themselves.
There is no need to understand the inner workings of each layer of the stack or to engage with several teams. When the admin does need vendor technical support, the experience is very different from the traditional way. Tech Support already has deep knowledge of the customer environment and can start suggesting fixes within an incredibly short period.
This frees up IT to focus on more meaningful activities such as planning and executing on innovative ways of solving business problems.