In this post, the latest in our series on Understanding the Machine Learning in AIOps, we’ll start to investigate how machine learning can solve many of the problems that are faced every day in IT Operations, and specifically how it helps with the process of data ingestion and the reduction of Alert Fatigue. Our previous posts looked at some of the terminology used in and around machine learning, and gave a high-level explanation of some machine learning techniques (clustering, classification, and regression) and the types of problem that they are best suited to solving.
What are IT Ops Teams trying to achieve?
It is stating the obvious to say that the ongoing objective of IT operations teams is to minimize resolution times, reduce costs, and eliminate customer-impacting outages, and there are a huge variety of ways for a team to meet those goals.
Breakages are a fact of life in any system, regardless of the underlying architecture. It is how an operations team deals with those failures, and the quality of the tools at their disposal, that allow them to achieve their goals and, ultimately, meet the needs of the business.
No one deliberately sets out to design a system that is hard to manage or prone to failure, but some architectures can increase the demands on an operations team. Often, the system architectures that a business requires — virtualisation, micro-services, continuous deployment, etc. — are the ones that can add significant management complexity, and increase the number of points of failure in that system, making the tools that are available to an operations team all the more important.
The Pain Points
The pain points that ultimately manifest themselves as long resolution times and customer-facing outages stem from things such as:
- Alert Fatigue
- Difficulty in identifying the cause of a problem
- Inefficient communication
- Poor collaboration
- Poor remediation processes
Adopt an approach and toolset that solves these issues, and your team is no longer fire-fighting, but has the time to improve; you are using time now to save time in the future, all while meeting the commitments made to your customers. These are the problems that AIOps addresses.
Where does Machine Learning Fit into IT Operations?
As we covered in our first post in this series, machine learning and artificial intelligence are used everywhere, and there is no denying that these technologies can produce some real “wow” solutions. But simply throwing machine learning at a problem is not the answer.
The variety of techniques is huge, and the right ones need to be adopted for the problem at hand. And although it seems sacrilege to say it in a post about machine learning, there are some circumstances where a logic-based algorithm may be the best way to go. For other pain points, whether or not the behind-the-scenes algorithms use machine learning techniques is not the whole story. Algorithms need to be coupled with a clean and efficient user experience, and sometimes it’s the UX innovations that are key.
So let’s dig into some of the pain points around event ingestion, and see where machine learning techniques can provide some or all of the solution.
Alert Fatigue
Examples of alert fatigue exist everywhere. It is exemplified by those things that happen around us every day that we ignore because they are so commonplace. When was the last time that you really took notice of a fire drill?
Alert fatigue comes about through the avalanche of data that modern systems generate. In a modest-sized enterprise, an IT infrastructure can generate millions of events a day; add raw time-series data to that, and the volume increases hugely. And, buried in all of those application heartbeats and ‘authentication failed’ messages will be the handful of alarms that pinpoint a customer-impacting failure and its underlying cause.
Minimizing alert fatigue isn’t simply about reducing the volume of events that need to be processed, though — that’s easy, and the wrong approach. Filtering an event stream to ignore certain sources, and thresholding to only process “Critical” alarms are examples of techniques that will reduce the volume of data, but at the same time discard what could be the cause of your problem. They are also techniques that need maintaining — by a human.
One of the most enduring techniques for volume reduction that can still be of benefit is event deduplication — the act of collapsing repeating events to a single alert. On its own it’s an approach that can no longer produce the impact required. The volume of data, even after deduplication, is still huge. But to its advantage, it doesn’t remove data from the system. All your data is still there, it’s just that your team is presented with less of it.
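To make the idea concrete, here is a minimal sketch of event deduplication in Python. It assumes each raw event is a dict carrying a source, a check name, a description, and a timestamp; the field names and the signature we key on are illustrative choices, not those of any particular monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    # One alert per unique signature, with a count of the raw events it represents.
    signature: tuple
    count: int = 0
    first_seen: float = 0.0
    last_seen: float = 0.0

def deduplicate(events, alerts=None):
    """Collapse repeating events into alerts keyed by a (source, check, description) signature."""
    alerts = alerts if alerts is not None else {}
    for event in events:
        sig = (event["source"], event["check"], event["description"])
        alert = alerts.get(sig)
        if alert is None:
            alert = Alert(signature=sig, first_seen=event["timestamp"])
            alerts[sig] = alert
        alert.count += 1
        alert.last_seen = event["timestamp"]
    return alerts
```

A thousand identical ‘authentication failed’ events become one alert with a count of 1,000; nothing is thrown away, but far less is put in front of the team.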
The real solution to alert fatigue needs a different approach, and it’s an approach made up of several stages.
Yes, it is about discarding those alerts that are meaningless, but it is also about processing what’s left in a way that allows your ITOps team to get to the cause of the problem quickly, and displaying the information that gets them there in an easily consumable way.
It’s about knowing what “normal” is, and what it isn’t, and this is where machine learning and data science techniques are needed — using past data to provide a benchmark of what is normal for your infrastructure.
Is that Normal?
For the initial part of the process, AIOps uses a concept of “entropy” as a way of achieving noise reduction (see our earlier post about entropy). Entropy encapsulates the “What,” “When,” and “Where” of an event: What is the event? When does it happen? And where in the infrastructure is it coming from? We use that to build a picture of what is “normal,” based on the analysis of past events, so we can then evaluate whether events entering the system require an action or not.
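As a rough sketch of the idea (not Moogsoft’s actual algorithm), one way to score events by how surprising they are is to measure the information content of each event’s what/where/when signature against historical frequencies. The field names, the assumption of datetime timestamps, and the hour-of-day bucketing below are our own simplifications.

```python
import math
from collections import Counter

def build_frequency_model(historical_events):
    """Count how often each (what, where, when) signature occurred in the past.
    'when' is coarsened to the hour of day so regular daily chatter counts as normal."""
    counts = Counter(
        (e["description"], e["source"], e["timestamp"].hour)
        for e in historical_events
    )
    return counts, sum(counts.values())

def information_score(event, counts, total):
    """Surprise, in bits, of seeing this event given past behaviour.
    Frequent, expected events score low; rare ones score high."""
    key = (event["description"], event["source"], event["timestamp"].hour)
    p = (counts.get(key, 0) + 1) / (total + 1)  # smoothing so unseen events stay finite
    return -math.log2(p)
```

Events scoring below a chosen threshold are the routine heartbeats; those above it are the ones worth a human’s attention.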
But there is a different type of data that needs additional treatment — time series data.
Time Series Data
Time series data is the periodic status data reported from a server or application by a monitoring solution. If all is well, your server may well report that it has “25% CPU utilization” and “45% free disk capacity,” and it will keep doing that every five minutes, until you tell it to stop. That’s another 288 events per metric per server per day, almost every one of which will be reporting, “I’m fine, there’s nothing to see here!”
An ops team doesn’t need to see this sort of data; they only need to know when something has gone wrong, when something is out of the ordinary. And this is where we step into the world of outlier detection and anomaly detection.
Outlier detection and anomaly detection are terms that are sometimes used interchangeably. At Moogsoft we favour the following definitions.
An outlier is a value of a metric that is different from other values of that same metric when you would expect them all to be similar. For example, the CPU load on the servers behind a load balancer may be expected to lie within a very specific range. Let’s say CPU utilization fluctuates between 40% and 50%, but at one specific time of day there is a single server running at 70% CPU. That specific measurement may be classified as an outlier: it is different to all the other servers’ CPU usage.
The presence of an outlier may also indicate an error in your monitoring, a value that is so far from expected that perhaps it’s not the systems being monitored, but the method itself. However, just because a value is an outlier in one context, it doesn’t necessarily mean that it is anomalous. An anomaly is where a measurement doesn’t follow historical trends. So our hot-running server, whilst an outlier in the context of similar servers at a specific point in time, may always run at 70% because, for the sake of argument, it is the master server in an HA group. If its utilization spiked to say 95%, that would be an anomaly because it is not following its historical behaviour.
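A toy illustration of the distinction, using nothing more sophisticated than z-scores (real implementations are considerably more robust): an outlier test compares a reading against its peers at the same moment, while an anomaly test compares it against the same metric’s own history.

```python
import statistics

def is_outlier(value, peer_values, z_threshold=3.0):
    """Outlier: the reading is far from its peers at the same point in time."""
    mean, stdev = statistics.mean(peer_values), statistics.pstdev(peer_values)
    return stdev > 0 and abs(value - mean) / stdev > z_threshold

def is_anomaly(value, own_history, z_threshold=3.0):
    """Anomaly: the reading is far from the metric's own past behaviour."""
    mean, stdev = statistics.mean(own_history), statistics.pstdev(own_history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold

# The hot-running master server: an outlier among its peers,
# but not an anomaly against its own history of roughly 70% CPU.
peers, history = [42, 45, 48, 44], [68, 71, 69, 72, 70]
print(is_outlier(70, peers))      # True:  unlike the other servers right now
print(is_anomaly(70, history))    # False: consistent with its own past
print(is_anomaly(95, history))    # True:  the spike that matters
```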
But what has this got to do with machine learning and alert fatigue? Simply ignoring time series data is not an option, despite the volume, so instead of forwarding every measurement to your operators, you should forward only the anomalies. But how do we do that?
A simplistic approach is to use a static threshold: if CPU is greater than 80%, it’s an anomaly. But what is right for one set of servers won’t be right for another. Maybe CPU spikes are expected at certain times of day, but running at 95% in the middle of the night isn’t. Accounting for these scenarios quickly becomes too complex for manually created thresholding rules.
The more complex your criteria become, the more complex the underlying algorithms tend to be. Some very effective (and non-machine-learning) algorithms do exist to capture these use cases, techniques such as dynamic thresholding and “seasonality & trend decomposition.” But to capture the full array of scenarios, you need to add machine learning techniques to your algorithmic toolbox.
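To make the non-machine-learning side concrete, here is a sketch of anomaly detection via seasonal and trend decomposition using statsmodels. It assumes a metric sampled every five minutes with a daily cycle (hence period=288) and at least two full days of history; the points flagged are those whose residual, what remains after removing trend and the daily pattern, is unusually large.

```python
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_anomalies(values, period=288, z_threshold=3.0):
    """Return indices whose residual (value minus trend and seasonal pattern) is unusually large.
    period=288 assumes one sample every five minutes and a daily cycle."""
    result = seasonal_decompose(values, model="additive", period=period)
    resid = np.asarray(result.resid, dtype=float)
    mask = ~np.isnan(resid)  # the moving-average trend leaves NaNs at both ends
    threshold = z_threshold * resid[mask].std()
    anomalous = np.zeros(len(resid), dtype=bool)
    anomalous[mask] = np.abs(resid[mask]) > threshold
    return np.where(anomalous)[0]
```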
Unsupervised clustering techniques such as k-means, or nearest-neighbour clustering are often used for outlier detection. But when business needs require us to identify whether a metric is following historical behaviours and trends, we soon get into the state-of-the-art deep-learning-based solutions using recurrent neural networks — solutions involving techniques such as “Hierarchical Temporal Memory” or “Long Short-Term Memory.”
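For instance, a simple unsupervised sketch of k-means-based outlier detection with scikit-learn: cluster the readings, then treat the samples furthest from their cluster centres as outliers. The feature layout (one row per server, with columns such as CPU, memory, and I/O taken at the same time) and the cutoff quantile are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outliers(samples, n_clusters=3, quantile=0.99):
    """Flag the samples that sit furthest from their nearest cluster centre.
    `samples` is an (n_samples, n_features) array, e.g. one row per server."""
    samples = np.asarray(samples, dtype=float)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(samples)
    # Distance of every sample to the centre of the cluster it was assigned to.
    distances = np.linalg.norm(samples - model.cluster_centers_[model.labels_], axis=1)
    cutoff = np.quantile(distances, quantile)
    return np.where(distances > cutoff)[0]
```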
Enrichment
At this point in the life cycle of an alert we have de-duped our event stream, removed the noise, and are reporting only anomalies from our time-series monitoring solution. But the impact of machine learning on data ingestion doesn’t end there.
The more AIOps knows about an alert, the higher the accuracy with which that alert can be processed. But the richness of the data in an alert is highly dependent upon its source. Events forwarded from an APM platform will contain highly relevant data about an application and the services that it provides. The SNMP traps generated by your network hardware comply with strict protocols and generally contain well-structured, explicitly labeled data.
Contrast that with the events from a data aggregator or a raw application log file, and the situation is very different. Your systems need to be able to extract the relevant parts of the message to create a coherent alert.
As always, there is a solution that relies upon manually created and maintained rules: regular expressions to match tokens such as IP addresses, dates, and times; keyword matching such as “LinkUp” and “LinkDown” to match fail/clear pairs, or “login fail” and “invalid password” to indicate the alert is related to a security issue. While this approach has utility and can be highly effective, it is outdated. The complexity and quantity of the different look-ups soon becomes overwhelming: the maintenance issue is obvious, and it is surprisingly resource intensive when applied in real time against thousands of alerts per second.
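A taste of what those hand-maintained rules look like in practice, using made-up patterns and keywords; every new log format means another rule to write, test, and keep current.

```python
import re

# Hand-maintained extraction rules of the kind described above.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
TIMESTAMP_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
SECURITY_KEYWORDS = ("login fail", "invalid password", "access denied")

def parse_log_line(line):
    """Pull a few structured fields out of a raw log line using fixed rules."""
    ts_match = TIMESTAMP_PATTERN.search(line)
    return {
        "ip_addresses": IP_PATTERN.findall(line),
        "timestamp": ts_match.group(0) if ts_match else None,
        "is_security": any(k in line.lower() for k in SECURITY_KEYWORDS),
    }

print(parse_log_line("2024-05-01 03:12:44 login fail for admin from 10.1.2.3"))
```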
It will come as no surprise by now that machine learning techniques can help us out. Named Entity Recognition techniques borrowed from the field of Natural Language Processing provide more efficient ways of extracting different types of tokens from the event text. Supervised learning techniques such as classification can help us identify the class of an alert: is it related to “audit” or “security”? Is it from an application or a piece of network hardware? Is it a state-transition event that forms part of a fail/clear pair?
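By way of illustration only, here is a minimal supervised-classification sketch using scikit-learn: a bag-of-words model trained on a handful of invented, pre-labelled alert texts. In practice the training set would come from your own alert history, and the feature engineering would be far richer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny, invented training set; real labels would come from historical alerts.
texts = [
    "login fail invalid password for user admin",
    "audit log rotated successfully",
    "LinkDown interface GigabitEthernet0/1",
    "LinkUp interface GigabitEthernet0/1",
    "unauthorized access attempt blocked",
    "interface eth0 carrier lost",
]
labels = ["security", "audit", "network", "network", "security", "network"]

# TF-IDF features plus a linear classifier: a simple supervised baseline
# for assigning a class to each incoming alert.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

print(classifier.predict(["invalid password for user root"]))  # expected: 'security'
print(classifier.predict(["LinkDown interface eth3"]))         # expected: 'network'
```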
So far we’ve looked at some of the ways machine learning can be used to help ingest data and reduce the volume of data presented to your operations team whilst retaining the important stuff. But that’s not where we end our quest to reduce Alert Fatigue. In our next post we’ll continue the journey of an alert through the management process, looking at alert correlation and remediation, and at how machine learning can further reduce the pain points that stop your operations team from being as effective as it could be.