In this, the last installment in our series on “Understanding the Machine Learning in AIOps,” we continue the journey of an alert through the management process, and our investigation of how machine learning can be used to reduce alert fatigue.
In our previous post, we concentrated specifically on reducing alert fatigue during event ingestion. Prior to that, we looked at the background to machine learning, examining some of the terminology and buzzwords, along with an overview of the different types of problems that machine learning can be applied to, such as classification and regression, and the techniques used to solve them, such as clustering and supervised and unsupervised learning.
Alert Fatigue
During the event ingestion process, deduplication and unsupervised machine learning techniques such as entropy analysis are well-known approaches that AIOps exploits to reduce event noise. Add in anomaly and outlier detection for processing time-series data, and we have an effective way of reducing thousands of events to little more than a handful of alerts. But reducing alert fatigue doesn’t stop there.
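As a rough illustration of the entropy idea (a sketch, not Moogsoft’s actual implementation), an entropy-style filter can score each event by how informative its wording is relative to the rest of the stream, and discard anything that scores below a threshold. The field names, scoring scheme, and threshold below are assumptions made for the example.

```python
import math
from collections import Counter

# A minimal sketch of entropy-style noise scoring, assuming events are dicts
# with a free-text "description" field. The scoring scheme and threshold are
# illustrative assumptions, not a specific vendor algorithm.

def token_frequencies(events):
    """Count how often each token appears across the whole event stream."""
    counts = Counter()
    for event in events:
        counts.update(event["description"].lower().split())
    return counts

def information_score(event, counts, total):
    """Average self-information (-log2 p) of the event's tokens.
    Common, repetitive wording scores low; rare, specific wording scores high."""
    tokens = event["description"].lower().split()
    if not tokens:
        return 0.0
    return sum(-math.log2(counts[t] / total) for t in tokens) / len(tokens)

events = (
    [{"description": "link down on interface ge-0/0/1"},
     {"description": "power supply failure on core-router-7"}]
    + [{"description": "heartbeat ok"}] * 8
)

counts = token_frequencies(events)
total = sum(counts.values())

THRESHOLD = 3.0  # assumed cut-off; in practice this would be tuned per stream
for event in events:
    if information_score(event, counts, total) >= THRESHOLD:
        print("kept:", event["description"])  # the repetitive heartbeats are dropped
```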
The whole concept of situational awareness, part of the founding principles behind AIOps, brings about the next layer of relief from the pain of alert fatigue. Add a sprinkling of Probable Root Cause into the mix, and we start to attack our other pain points too, specifically: identifying the cause of an incident, and poor remediation processes.
Moogsoft has leveraged machine learning to help customers reduce the alert noise that so many enterprises are struggling to get a grip on.
Once an alert has been ingested, assigned an entropy, deduplicated, and enriched with external data, we have everything needed to inform the creation of actionable incidents, or “Situations,” in the form of operationally significant groups of alerts. Sometimes the required information may be incomplete. Sometimes the enrichment data may have been retrieved from multiple sources, leading to conflicts in what should be canonical data — not ideal, but the sort of real-world problem that needs to be handled by IT management systems.
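To picture how conflicting enrichment data might be reconciled in practice, here is a minimal sketch that merges attributes from several sources using an assumed priority order; the source names, fields, and priority list are all illustrative.

```python
# A minimal sketch of merging enrichment data from multiple sources, assuming
# a fixed priority order resolves conflicts. Source names, fields, and the
# priority list are illustrative assumptions.

SOURCE_PRIORITY = ["cmdb", "dns", "discovery"]  # highest priority first

def merge_enrichment(alert, enrichments):
    """enrichments maps source name -> dict of attributes for this alert."""
    merged = dict(alert)
    # Apply lowest-priority sources first so higher-priority values win.
    for source in reversed(SOURCE_PRIORITY):
        merged.update(enrichments.get(source, {}))
    return merged

alert = {"host": "web-01", "severity": "critical"}
enrichments = {
    "dns":       {"location": "london-dc2"},
    "cmdb":      {"location": "london-dc1", "owner": "payments-team"},
    "discovery": {"os": "linux"},
}
print(merge_enrichment(alert, enrichments))
# {'host': 'web-01', 'severity': 'critical', 'os': 'linux',
#  'location': 'london-dc1', 'owner': 'payments-team'}
```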
The process of grouping alerts is generally referred to as “correlation.” The dictionary definition of correlation is “a mutual relationship or connection between two or more things which tend to occur together in a way not expected on the basis of chance alone.”
Correlation
In the world of IT Operations, correlation is often interpreted as the ability to make deep connections between seemingly disparate data, and while that is certainly part of the challenge, it isn’t the only challenge. What constitutes a relationship or a connection in the first place? Well, it’s anything the managing enterprise wants it to be, and what is meaningful in one organisation may mean nothing in another.
There are certainly failure scenarios where the correlation need is universal across all organisations — correlating link-up and link-down pairs to identify a flapping interface, for example. But there are many use cases in which the approach to correlation chosen by one enterprise may not align with how another needs to manage its infrastructure.
Consider an ISP that chooses to manage its street-level access infrastructure based on network topology. What about the retail bank that wants to manage its branches and Automated Teller Machine network based on street address? Or the Web Service Provider that finds the most efficient way to manage its infrastructure is to group alerts in a way that mirrors the remedial steps its operators need to take, even though the failures may be on disparate parts of the infrastructure and share no form of topological or geographical proximity?
The core events across all of these use cases will be very similar, maybe even identical. But there isn’t, yet, a one-size-fits-all algorithm that can understand that these otherwise identical alerts need to be handled in a certain way in one organisation and a completely different way in another.
Consequently, in order to address the wide variety of operational methodologies across different enterprises, AIOps uses multiple criteria to correlate alerts: event arrival times, network topological proximity, and contextual similarity between combinations of alert attributes.
In AIOps we call the processes responsible for finding the connections between alerts and for creating situations “Sigalisers”. A single instance of AIOps can run multiple types of Sigaliser and multiple instances of each Sigaliser concurrently.
Where necessary, events can be routed along different processing paths. This allows different sigalisers to process different events depending upon each event’s characteristics. For example, “Availability” events from the core of a network may be processed independently of other events by the time-based or topology-based sigalisers, while “Application” and “Security” events may need processing together by a contextual similarity based sigaliser.
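One simple way to picture that routing is as a rules table keyed on event attributes. The sketch below is an illustration only; the event classes, sigaliser names, and matching rules are assumptions, not product configuration.

```python
# A minimal sketch of routing events to different sigalisers based on their
# attributes. The event classes, sigaliser names, and rules are illustrative
# assumptions rather than real product configuration.

ROUTES = [
    # (predicate, destination sigalisers)
    (lambda e: e["class"] == "Availability" and e["zone"] == "core",
     ["tempus", "nexus"]),
    (lambda e: e["class"] in ("Application", "Security"),
     ["speedbird"]),
]
DEFAULT_ROUTE = ["tempus"]

def route(event):
    for predicate, destinations in ROUTES:
        if predicate(event):
            return destinations
    return DEFAULT_ROUTE

print(route({"class": "Availability", "zone": "core"}))  # ['tempus', 'nexus']
print(route({"class": "Security", "zone": "edge"}))       # ['speedbird']
```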
What About the Machine Learning?
All of our sigalisers use machine learning in some form, whether via an unsupervised clustering technique alongside a fuzzy matching algorithm, or algorithms that learn from user interaction.
“Tempus,” “Nexus,” and “Speedbird” are all examples of sigalisers that rely exclusively upon unsupervised machine learning.
Tempus correlates alerts based on time, grouping alerts with similar event arrival patterns. At its core are community-detection algorithms borrowed from the world of graph theory. Tempus requires only a single piece of data for its operation: the event arrival time. It takes no account of, and has no need for, any other event attributes. The sweet spot for Tempus is availability-related failure scenarios in which all the different failure events are likely to be coincident in time.
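To make the idea concrete, here is a rough sketch in the spirit of Tempus (not its actual algorithm): alerts that arrive within the same short window are linked in a graph, and a community-detection algorithm (networkx’s greedy modularity implementation here) pulls out the groups. The window size and alert data are assumptions.

```python
# A rough sketch of time-based correlation in the spirit of Tempus (not the
# real algorithm): link alerts whose arrival times fall close together, then
# run community detection over the resulting graph.

import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# (alert_id, arrival_time_in_seconds) -- illustrative data
arrivals = [("a1", 0), ("a2", 2), ("a3", 3), ("a4", 120), ("a5", 121)]
WINDOW = 10  # assumed coincidence window in seconds

graph = nx.Graph()
graph.add_nodes_from(alert_id for alert_id, _ in arrivals)
for (id1, t1), (id2, t2) in itertools.combinations(arrivals, 2):
    if abs(t1 - t2) <= WINDOW:
        graph.add_edge(id1, id2)

for community in greedy_modularity_communities(graph):
    print(sorted(community))  # e.g. ['a1', 'a2', 'a3'] and ['a4', 'a5']
```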
For topology-based use cases the sigaliser of choice is “Nexus,” and so perhaps unsurprisingly, it requires access to a topology database. Nexus clusters alerts based on where they are in the network, and can only cluster events from entities within that topology.
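As an illustration of topology-based grouping (again a sketch, not the Nexus implementation), alerts can be grouped whenever their source nodes sit within a small number of hops of each other in the topology graph. The topology, hop limit, and alerts below are assumed.

```python
# A sketch of topology-based grouping in the spirit of Nexus (not the real
# implementation): alerts are grouped when their source nodes are within a
# small number of hops of each other in the topology. Data is illustrative.

import networkx as nx

topology = nx.Graph([
    ("core-1", "dist-1"), ("core-1", "dist-2"),
    ("dist-1", "access-1"), ("dist-2", "access-2"),
])
alerts = {"a1": "access-1", "a2": "dist-1", "a3": "access-2"}
MAX_HOPS = 1  # assumed proximity threshold

# Build a graph over alerts, connecting those whose nodes are close in topology.
alert_graph = nx.Graph()
alert_graph.add_nodes_from(alerts)
for a1, n1 in alerts.items():
    for a2, n2 in alerts.items():
        if a1 < a2 and nx.shortest_path_length(topology, n1, n2) <= MAX_HOPS:
            alert_graph.add_edge(a1, a2)

for group in nx.connected_components(alert_graph):
    print(sorted(group))  # ['a1', 'a2'] together, ['a3'] on its own
```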
“Speedbird” uses contextual similarity as its correlation criterion, grouping events based on the similarity of one or more event attributes such as description, severity, or any other data enriched into the alert.
Both Speedbird and Nexus utilise a proprietary unsupervised clustering engine based upon the well-known ‘k-means’ algorithm. One of the perennial challenges with k-means clustering is the need to supply a value for ‘k’, the number of clusters the algorithm looks for. AIOps uses a patented way of determining that value, so it can automatically adapt to the inbound event data.
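The patented method for determining ‘k’ is Moogsoft’s own and is not shown here; purely to illustrate the general idea, the sketch below vectorises alert descriptions, tries several candidate values of k, and keeps the one with the best silhouette score. The data and feature choices are assumptions.

```python
# A sketch of contextual-similarity clustering: vectorise alert text, run
# k-means for several candidate values of k, and keep the k with the best
# silhouette score. This illustrates the general idea only; the patented
# selection of k used in the product is not shown here.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

descriptions = [
    "disk full on db-01", "disk nearly full on db-02",
    "login failures from 10.0.0.5", "login failures from 10.0.0.9",
    "high latency on payments api", "payments api timeouts",
]
features = TfidfVectorizer().fit_transform(descriptions)

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, len(descriptions)):  # candidate cluster counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    score = silhouette_score(features, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

print("chosen k:", best_k)
for text, label in zip(descriptions, best_labels):
    print(label, text)
```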
At the opposite end of the machine learning spectrum is our “Feedback” sigaliser. While Tempus, Nexus, and Speedbird use unsupervised learning, Feedback uses supervised learning. It looks at the actions performed by an operator during the incident resolution process: Did the operator need to remove outlier alerts? Were there related alerts that needed to be added to the situation? Was the situation given a 5-star rating, or marked as being of low quality? These are the training triggers that AIOps can take and use to inform the creation of future situations.
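A rough way to picture this, using hypothetical action records rather than any real API, is to turn each operator action into a labelled training example for the next round of situation building.

```python
# A sketch of turning operator actions into supervised training signals.
# The action records and label scheme are hypothetical, for illustration only.

def label_from_action(action):
    """Map an operator action on a situation to a training label:
    +1 means 'this grouping decision was right', -1 means 'it was wrong'."""
    if action["type"] == "remove_alert":    # alert was an outlier
        return (action["situation"], action["alert"], -1)
    if action["type"] == "add_alert":       # alert should have been included
        return (action["situation"], action["alert"], +1)
    if action["type"] == "rate_situation":  # whole-situation quality signal
        return (action["situation"], None, +1 if action["stars"] >= 4 else -1)
    return None

actions = [
    {"type": "remove_alert", "situation": "S1", "alert": "a7"},
    {"type": "add_alert", "situation": "S1", "alert": "a9"},
    {"type": "rate_situation", "situation": "S1", "stars": 5},
]
training_examples = [lbl for lbl in map(label_from_action, actions) if lbl is not None]
print(training_examples)
```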
Pure unsupervised and supervised techniques provide great utility, but there are circumstances in which they may not be the optimal choice. As we’ve seen in the three other posts in this series, the power of unsupervised techniques is to find hidden structure within a dataset. But what if that hidden structure doesn’t align with your management strategy all of the time? Tempus and Speedbird are highly capable of generating seed situations that Feedback can refine, but as we learned in earlier posts, these are statistical methods that rely upon the quality of the training data they are provided with. “Garbage In, Garbage Out,” as the saying goes.
This is where ACE fits in. One perspective on ACE is that it is a hybrid method somewhere between supervised and unsupervised learning. The thing is, by the letter of the definitions provided in data science textbooks, ACE isn’t really supervised learning — we don’t provide the system with labelled data — but at the same time it is so much more than a pure unsupervised technique. We can give ACE hints on how we want it to behave, so “guided learning” or “guided intelligence” are perhaps better descriptions of its capabilities.
ACE uses unsupervised clustering algorithms bespoke to AIOps. At its heart is a novel streaming-based clustering algorithm, but the criteria used to assign an event to a cluster, or situation, are controlled by simple, similarity-based directives. These directives use fuzzy matching techniques to find the most appropriate group of events to match an incoming event with, something that is impossible to do with a standard unsupervised clustering algorithm.
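As an illustration of that directive-driven assignment (a sketch under assumptions, not the ACE algorithm itself), an incoming event can be fuzzy-matched against each open situation on a chosen attribute and joined to the best match above a threshold, otherwise starting a new situation.

```python
# A sketch of directive-driven, streaming assignment in the spirit of ACE (not
# the real algorithm): fuzzy-match an incoming event against open situations
# on a chosen attribute and join the best match above a threshold, otherwise
# start a new situation. The directive and threshold are assumptions.

from difflib import SequenceMatcher

DIRECTIVE = {"attribute": "description", "threshold": 0.6}

def assign(event, situations):
    """situations: list of lists of events already grouped together."""
    best, best_score = None, 0.0
    for situation in situations:
        representative = situation[0][DIRECTIVE["attribute"]]
        score = SequenceMatcher(None, event[DIRECTIVE["attribute"]],
                                representative).ratio()
        if score > best_score:
            best, best_score = situation, score
    if best is not None and best_score >= DIRECTIVE["threshold"]:
        best.append(event)
    else:
        situations.append([event])

situations = []
for description in ["disk full on db-01", "disk full on db-02",
                    "bgp peer down on edge-3"]:
    assign({"description": description}, situations)

for situation in situations:
    print([e["description"] for e in situation])
# ['disk full on db-01', 'disk full on db-02']
# ['bgp peer down on edge-3']
```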
And One More Thing…
We have tracked our alerts through the management process: they have been ingested and deduplicated, entropy-based noise reduction has allowed the meaningless alerts to be discarded, and alerts meeting the requisite correlation criteria have been grouped into situations. But there’s one more application of AI to consider in the fight against alert fatigue — a feature called “Probable Root Cause,” or PRC.
A situation may contain only a handful of alerts, or it may contain many tens or even hundreds. And of course, while AIOps has removed the meaningless events and collapsed potentially thousands of events into a single actionable incident, operators still need to know which alert to fix first, and that’s where Probable Root Cause comes in. PRC is the call to action, the way to inform operators of the alert they need to focus their attention on.
Probable Root Cause is an application of supervised machine learning, specifically a classification problem solved with a neural network.
From the feedback provided by operators during the incident resolution process, AIOps learns which alerts are root causes, and the circumstances in which they occur. It learns the circumstances under which a humble ping fail is the root cause, but also, when different alerts have been triggered, that the same ping fail is now only a symptom — a symptom of a power supply failure, for example.
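The sketch below shows the shape of that classification problem using a small neural network (scikit-learn’s MLPClassifier) over a few illustrative alert features; the features, labels, and data are assumptions, not the PRC model itself.

```python
# A sketch of the root-cause classification problem using a small neural
# network (scikit-learn's MLPClassifier). Features, labels, and data are
# illustrative assumptions, not the PRC model itself.

from sklearn.neural_network import MLPClassifier

# Features per alert within a situation (all illustrative):
# [is_ping_fail, is_power_alert, count_of_related_alerts, severity]
X = [
    [1, 0, 0, 3],   # lone ping fail, nothing else going on
    [1, 1, 6, 3],   # ping fail amid a power-failure situation
    [0, 1, 6, 5],   # the power supply alert itself
    [1, 0, 1, 2],
    [0, 1, 4, 5],
]
# Labels derived from operator feedback: 1 = marked root cause, 0 = symptom
y = [1, 0, 1, 1, 1]

model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)

# Score a new ping-fail alert that arrived alongside a power alert.
print(model.predict_proba([[1, 1, 5, 3]]))  # [P(symptom), P(root cause)]
```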
By utilising supervised learning, PRC isn’t dependent upon the static behavioural models that legacy root cause analysis systems rely upon — behavioural models that are slow to adapt to rapidly changing virtualized infrastructures, models that can’t handle common infrastructure implementations such as overlay networks. PRC is not constrained to a predetermined model of an infrastructure or the device types therein, and importantly, it can learn the way an organisation chooses to manage its infrastructure.
Significantly, PRC also has a capability that other approaches to root cause analysis simply cannot hope to emulate. It can tap into the tribal knowledge buried in the minds of your operators, the knowledge built up over years of managing your infrastructure, knowledge that remains in your organisation even when your key personnel move on to pastures new. Not only does PRC, AIOps’ machine-learning approach to root cause analysis, contribute to the fight against alert fatigue, but it also informs the remediation process.
This brings to a close our series on “Understanding the Machine Learning in AIOps” and how we use it to improve the efficiency of your management processes. The power of machine learning is unquestionable, but it isn’t a silver bullet — different algorithms have different sweet spots. That applies across the board, whether you are using machine learning or an algorithm based on pure logic. And that is where the art lies, and what AIOps achieves — knowing the optimal approach to solving a problem so you don’t need to.