Knowing where to focus your resources in a time-critical environment is key to achieving a successful outcome. When your primary IT services are performing badly — or worse still, suffering from a complete failure — you want your resolution team spending more time fixing the problem and less time identifying it and searching for the needle in the haystack.
Moogsoft AIOps uses a suite of algorithms that combine to allow your team to speed up the resolution process so your services get back online sooner. Some of those techniques make it easier to know what to work on first; they give you directions to where the needle is, if you will.
Other algorithms work at the other end of the problem, reducing the volume of data that needs to be sifted through, making the haystack smaller. The key step in this part of the process is Noise Reduction: the ability to remove unimportant events, reducing the volume of data that the Incident.MOOG algorithms need to analyze and, more importantly, the volume of data that the people actually fixing your outage need to look at.
Reducing the Noise
For many years the concept of noise reduction began and ended with deduplication: every time a repeat event is encountered, you increment a counter on the parent alert, and discard the repeated event. Hundreds of ping fail events collapse to a single alert. Simple and effective, but no longer sufficient when managing modern systems.
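As a sketch, the classic mechanism looks something like this (the class and field names here are illustrative, not part of any Moogsoft API):

```python
# A minimal sketch of classic deduplication: a repeated event increments a
# counter on the parent alert and is then discarded. All names here are
# illustrative, not part of any Moogsoft API.
from dataclasses import dataclass

@dataclass
class Alert:
    signature: str      # identity key, e.g. "host42:ping_fail"
    description: str
    count: int = 1      # number of times the event has been seen

class Deduplicator:
    def __init__(self) -> None:
        self.alerts: dict[str, Alert] = {}

    def ingest(self, signature: str, description: str) -> Alert:
        alert = self.alerts.get(signature)
        if alert is not None:
            alert.count += 1    # repeat event: bump the counter...
            return alert        # ...and discard the duplicate
        alert = Alert(signature, description)
        self.alerts[signature] = alert
        return alert

# Hundreds of ping-fail events collapse to a single alert:
dedup = Deduplicator()
for _ in range(300):
    dedup.ingest("host42:ping_fail", "ICMP ping failed for host42")
print(dedup.alerts["host42:ping_fail"].count)   # -> 300
```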
In Incident.MOOG, the concept of noise reduction still begins with deduplication (it is, after all, a wonderfully simple idea), but Incident.MOOG goes a lot further. Every event that enters Incident.MOOG is analyzed and assigned a numerical value that indicates how important the event is within the context of the rest of the system. In Incident.MOOG we call this attribute Entropy. The higher an alert's Entropy, the more important it is; the lower the Entropy, the less important it is. High Entropy events are the needles, the things to examine first; low Entropy events are the ones that can be safely ignored: the noise, the haystack. Even with a basic Entropy threshold, a large proportion of inbound events can be ignored because they contain no useful information, and, importantly, none of the events that need remedial action are lost.
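A minimal sketch of what such a threshold does, assuming a 0-to-1 entropy scale; the event shape and the cutoff value are illustrative, not product values:

```python
# A minimal sketch of entropy-threshold noise reduction. The 0-to-1 entropy
# scale, the event shape, and the cutoff value are illustrative assumptions.
def filter_noise(events: list[dict], threshold: float = 0.3) -> list[dict]:
    """Keep only events whose entropy clears the threshold."""
    return [e for e in events if e["entropy"] >= threshold]

events = [
    {"text": "process heartbeat ok",          "entropy": 0.05},  # noise
    {"text": "CPU at 48% (no threshold set)", "entropy": 0.10},  # noise
    {"text": "disk array failure on db-01",   "entropy": 0.92},  # needle
]
actionable = filter_noise(events)
print([e["text"] for e in actionable])   # -> ['disk array failure on db-01']
```

The haystack shrinks before any human, or any downstream algorithm, has to look at it.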
What is Entropy?
Entropy is a term used in a variety of scientific and engineering fields, with its roots in thermodynamics. In the field of information theory, entropy, or ‘Information Entropy’ as it is more formally known, is a concept created by Claude Shannon. There is a story that Shannon was discussing his theories about “lost information” with John von Neumann and asked what he should call his new concept. Von Neumann is said to have replied:
“You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”
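For reference, Shannon's entropy of a discrete random variable X, whose outcomes x_i occur with probability p(x_i), is defined as:

```latex
H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)
```

The rarer an outcome, the more information its occurrence carries; frequent, predictable outcomes carry almost none. This intuition underpins the use of entropy for event streams described below.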
What Entropy Really Is…
Put yourself in the place of your network operations engineers in a world where deduplication is the only mechanism for noise reduction. You still see thousands of alerts every day. Through experience and tribal knowledge you know which alerts are of no consequence and can be safely ignored: the process heartbeat messages, the polled but unthresholded CPU utilization messages, the temporary network connectivity failures. None of these alerts needs remedial action; they contain little useful information; they have low Entropy. But can these alerts be distinguished from the important, actionable events? The failure of the disk array on your DB cluster, for example, can't be ignored; that needs action.
In Incident.MOOG, the Entropy of an alert has multiple components: What is the text of the alert? When does it appear, and how often? Where does it come from? We call these components the Semantic, Temporal, and Topological Entropies, and they combine to form an overall measure of Entropy for the alert.
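Before unpacking each component, here is one way to picture the combination. This is illustrative only: the weights, the 0-to-1 scale, and the blending function are assumptions, not the actual Incident.MOOG calculation.

```python
# Illustrative only: how three per-alert entropy components might blend into
# a single score. The weights and the 0-to-1 scale are assumptions, not the
# actual Incident.MOOG calculation.
def overall_entropy(semantic: float, temporal: float, topological: float,
                    weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    ws, wt, wp = weights
    return ws * semantic + wt * temporal + wp * topological

# A chatty heartbeat: common text, clockwork arrival, dev server.
print(overall_entropy(0.1, 0.1, 0.2))   # low  -> noise
# A disk-array failure: rare text, unpredictable timing, core DB cluster.
print(overall_entropy(0.9, 0.8, 0.9))   # high -> needle
```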
Semantic Entropy is derived using Natural Language Processing techniques. Words and phrases are assigned a score according to how common or rare they are, and combining those scores gives a value for how much information is contained in the text of the message. But that's not the whole story. An alert that always contains the same text and that appears every few hours carries far less meaning than a similar alert that appears once every few days. This is where Temporal Entropy comes in: randomly occurring alerts carry more meaning than frequent, regularly occurring ones. Finally, there is Topological Entropy, a measure of importance derived from where in your network an alert originates. Is the alert from a development server, or from a switch at the core of the network? An alert from the former is likely to have a lower Topological Entropy than one from the latter.
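To make the three ideas concrete, here is a sketch under stated, simplified assumptions: Semantic Entropy as the average Shannon information content of an alert's words, Temporal Entropy as the irregularity of its inter-arrival times, and Topological Entropy as a per-node importance lookup. None of this is the product's actual math.

```python
# Sketches of the three components. Each formula is an illustrative stand-in
# for the more sophisticated calculations the product actually performs.
import math
from statistics import mean, pstdev

def semantic_entropy(text: str, word_freq: dict[str, float]) -> float:
    """Average Shannon information content (-log2 p) of the alert's words.
    word_freq maps each word to its relative frequency across all alerts;
    rare words contribute more information than common ones."""
    words = text.lower().split()
    if not words:
        return 0.0
    return mean(-math.log2(word_freq.get(w, 1e-6)) for w in words)

def temporal_entropy(arrival_times: list[float]) -> float:
    """Irregularity of arrivals: the more the gaps between occurrences vary,
    the more 'random' (and informative) the alert. 0.0 for clockwork repeats."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return 0.0
    return pstdev(gaps) / mean(gaps)    # coefficient of variation of the gaps

def topological_entropy(source: str, node_weight: dict[str, float]) -> float:
    """Importance of the alert's origin in the network topology, from a
    hypothetical per-node weighting (core switch high, dev server low)."""
    return node_weight.get(source, 0.5)  # unknown nodes get a middling score
```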
Of course, there is some fairly complex math going on behind the scenes to calculate the values for the different types of entropy. But the underlying concept of Alert Entropy is a simple and incredibly powerful model for noise reduction: far more powerful and sophisticated than the straightforward act of deduplication, and far more relevant to modern IT operations.