AIOps, so what is the big deal?
I have heard some less well-informed people describe AIOps as a “nothing-burger”, which is so far from the truth as to be laughable. The pressure to avoid outages in today’s post-digital-transformation world has never been higher, and if you think that legacy tools can handle real-time root-cause assessment in the modern cloud-native applications we all depend upon, you need to get “back to the future”!
The fundamental reason to bring AIOps into your life is to improve your customer experience. AIOps does this by using artificial intelligence and machine learning to boost your application performance and availability. That’s quite a “burger”, wouldn’t you say? And, just to put a number on it, for Amazon a single minute of downtime represents $214,992 of revenue, a number sure to only get bigger. Given that Amazon is basically a web application made up of thousands, perhaps millions, of microservices and component parts that change and swirl second to second, trying to keep track of that with rules or manual searching of logs is an impossible task. Whatever the scale of your digital presence, the challenge for you will be no different. AIOps is the automation boost for your existing monitoring tools that makes all of that manual work for your DevOps and IT Ops teams just go away.
I’m sure you need convincing; after all, burgers aren’t for everyone! So let’s step through a handful of use cases and see how this all works in action!
1. Noise Reduction: Millions and billions of metrics and alerts – where’s the outage?
The simple fact of monitoring is that we over-instrument our applications and infrastructure but do not build for observability. What that actually means is that when you get an alert, event, log message, or metric, you have no way of knowing whether it is important. With such large volumes of data, performing event correlation or anomaly detection is no easy task. This is precisely where the AI in AIOps helps (spoiler alert, most AIOps vendors don’t actually have any AI). Whereas legacy rules-based systems scale badly and groan under high load, true AIOps platforms actually improve with more data. The models, algorithms and science in Moogsoft’s AIOps solution are built for scale and improve with it. AI and big data go together.
So how does this work? By looking for anomalous patterns, you can detect when a subset of the observability data you are receiving is out of the ordinary and begin the process of problem remediation. You do not need to anticipate in advance which events, metrics, logs, or traces could change; you just look for the change. The power of this approach is that the system can sometimes get to the root cause before the user notices the website is down and starts hitting up your competitor!
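To make "just look for the change" concrete, here is a minimal sketch of one common way to spot an out-of-the-ordinary data point: compare each new value with the mean and standard deviation of the values just before it. This is an illustrative stand-in, not Moogsoft's algorithm; the function name `find_anomalies`, the window size, the threshold and the sample CPU readings are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of points that deviate sharply from the preceding window.

    Hypothetical helper: each point is scored against the mean and sample
    standard deviation of the `window` values that came before it.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is out of the ordinary.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU utilisation samples (percent) with one obvious spike.
cpu = [41, 40, 42, 41, 39, 40, 41, 95, 42, 40]
print(find_anomalies(cpu))  # → [7]
```

Nothing here says in advance which metric matters or what a bad value looks like; the baseline comes from the data itself, which is the point the paragraph above is making.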
2. Early Detection: Action now or embarrassment later!
Most AIOps systems only handle events or metrics. This is nuts! Systems simply do not fail out of the blue. More often than not, what starts as CPU or a similar metric looking odd compared to historical data soon becomes a red light in your alerting, and of course, change events are often a culprit. Time series, changes and alerts go together like ham and eggs, and in the Moogsoft platform, we bring all three together! This allows us to pick up the early signs of problems with automated anomaly detection in metrics, and with tight integration to build systems via our APIs for changes, we know when builds are pushed or maintenance is being done. Better than that, because our correlation doesn’t rely on rules, we evolve our incident understanding as more evidence from changes, alerts and metrics arrives, rather than having to wait for all the symptoms to be there before we declare a problem. This can shave valuable (and we mean valuable) minutes off the problem management process.
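The idea of an incident that grows as evidence arrives can be sketched in a few lines. The example below is deliberately naive: it groups signals by service and time proximity, which is itself a simple rule and therefore not how a rules-free correlation engine such as Moogsoft's works. The `Signal` and `Incident` types, the `correlate` function and the five-minute window are all hypothetical, chosen only to show changes, metrics and alerts landing in the same incident.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    timestamp: float  # seconds since some reference point
    service: str
    kind: str         # "change", "metric" or "alert"
    detail: str


@dataclass
class Incident:
    service: str
    signals: list = field(default_factory=list)

    @property
    def last_seen(self):
        return self.signals[-1].timestamp


def correlate(signals, window_seconds=300):
    """Attach each signal to the open incident for its service, or open a new one.

    An incident stays open while new evidence for the same service keeps
    arriving within `window_seconds` of the previous signal.
    """
    incidents = []
    open_by_service = {}
    for sig in sorted(signals, key=lambda s: s.timestamp):
        incident = open_by_service.get(sig.service)
        if incident is None or sig.timestamp - incident.last_seen > window_seconds:
            incident = Incident(service=sig.service)
            incidents.append(incident)
            open_by_service[sig.service] = incident
        incident.signals.append(sig)
    return incidents


# Made-up stream: a deploy, then a CPU anomaly, then an alert on the same service.
stream = [
    Signal(0, "checkout", "change", "build pushed"),
    Signal(120, "checkout", "metric", "CPU anomaly"),
    Signal(240, "checkout", "alert", "latency threshold breached"),
    Signal(250, "search", "alert", "disk usage warning"),
    Signal(2000, "checkout", "alert", "certificate expiring"),
]
for incident in correlate(stream):
    print(incident.service, [s.kind for s in incident.signals])
```

The first incident exists from the moment the deploy is seen and simply gains evidence as the metric anomaly and the alert arrive, rather than waiting for every symptom before being declared.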
3. Root Cause Analysis: The building is on fire, which flames do I douse?
You know the old saying, change is the only constant. Well, back in the day, change happened at a genteel pace over weeks and months. This is no more! A typical SaaS service is built using DevOps techniques like CI/CD (Continuous Integration/Continuous Delivery), which means change is continuous. Even here at Moogsoft, our engineering teams upgrade every instance of our customers’ production platforms dozens of times a day with zero impact on user experience – they literally don’t notice.
With all of that change and complexity, knowing where to start and what to fix when things go wrong is key. You need to know which fire to fight first, not engage in time-consuming diagnosis or try to work out who you need to call on in your operations teams. AIOps changes all of that.
At Moogsoft, we have deployed over 40 patents into the science behind our correlation. Because of our unique approach to correlation, we can automatically add context to alerts, metrics and changes, indicating which services are suffering. Then, using correlation that spots patterns in data a human could miss, we connect symptoms from seemingly disconnected sources and bring the evidence to your teams as it happens, not after the fact!
4. Automated Incident Response: Is it possible for my IT teams to die of alert fatigue?
Well, no, but they may up and quit on you. What is being called the great resignation by Harvard Business Review highlights that burnout in the tech industry is behind unprecedented levels of skilled individuals changing jobs.
It is a fact that monitoring systems produce a never-ending stream of alerts that your DevOps teams must process manually, adding unproductive minutes to the MTTR of your average incident. AIOps tools, such as Moogsoft, use machine learning algorithms, and in our case information science, to automatically choose what to do with an incident once we have detected it. This can include recognizing a non-issue and quietly auto-resolving it before unleashing the team, routing automatically to the correct team of tech ninjas who just know how to get the app back on track, or kicking off a trusty remedy for the problem in orchestration systems.
So instead of paging your teams for every false alert and badly correlated incident, you can be sure that when Moogsoft creates that notification, having dealt with most of the issues automatically, it’s worth the response. And we know that fixing a real problem is what the SRE and DevOps teams really know how to do…
So, you can see AIOps turns alert fatigue into job satisfaction. That’s the fastest way for you to buck the trend in the great resignation!
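The three outcomes described above (quietly auto-resolve a non-issue, kick off a known remedy, or route to the right team) can be pictured as a small decision function. This is a hand-written sketch under stated assumptions, not how Moogsoft decides: the `triage` function, the incident fields (`severity`, `cleared`, `signature`, `service`) and the runbook and team names are all invented for illustration, and a real platform would learn these decisions rather than hard-code them.

```python
def triage(incident, runbooks, team_for_service, default_team="on-call"):
    """Pick one of three hypothetical actions for a detected incident."""
    # 1. Non-issue: low severity and already cleared, so nobody gets paged.
    if incident["severity"] == "info" and incident["cleared"]:
        return {"action": "auto_resolve"}

    # 2. Known problem: hand the matching runbook to the orchestration system.
    runbook = runbooks.get(incident["signature"])
    if runbook is not None:
        return {"action": "remediate", "runbook": runbook}

    # 3. Everything else: route to the team that owns the affected service.
    team = team_for_service.get(incident["service"], default_team)
    return {"action": "route", "team": team}


# Made-up lookup tables and incident.
runbooks = {"disk_full": "rotate-logs"}
team_for_service = {"checkout": "payments-sre"}

incident = {
    "severity": "critical",
    "cleared": False,
    "signature": "latency_spike",
    "service": "checkout",
}
print(triage(incident, runbooks, team_for_service))
# → {'action': 'route', 'team': 'payments-sre'}
```

Only the third branch ever pages a human, which is how the stream of notifications shrinks to the incidents that are worth a response.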
5. Single Pane of Glass: All roads lead to Oz, but which one gets me to the wizard fastest?
Oftentimes, the mushrooming of different tools to manage different parts of your services can make the day job of an SRE team seem like an escape room. Incident management so often begins with choosing the right console to start the diagnosis.
Because AIOps, if done well, is pretty agnostic when it comes to the source of data, it can become your principal system of engagement. With flexible APIs that tightly integrate the system with literally thousands of sources of changes, alerts and metrics, while automatically updating incident management systems, you can transact every incident from start to finish without leaving your AIOps tool. At Moogsoft we pioneered collaborative incident management, and this is key in avoiding the confusion that comes with tool proliferation. Now, we can’t promise a wizard behind the curtain (spoiler, our wizardry is in the algorithms), but our tool will guarantee you get home faster than Dorothy as you whizz through any outages that you encounter!
So there it is, the A-B-C of AIOps use cases. Now for sure, there are many more, and when Gartner coined the term AIOps as shorthand for Artificial Intelligence for IT Operations, they certainly could not have anticipated the impact it would have on the user experience of us all in the digitally transformed world we live in.
So much for the “nothing-burger”! AIOps may just be the only double-triple, animal-style whopper that’s actually good for you!