It is a truism to say that most decisions in business are driven by financial concerns. The value of a business is determined by the margin between the cost of doing business and the price that customers will pay. In turn, businesses invest to improve that margin: to serve more customers better, to offer them new services, and to do so sustainably.
IT is no different. Although IT is a fairly high-profile function, investments in it still need to satisfy the same cost/benefit calculations as investments in every other aspect of the business.
In the early stages of discussing a potential AIOps project, I always try to understand what the person sitting across the table is actually trying to achieve. When I ask why they are planning to deploy AIOps, their first answer is invariably to reduce cost – specifically, the cost of operating IT services.
Cost reduction is a valid goal. If we make IT cheaper, we can do more of it and do it better. We can give end users and the business more of what they need. However, there are alternative ways to articulate our primary objective, and each of these leads to radically different outcomes.
A few examples of AIOps drivers include:
- Avoiding critical IT incidents
- Keeping IT operations humming
- Supporting business growth
Let’s examine each of these.
1. Avoiding Critical IT Incidents
Businesses are utterly reliant on IT, and any disruption to IT services will have a huge impact on the business. Try visiting any office when the power is out or the internet is down. Even in organizations that still produce a lot of paperwork, all of those printers are reliant on distant servers and the networks that connect them.
Consequently, the cost of a major IT incident can be massive. Moogsoft co-founder Mike Silvey tells a story about how Moogsoft AIOps averted an outage at a major airline:
Last year, an American airline suffered a six-hour outage. Its ground-handling software failed, and it could not schedule a single flight for that period. This year they have Moogsoft. Our software detected an incident that they could act on earlier, and then resolved before it impacted their ground-handling system. Detecting that previously unknown situation prevented an outage of flights across the U.S. that would have lasted a minimum of 4.5 hours.
It’s worth remembering that any projected impact is probably still a conservative estimate, as it does not attempt to quantify the long-term brand impact. People often prove unwilling to become or remain customers of a brand or business that they perceive as unreliable.
Avoiding even ONE major incident of the kind the airline suffered can therefore pay dividends: enough to pay for quite a lot of AIOps software (shameless plug alert!).
2. Keeping IT Operations Humming
Even absent major incidents, it’s still not cheap to run IT Operations at a major enterprise. Ever since there have been computers, there has been a drive to automate and standardize IT work in order to reduce running costs. Do some of these past automation drives sound familiar?
“It takes a long time to install a physical server from scratch, so let’s figure out a way to boot from the network instead, pulling standardized images from a central location.”
“Say, what if we virtualized the entire server and spun it up only when the application needs it?”
“Even better, let’s architect applications to be as independent from the underlying compute infrastructure as possible.”
Despite all of this progress, there’s still the need for a room of people staring at screens, watching for something to turn red or a system to do something anomalous. All of those people and computers and networks and virtual servers cost money. Being able to reduce all of these associated costs – without causing any IT incidents, of course – would be a big win.
3. Supporting Business Growth
IT is always ready to fight the last war. We are forever prepared to prevent a recurrence of the last incident. What IT is not always so well prepared for is change. The marketing department launches a new advertising campaign and enjoys a 50 percent increase in website traffic – but oops, we forgot to tell IT! That sort of event can be a minor annoyance or a complete catastrophe, depending on how prepared the IT team and its systems are for unforeseen curveballs.
IT Cost as a Metric, Not a Goal
Here’s the key point to bear in mind. Cost reduction is the main driver in precisely none of the three key AIOps scenarios outlined. The actual objectives are stability, efficiency, and agility. Cost is relegated to a critically important metric, that is, a way to measure the achievement of these objectives but not the goal in and of itself.
That being said, it is vitally important to have a metric, i.e. a way to measure what is to be achieved by any AIOps project, and whether that goal has been reached. But it is also key not to confuse the goal with the metric.
Why? Because some of these goals are going to be difficult to quantify. Others may lead to short-term decisions that do not adequately consider the long-term results and impacts.
Why Tracking IT Cost Reduction Is Not So Easy
Incident avoidance is one obvious AIOps use case. But how do we quantify incidents that never occur? What if the incident we thought avoided was never going to happen anyway? It’s very rare to see a clear-cut situation where an incident was absolutely going to occur. The best that we can normally say is that, based on prior experience, the probability that an incident would have occurred is high. Correctly apportioning the blame and praise for its occurrence (or its lack thereof) is liable to be a fraught exercise.
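One hedged way to put a number on "an incident that probably would have occurred" is a simple expected-value estimate: multiply the estimated probability that the incident would have happened by its estimated impact. The sketch below is only an illustration of that arithmetic. The function name, the parameters, and every figure in it are hypothetical assumptions, not data from the airline story or from any Moogsoft product.

```python
def expected_avoided_cost(probability, outage_hours, cost_per_hour):
    """Expected cost avoided = P(incident would have occurred) x estimated impact.

    All three inputs are estimates supplied by the reader (hypothetical here).
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * outage_hours * cost_per_hour


# Hypothetical placeholder figures: a 4-hour outage at 100,000 per hour,
# with the probability of occurrence estimated somewhere between 50% and 75%.
low = expected_avoided_cost(probability=0.5, outage_hours=4, cost_per_hour=100_000)
high = expected_avoided_cost(probability=0.75, outage_hours=4, cost_per_hour=100_000)
print(f"Estimated avoided cost: {low:,.0f} to {high:,.0f}")
```

Reporting a range rather than a single figure keeps the uncertainty about the probability visible, which is the crux of why this kind of saving is hard to claim outright.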
Tracking the daily cost of IT Operations would seem to offer a better chance of accurate “before and after” comparisons. The difficulty here is more a human one, however. The main way to reduce routine running costs is to take people out of the system. I don’t know about you, but I wouldn’t sleep particularly well at night knowing I got a bunch of people fired. Luckily, that rarely happens. In many countries, labor laws prevent companies from eliminating people’s jobs in a hurry. As a European, I’m generally in favor of this idea! However, it does make it difficult to point to substantial savings in salaries as justification for an AIOps investment. Moving people to other departments (and their associated cost centers) is harder to recognize as “savings”.
So it’s better to focus on goals like stability, efficiency, and agility. The idea is to make the IT operations as predictable and invisible as possible, so that business functions that rely on IT support can easily factor it into their own planning.
One unglamorous parallel here is the Facilities department. It is able to provide whatever physical building services are required for all employees, given some fairly minimal high-level requirements. The Facilities team is not expected to go with lowest cost bidders every time. No one expects them to mix their own cement or lay their own carpet. Depending on the physical plant requirements, a variety of different approaches may be appropriate, as long as they satisfy their ultimate goals. IT Operations should operate in the same way.
For almost every enterprise out there, IT is no longer a daring cutting-edge innovation, but routine table stakes. Everyone has IT, so the competition has moved up to the next level. Namely, whoever can operate their IT systems better, wins. This means ensuring that they are stable, efficient, and able to react with agility to whatever may be required.
Putting the Emphasis on “Ops” in AIOps
As a field, AIOps is pretty cutting edge. The “AI” part is the giveaway there. It’s true that there’s a lot of overheated hype around these two little letters, causing some justified skepticism. However, there is some very interesting work being done on Moogsoft algorithms, specifically on the Moogsoft AIOps Platform’s ability to sift faint signals from enormous amounts of IT operational noise, identify correlations between seemingly disconnected events, and assemble cross-disciplinary teams to respond to identified IT incidents.
However, all of this is in service of the fairly prosaic goals of continuous assurance of IT operations, which have not changed much since the dawn of computing. Namely, to keep everything running, to avoid outages, and to not spend too much doing it!
AIOps projects are only worth getting into if they satisfy one or more of the basic business needs of stability, efficiency, and agility. Cost is a good way to measure the size and success of an AIOps project, but it’s only one metric. Don’t let that obscure the actual, higher goals that you are supposed to be working toward.