How AIOps Can Enhance ITIL

An ITIL Primer

There is an assumption out there, especially in the world of DevOps and Agile, that ITIL and ITSM are obsolete topics that are no longer relevant in this enlightened age. I think this is a short-sighted view. As George Santayana famously said, “Those who cannot remember the past are condemned to repeat it.” To paraphrase him, we could say that those who fail to understand the reasons for ITIL’s existence are condemned to reinvent it…poorly.

Recent events with helpdesk practitioners and conversations with leading lights of the ITIL world have brought ITIL and ITSM to mind once again. As it happens, I encountered Agile in the form of Extreme Programming nearly two decades ago, some time before I became aware of ITIL, so I have some perspective here.

At the time, I was the (very) junior member of a sysadmin team, responsible for a server room that supported the work of a couple hundred developers. This was before virtualization was widespread, and certainly before the horsepower to run VMs locally on engineers’ desktop machines was generally available.

There was constant tension between those of us in ITOps and (some of) the developers, as they would ask for (or even demand) root privileges on our systems. For our part, hard-earned experience made us very reluctant to grant this access, because restoring systems that developers had comprehensively wedged was an annoying and time-consuming chore in those days. It required the most junior member of the sysadmin team (that would be me) to trek to the tape vault across campus, and then spend long hours in a loud and freezing cold server room, feeding backup tapes to the jukebox. On the plus side, this experience did lead me to an early interest in automation and remote management, which would serve me well later on.

ITIL was created to address precisely these sorts of issues by defining best practices that organizations could use as a reference when implementing their own processes, and by helping to create defined standards that everyone could refer to. Its creators intended it to address people, processes, and technology, specifically in that order. Unfortunately, as it became more popular, the order was reversed: technology first, then rigid processes based on the tech, and people an afterthought (at best).

For a while, this was not too much of a problem, because the world of IT remained relatively static. Sure, there was frantic innovation in terms of faster processors and new operating systems, but the overall model of IT remained much the same. In programming terms, we could describe ITIL by analogy with the waterfall model, a sequential approach divided into phases:

System and software requirements: captured in a product requirements document
Analysis: resulting in models, schema, and business rules
Design: resulting in the software architecture
Coding: the development, proving, and integration of software
Testing: the systematic discovery and debugging of defects
Operations: the installation, migration, support, and maintenance of complete systems

The classic waterfall model describes software engineering, with ongoing operations and maintenance tacked on as an afterthought. ITIL applies the same sort of sequential thinking to operations, dividing it into constituent processes, mainly Service Delivery and Service Support, or nowadays Design, Transition, and Operation. In particular, Incident Management has its own sequence of steps:

Identification
Logging
Categorization
Prioritization
Diagnosis
Escalation
Resolution
Closure

As a best practice to use in designing real-world processes, this is pretty complete. As long as organizations avoid the trap of targeting compliance with ITIL as a goal in itself, this model works well to standardize the activity of incident management, and make it predictable and manageable.
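To make the strictly sequential nature of this model concrete, here is a minimal sketch of the eight steps as an ordered lifecycle. The names `IncidentStage` and `next_stage` are my own illustrative choices, not part of ITIL or of any ITSM tool.

```python
from enum import IntEnum


class IncidentStage(IntEnum):
    """The eight incident management steps, in the order listed above."""
    IDENTIFICATION = 1
    LOGGING = 2
    CATEGORIZATION = 3
    PRIORITIZATION = 4
    DIAGNOSIS = 5
    ESCALATION = 6
    RESOLUTION = 7
    CLOSURE = 8


def next_stage(stage: IncidentStage) -> IncidentStage:
    """Advance to the following step; a closed incident stays closed."""
    if stage is IncidentStage.CLOSURE:
        return stage
    return IncidentStage(stage + 1)


# Walk one incident through the whole lifecycle, one step at a time.
stage = IncidentStage.IDENTIFICATION
path = [stage.name]
while stage is not IncidentStage.CLOSURE:
    stage = next_stage(stage)
    path.append(stage.name)
print(" > ".join(path))
```

Each step can only hand off to the one after it, which is exactly the property that makes the model predictable, and also the one that starts to strain once incidents stop arriving in tidy single file.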

However, it does show its age in a few areas, partly because of software engineering’s own move away from monolithic waterfall processes, and toward more agile and iterative approaches. As the results of that shift start to affect more and more of IT Operations, it’s time for an equivalent evolution to happen when it comes to the production operation of systems and applications.

Identification: Known Failure States

The first area to look at is the very first step: identification that an incident is occurring. This can generally be divided into identification by users and by automated processes. Ideally, we want to avoid treating users as the first line of defense, but all too often, the first indication that reaches IT is when a user calls up or emails to report that something is wrong.

This happens because it is hard to detect a problem automatically. Historically, monitoring meant actively looking for known indicators of failure: resource usage exceeding a certain percentage, particular error messages appearing in logs, and so on. Each of these could then be investigated as an incident. As the complexity and the rate of change of an environment rise, this approach becomes less and less useful, as there are more failure indicators generated than can possibly be investigated by IT Operations staff.
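As a rough illustration of what "looking for known indicators of failure" amounts to, here is a minimal sketch of static threshold monitoring. The `Rule` type, the metric names, and the threshold values are all hypothetical; real monitoring tools differ in the details, but the shape is the same.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str         # human-readable description of the failure state
    metric: str       # key to look up in a metrics sample
    threshold: float  # value above which the rule fires


def failure_indicators(sample: dict[str, float], rules: list[Rule]) -> list[str]:
    """Return the name of every rule whose metric exceeds its threshold."""
    return [
        rule.name
        for rule in rules
        if sample.get(rule.metric, 0.0) > rule.threshold
    ]


# Hypothetical rules: every one had to be written in advance by a person.
RULES = [
    Rule("disk nearly full", "disk_used_pct", 90.0),
    Rule("cpu saturated", "cpu_used_pct", 95.0),
]

print(failure_indicators({"disk_used_pct": 93.5, "cpu_used_pct": 40.0}, RULES))
```

The weakness is visible in the code itself: the rule list only ever contains failure states somebody has already thought of, and every rule that fires becomes something for a person to investigate.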

Predictably Increasing Unpredictability

Another place where the old sequential models break down is the association of an incident with a single root cause. The assumption used to be that incidents would have one (and only one) root cause. If that was addressed, the incident would be resolved and would not recur. The rising complexity of IT infrastructure means that most incidents are not caused by a single failure, but by a number of different failures all occurring together. This, in turn, has another consequence, as increasingly specialized IT Operations teams struggle to assemble a coherent picture across technological and organizational boundaries.

Because all the simple, predictable cases have been addressed — either through structural changes or thanks to the recent ubiquity of distributed, fault-tolerant architectures — there are more and more incidents that have never been seen before, the causes of which do not fit neatly into any one area of responsibility. This can lead to duplication of tickets, to repeated reassignment of tickets, and to unnecessary escalations, all of which add to the duration, and therefore to the impact, of the issue.

Change Is The Only Constant

AIOps aims to update the model of IT for this new, highly changeable state of affairs by integrating new technology, updating and complementing existing processes, and putting people back at the center of IT Operations.

This is not a replacement for ITIL, but a complementary approach that helps organizations update their own processes for the new realities of enterprise IT.

One of the key changes that the shift to Agile development brought is that release-to-production is no longer a discrete moment in a service’s lifecycle, with a single massive artifact being transferred from one team to another, together with the attendant responsibilities. Instead, developers make very frequent smaller changes, perhaps even multiple times per day. In addition to these manual changes, the infrastructure is also changing itself automatically in response to various conditions.

In other words, the only constant in IT is constant change. Whatever processes are used in IT Operations need to accommodate this fact of life. One of ITIL’s cherished assumptions is that there exists somewhere a single source of truth, a CMDB. By referring to this authoritative source, users can determine the correct state of a system, its owner, and its relationships with other systems that may help diagnose a fault or determine its impacts.

This complete and up-to-date CMDB has always been more of an aspiration than a fact, but reality has fallen farther and farther behind the ambition. As the volume and pace of change continue to increase, more and more information is too transient ever to be documented in the CMDB or in its attendant knowledge base, which records past problems and their solutions. This means that when a problem does inevitably occur, it is less and less likely that its solution is readily available, or even that useful information can be found to aid in a diagnosis.

AIOps replaces this assumption of a static, documented world – which never really held true – with the recognition of constant change, relying on algorithms and machine learning to understand what matters, present it in context, and help IT Ops specialists diagnose and resolve issues quickly.

Algorithmic Visibility — See Only What Matters

Modern IT infrastructures are incredibly noisy. Some Moogsoft customers are processing over a billion events per day. Obviously that sort of volume can never be managed by any process which requires human evaluation. Equally, any definition of what is or is not an interesting event based on past experience is bound to fail as the environment changes, and in any case requires constant effort to maintain the filter. Instead, patented automatic noise reduction ensures that only relevant events are even considered for analysis by busy human specialists, whose time is much better spent elsewhere. This avoids the creation of irrelevant tickets which, while they can be dismissed out of hand, distract operators and waste their limited time and attention for no good reason.
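To give a feel for noise reduction that is derived from the event stream itself rather than from a hand-maintained filter, here is a toy sketch that scores each event signature by how surprising it is within a batch. To be clear, this is not Moogsoft's patented method, whose details are not described here; the function name, the event shape, and the `min_bits` cutoff are assumptions made purely for illustration.

```python
import math
from collections import Counter


def surprisal_filter(
    events: list[tuple[str, str]], min_bits: float
) -> list[tuple[str, str]]:
    """Keep the (source, message) signatures that are rare in this batch.

    Surprisal is -log2(p), where p is the signature's share of the batch:
    chatter that makes up most of the stream scores close to zero bits.
    """
    counts = Counter(events)
    total = len(events)
    kept = []
    for signature, count in counts.items():
        bits = -math.log2(count / total)
        if bits >= min_bits:
            kept.append(signature)
    return kept


# Hypothetical batch: fourteen routine heartbeats and two unusual events.
batch = (
    [("web01", "heartbeat ok")] * 14
    + [("db01", "replication lag")]
    + [("web02", "disk latency high")]
)
print(surprisal_filter(batch, min_bits=3.0))
```

Nobody had to declare in advance that heartbeats are boring; that falls out of the data, and it would keep falling out of the data as the environment changes.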

See The Whole Problem — Shared Understanding

Because both the causes and the impacts of an incident may be spread across multiple domains, it is not practical to try to identify a single owner of the issue. Instead, Moogsoft’s Algorithmic Clustering Engine is able to use a variety of different supervised and unsupervised techniques to cluster related alerts together into a single incident record. All of the owners of the affected areas then have full visibility into the entire scope of the problem, not just their own subset of the symptoms. In this way, operators will not receive duplicate tickets and will be able to get to grips with the whole problem without having to reassign or escalate the ticket.
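As a deliberately simple stand-in for the idea of clustering related alerts into one incident, here is a sketch that groups alerts by proximity in time. Time-gap grouping is just one basic unsupervised heuristic that I am using for illustration; it is not a description of how the Algorithmic Clustering Engine works, and the `Alert` fields and the `max_gap` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    team: str         # owning team (hypothetical field)
    text: str


def cluster_by_time(alerts: list[Alert], max_gap: float) -> list[list[Alert]]:
    """Group alerts so that consecutive alerts in a cluster are at most
    max_gap seconds apart; a longer silence starts a new cluster."""
    clusters: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if clusters and alert.timestamp - clusters[-1][-1].timestamp <= max_gap:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


def owners(cluster: list[Alert]) -> set[str]:
    """Every team that should see the whole incident, not just its own alerts."""
    return {alert.team for alert in cluster}


alerts = [
    Alert(0.0, "network", "link flap on core switch"),
    Alert(20.0, "database", "replication lag"),
    Alert(45.0, "web", "5xx rate rising"),
    Alert(900.0, "storage", "disk usage warning"),
]
incidents = cluster_by_time(alerts, max_gap=60.0)
print(len(incidents), sorted(owners(incidents[0])))
```

The point of the sketch is the second function: once the network, database, and web alerts land in the same incident record, all three teams are looking at one shared picture instead of three separate tickets.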

Swarming And Collaboration

Finally, those different specialists need to work together to solve the problem. Instead of playing a sequential game of “pass the parcel” with the ticket, Moogsoft enables operators to collaborate, assembling an ad-hoc virtual team of operators based on their particular skills and areas of responsibility. This approach is beginning to be recognized as “swarming” among ITIL specialists, but it is hard to implement when working solely in ITSM tools. On the other hand, Moogsoft’s collaborative Situation Room can be a System of Engagement, backed by the ITSM-aligned System of Record. Automated interfaces behind the scenes make sure that information is synchronized in both directions, ensuring full visibility of the same data, no matter where users look for it. This also means that existing ITIL processes will continue to operate as normal, but with the guarantee that any incidents created are real and actionable and can be worked on immediately.

ITIL’s roots stretch back decades and span the globe. It is very far from being irrelevant – but processes built on past assumptions do need to be re-evaluated periodically. AIOps offers new technology that helps to bring those processes up to date with the new realities of enterprise IT.