The addition of machine learning technologies to monitoring tools is a hot topic for IT Ops and DevOps teams. While there are a variety of use cases, the “killer app” for IT is how machine learning improves real-time event management to increase service quality for larger enterprises. It does so by automating the early detection of anomalies before customers notice, and then reducing the impact of production incidents and outages.
This exact use case is the topic of the Moogsoft webinar, “How to Reduce Production Incidents and Outages with Machine Learning,” co-hosted by DevOps.com.
In light of this webinar, I recently had an opportunity to sit down with Moogsoft’s CEO, Phil Tee, who has been pioneering the use of machine learning for IT monitoring tools since 2008. Read excerpts of our conversation below for Phil’s perspectives on this advancing technology and what IT leaders should know about its expanding role across IT operational management.
What is machine learning in the context of IT operational management?
I think of machine learning as a class of methods that analyze data, iteratively learning from the data, to find hidden insights without being explicitly told where to look.
In the context of IT operational management, it’s important to understand that the prior alternative to machine learning is the behavioral model. A behavioral model approach requires you to look at all of the components of your infrastructure in order to understand the potential ways that things can fail or degrade. More specifically, you try to define the particular patterns of events and alerts that match each of the conditions you wish to monitor for.
Most IT operational management tools fall into this category, whether it’s an old-fashioned legacy event manager or a more modern tool that uses an “aggregate & query” approach to IT Ops. Either way, you are configuring the tool to find something specific, something you already know to search for beforehand.
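To make that concrete, the rules-based style boils down to hand-written patterns like the toy Python sketch below. The event fields and the failure pattern are invented for illustration and are not taken from any particular product:

```python
# A hand-written behavioral rule: it only ever finds the failure
# you anticipated and encoded ahead of time.
events = [
    {"host": "db-01", "severity": "critical", "message": "connection pool exhausted"},
    {"host": "web-03", "severity": "warning", "message": "upstream timeout talking to db-01"},
    {"host": "cron-02", "severity": "info", "message": "nightly backup completed"},
]

def known_failure(event):
    # The pattern is fixed in advance; anything outside it goes unnoticed.
    return event["severity"] == "critical" and "pool exhausted" in event["message"]

for event in events:
    if known_failure(event):
        print(f"ALERT {event['host']}: {event['message']}")
```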
Machine learning, on the other hand, uses the data itself to surface interesting features, which may be completely unknowable ahead of time. Unsupervised machine learning, for example, can be used to analyze streams of events or log messages and look for anomalous clusters of messages. These anomalies can then be associated with an operational outcome, capturing the causes and symptoms of a potential failure.
Supervised machine learning, by contrast, can be used to record the activity of users in response to given alerts and clusters of alerts, adjusting the algorithms accordingly. Essentially, machine learning uses data to continually create and update a behavioral model on the fly, as opposed to using a static behavioral model to look for specific outcomes.
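A rough sketch of the shape of both ideas follows, again in Python with invented data: a naive textual similarity clustering stands in for the unsupervised step, and a simple weight update on operator feedback stands in for the supervised step. Real implementations are far more sophisticated; this only illustrates the loop.

```python
from difflib import SequenceMatcher

messages = [
    "timeout querying db-01",
    "timeout querying db-01 from web-03",
    "timeout querying db-01 from web-04",
    "nightly backup completed",
]

# Unsupervised step: cluster messages by textual similarity, with no
# prior model of what a failure looks like.
def similar(a, b, threshold=0.6):
    return SequenceMatcher(None, a, b).ratio() >= threshold

clusters = []
for msg in messages:
    for cluster in clusters:
        if similar(msg, cluster[0]):
            cluster.append(msg)
            break
    else:
        clusters.append([msg])

# Supervised step: operators accept or dismiss surfaced clusters, and a
# per-cluster weight is nudged so future scoring reflects what humans
# actually found actionable.
weights = {i: 0.5 for i in range(len(clusters))}  # neutral prior
LEARNING_RATE = 0.2

def record_feedback(cluster_id, acted_on):
    target = 1.0 if acted_on else 0.0
    weights[cluster_id] += LEARNING_RATE * (target - weights[cluster_id])

record_feedback(0, acted_on=True)   # the timeout cluster was a real incident
record_feedback(1, acted_on=False)  # the backup message was noise

for i, cluster in enumerate(clusters):
    print(f"cluster {i} (weight {weights[i]:.2f}): {cluster}")
```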
Why is machine learning important in this context, and why should IT leaders care?
All of the implications of IT digital transformation today (scale, complexity, change velocity, software abstractions, brownfield-to-greenfield migration challenges) are the reasons why machine learning needs to be applied to IT operational management. It’s impossible to build behavioral models when you have an infrastructure that’s in a constant state of flux. If you want to make sense of the huge volumes of data coming out of apps and infrastructures, then a rules-based approach is dead in the water. In this new era of software, you have to apply machine learning to analyze data in real time; it’s essential for maintaining service quality. IT is only going to become more hybrid, virtualized and fluid, and you need machine learning to get on top of those changes.
Can you explain how machine learning is being used in Moogsoft’s products?
In a modern IT environment, you need to process a huge amount of event data generated from a constantly changing infrastructure. Here at Moogsoft, we use machine learning to perform two specific tasks. The first is “noise elimination”: how do we take tens of thousands of events per second and eliminate the noise while retaining the signal? The second is to group anomalous events into clusters, or situations, that have operational significance.
In practice, we collect events and alerts from many different sources, from machine logs and SNMP, to Twitter and customer support feeds. We routinely find actionable patterns in these events, often hours before customers are impacted, allowing IT Ops teams to fix these problems as they occur, before damage is done.
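The toy pipeline below illustrates the shape of those two tasks: deduplication stands in for noise elimination, and a simple time-window grouping stands in for clustering alerts into situations. The event data, field names and window size are invented for the example; Moogsoft’s actual algorithms are, of course, far more sophisticated.

```python
raw_events = [
    # (timestamp_seconds, source, message)
    (100, "db-01",   "disk latency high"),
    (101, "db-01",   "disk latency high"),   # duplicate -> noise
    (102, "db-01",   "disk latency high"),   # duplicate -> noise
    (104, "web-03",  "upstream timeout to db-01"),
    (105, "web-04",  "upstream timeout to db-01"),
    (900, "cron-02", "backup slow"),         # unrelated, far away in time
]

# Stage 1: noise elimination. Collapse repeated (source, message) pairs
# into a single alert with a count, shrinking the event volume.
alerts = {}
for ts, source, message in raw_events:
    key = (source, message)
    if key in alerts:
        alerts[key]["count"] += 1
        alerts[key]["last_seen"] = ts
    else:
        alerts[key] = {"source": source, "message": message,
                       "first_seen": ts, "last_seen": ts, "count": 1}

# Stage 2: correlate the deduplicated alerts into situations by grouping
# alerts whose first occurrences fall within a shared time window.
WINDOW = 30  # seconds; an illustrative correlation window
ordered = sorted(alerts.values(), key=lambda a: a["first_seen"])

situations, current = [], []
for alert in ordered:
    if current and alert["first_seen"] - current[-1]["first_seen"] > WINDOW:
        situations.append(current)
        current = []
    current.append(alert)
if current:
    situations.append(current)

for i, situation in enumerate(situations, 1):
    print(f"Situation {i}:")
    for a in situation:
        print(f"  {a['source']}: {a['message']} (x{a['count']})")
```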
How is this different from the way that other tool vendors may be trying to incorporate “big data” analytical techniques into their products?
First off, Moogsoft doesn’t profess to have invented machine learning for IT. Both machine learning and AI have been identified areas of research for more than 50 years. However, most machine learning techniques focus on a retrospective analysis of data. As a result, most tools claiming big data analytics capabilities look at yesterday’s telemetry, spot a historical pattern, and then use that to predict tomorrow’s outcomes. This is relatively useless from our customers’ perspective, given the rate of change and the proliferation of “unknown unknowns” in today’s IT environments. Instead, we have engineered our machine learning to work in real time, as things are unfolding, so IT Ops teams can get things resolved earlier, before customers or users are impacted. In parallel, we have built it to work at scale, so the largest enterprises and service providers can also benefit from automated, model-less anomaly detection.
What is the one thing you would tell IT leaders to consider when evaluating machine learning technologies in IT operational management (ITOM) tools?
First of all, be cautious of the machine learning and big data hype. In the cyber-security world, for example, the retrospective analysis approach is widely acknowledged as flawed. In terms of IT Ops tools, businesses are making the transition to supplement, and eventually replace, their brownfield infrastructure with a more cost-effective way of delivering software on highly virtualized platforms. If you’re making a decision about new tools, be sure to ask yourself whether the tool can cope with modern, dynamic infrastructures; for a lot of tools out there, the answer is no.
In closing, can you comment on the direction of machine learning adoption in enterprise IT and how it will influence the future of IT services and applications?
If a monitoring tool isn’t already incorporating some type of machine learning, it has missed the boat. I see an accelerated phase of adoption of data-driven monitoring tools happening now and continuing for the next three to five years. The earlier adopters in IT are already creating competitive advantage by deploying this technology, and the broader market is now getting educated as well.
There will be an expanded use of machine learning and related big data techniques across ITOM, as the ability to build and maintain an accurate CMDB (configuration management database) from the bottom up becomes next to impossible.
I also think that we aren’t far away from machine learning being able to automate mid-level intelligence tasks; this is already starting to happen in the tools we use in our daily lives. There are many more tedious and error-prone tasks in IT Ops that can be automated to improve quality. People are awesome at what they do best, but machines are better at making sense of the sheer volume of [IT] machine data.
More broadly, machine learning will ultimately influence the future of IT services and applications because old-fashioned systems are deterministic and can’t keep up with the increasingly dynamic nature of IT today. With rules-based systems, you end up with more inconsistencies and ambiguities than you can possibly cope with; agile tools are needed for an agile world.