It’s not an easy life in IT operations. Sometimes it can feel like your reward for fixing one problem is that two or three more take its place in your queue. The big screen at the head of the room always seems to be just a sea of red, each line a high-priority alert that needs to be dealt with as fast as possible before users show up at the NOC with torches and pitchforks.
The problem is compounded by the ever-increasing complexity and rapid rate of change in modern IT, which together make it extremely difficult to get a good overall picture of the current state and performance of the systems and the services that they support.
Sometimes, IT ops teams can end up like the proverbial drunk searching for his keys under a streetlight. When a well-intentioned passerby asks where he lost them, he waves his arm vaguely off into the darkness. Puzzled, the would-be helper asks, “If you lost the keys over there, why are you searching over here?” The drunk replies: “Because it’s dark over there, and here I have light!”
Ridiculous, right? And yet ops teams are often reduced to doing something very similar, trying to extrapolate from the one good set of data that they have, to understand what is going on in other areas of the system.
Look Past the Obvious
Something very similar happened during the Second World War. The US Center for Naval Analyses was trying to determine what could be done to prevent bombers from being shot down, or at least to reduce the number being lost. As part of this exercise, they undertook a study of the patterns of damage on aircraft returning from sorties. Logically enough, their recommendation was to increase armor coverage on the parts of the fuselage that had suffered the most damage.
However, a statistician named Abraham Wald suggested something different: the armor should instead be added to the areas that had been left undamaged. His reasoning was that the planes they were able to examine were the ones that had taken damage to non-critical areas. The proof was in the fact that they had been able to make it back to base despite that damage.
On the other hand, there was a whole set of bombers that the team could not examine: those whose damage had destroyed them outright or forced the crew to bail out. The critical areas that remained undamaged on the planes that made it back were therefore precisely the ones that, if hit, would prevent a plane from returning to base at all.
We need to apply the same kind of thinking in IT. Just because we can’t see errors in the particular bit of the service that’s in front of us, doesn’t mean that the service isn’t in trouble. More subtly, a problem in one area may also be manifesting in a completely different area.
Filtering a Red Sea
We also need to consider what our assumptions might be hiding from us. Let’s return to that undifferentiated sea of red that the Operations team are trying to make sense of. The obvious first step in keeping the number of alerts to a manageable level is to filter them, discarding anything considered less important or urgent. This does cut down the sheer volume of alerts hitting the Operations team… but what information is being lost?
The problem is that simplistic filtering — “only show me alerts above a certain severity,” “only show me alerts from this system,” “look for these specific error messages” — risks missing out on early indications of a developing problem, or useful context on one that has already been identified but is still being diagnosed. This is the IT equivalent of looking under the streetlight, or patching only where there are already bullet holes.
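To make that concrete, here is a minimal sketch in Python of the kind of static filtering described above. Everything in it (the severity scale, the source names, the message patterns) is invented purely for illustration.

```python
# Minimal sketch of static, rule-based alert filtering.
# Fields, sources, and severities are hypothetical, not from any specific tool.

ALLOWED_SOURCES = {"payments-db", "checkout-api"}
MIN_SEVERITY = 4          # e.g. 1 = info ... 5 = critical
KNOWN_PATTERNS = ("disk full", "connection refused")

def passes_static_filter(alert: dict) -> bool:
    """Keep an alert only if it clears every hard-coded rule."""
    return (
        alert["severity"] >= MIN_SEVERITY
        and alert["source"] in ALLOWED_SOURCES
        and any(p in alert["message"].lower() for p in KNOWN_PATTERNS)
    )

alerts = [
    {"severity": 2, "source": "payments-db-replica", "message": "Replication lag rising"},  # early warning
    {"severity": 5, "source": "payments-db", "message": "Disk full on /var/lib"},           # already a crisis
]

visible = [a for a in alerts if passes_static_filter(a)]
# Only the severity-5 alert survives; the low-severity warning from an
# "unimportant" source is discarded before anyone ever sees it.
```

The low-severity replication-lag warning is dropped on two counts, severity and source, even though it may well be the earliest visible sign of the crisis that follows.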
What if, instead of this naïve filtering, Operations had a reliable mechanism to identify significant, relevant alerts, regardless of their severity rating, their origin, or whether they matched a known problem? Now we are looking at where the keys actually are, and we have a chance of helping more planes get back to base safely.
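As a purely illustrative sketch of that idea (not a description of any particular product’s mechanism), one could rank alerts by how far their current volume deviates from each source’s own recent baseline, rather than by their static severity. All the counts and names below are hypothetical.

```python
# Illustrative only: rank alerts by how unusual their volume is for their source,
# rather than by static severity. Baseline and current counts are hypothetical.
from collections import Counter

baseline = Counter({("cache-tier", "eviction"): 2,      # typical alerts per hour
                    ("payments-db", "disk"): 1})
current = Counter({("cache-tier", "eviction"): 40,      # sudden 20x spike
                   ("payments-db", "disk"): 1})

def unusualness(key) -> float:
    """Ratio of current volume to baseline (plus one to avoid division by zero)."""
    return current[key] / (baseline[key] + 1)

ranked = sorted(current, key=unusualness, reverse=True)
# The cache-tier alerts rank first because their volume is far outside the norm,
# which is exactly the early signal a severity-only filter would have dropped.
```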
If you can pick up a developing issue before it crosses the threshold into a full-blown problem, you have bought yourself the time to fix it before users are even aware of it. From their point of view, the problem never occurred in the first place.
Piecing It All Together
However, these days problems are rarely confined to a single area. It’s more common to see issues whose ultimate cause lies in one domain but whose impact is felt, and reported, across many others. This is why it is crucial to be able to correlate data across domains. It’s important to know which user transactions are running slow or which infrastructure elements are getting overloaded, but without the ability to put those two pieces of information together, it’s going to be very tough to find and resolve the problem.
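As a toy example of putting those two pieces together, the sketch below joins slow transactions to overloaded hosts through a service-to-host mapping. The services, hosts, and numbers are all hypothetical.

```python
# Illustrative cross-domain correlation: pair slow user transactions with
# overloaded infrastructure via a (hypothetical) service-to-host mapping.

slow_transactions = [{"service": "checkout", "p95_ms": 4200}]
overloaded_hosts  = [{"host": "web-07", "cpu_pct": 97}]
service_topology  = {"checkout": ["web-07", "web-08"], "search": ["web-12"]}

def correlate(transactions, hosts, topology):
    """Yield (transaction, host) pairs where a slow service runs on a hot host."""
    hot = {h["host"]: h for h in hosts}
    for txn in transactions:
        for host_name in topology.get(txn["service"], []):
            if host_name in hot:
                yield txn, hot[host_name]

for txn, host in correlate(slow_transactions, overloaded_hosts, service_topology):
    print(f"{txn['service']} is slow and runs on {host['host']} at {host['cpu_pct']}% CPU")
```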
When IT was simpler and computers were physical boxes that stayed put, it was easy to keep a general picture of these mappings in your head. As more and more systems and connections were added, those mental rules of thumb were codified into filters and rule sets. Over time, though, complexity and rates of change have exploded, to the point that no static model can keep up, whether it lives in operators’ heads or is implemented in software.
This is why today’s IT operations teams use algorithmic approaches to sift and correlate the vast reams of information available to them into a coherent picture of what is actually going on in their environment. That understanding is what helps them detect, diagnose, and resolve problems before users even notice that anything is wrong.
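One simple building block of such an approach (shown here as a sketch, not as any particular vendor’s algorithm) is to group alerts that arrive close together in time into a single candidate incident, so operators see one story rather than dozens of disconnected lines. The time window and alert fields below are invented.

```python
# Minimal sketch of one algorithmic building block: clustering alerts that
# arrive close together in time into a single candidate incident.
# Thresholds and fields are invented for illustration.

alerts = [
    {"ts": 100, "service": "cache-tier", "message": "eviction spike"},
    {"ts": 104, "service": "checkout",   "message": "latency high"},
    {"ts": 900, "service": "search",     "message": "index rebuild"},
]

WINDOW_SECONDS = 30

def group_into_incidents(alerts, window=WINDOW_SECONDS):
    """Cluster alerts whose timestamps fall within `window` seconds of the previous one."""
    incidents, current = [], []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if current and alert["ts"] - current[-1]["ts"] > window:
            incidents.append(current)
            current = []
        current.append(alert)
    if current:
        incidents.append(current)
    return incidents

for incident in group_into_incidents(alerts):
    services = sorted({a["service"] for a in incident})
    print(f"Candidate incident spanning: {', '.join(services)}")
```

In practice the grouping would also take topology, alert text, and history into account, but even this crude time-based clustering turns a wall of individual alerts into a handful of incidents that can actually be reasoned about.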
That’s the sort of seamless quality of IT service that is needed as IT becomes more and more pervasive in our personal and business lives.