The important parts of DevOps are not about creating fancy new tools or systems. In fact, if anyone attempts to sell you a tool to “do the DevOps,” hurry in the other direction (after first checking that you still have your wallet). Rather, DevOps is about putting together activities that were being performed already, but that had been isolated and walled off from each other.
An interesting story in this vein is the development of the Limpet Mine (mk 1). Seriously, read the whole post—it’s short—but here is the kicker:
Stuart Macrae bought every washing-up bowl, every aniseed ball, and every condom he could find. That became the “Limpet Mine (mk 1).” How did it work, this combination of washing-up bowls, aniseed balls, and condoms? Well, in just one night, 14 Marine Commandoes sank seven Japanese ships using those limpet mines.
Yes, really — the guy made anti-shipping mines from washing-up bowls, aniseed balls (a jawbreaker-like candy), and condoms.
How Does This Apply to IT?
There is actually a very real parallel to IT in general, and to Ops in particular: In modern IT, we stand on the shoulders of giants. Nobody builds their own toolchain; we all rely on common libraries built by others. Some say this has gone too far, especially when one programmer deleting 11 lines of code can cause widespread outages, but realistically speaking, where do you draw the line? No third-party libraries, but you can use what ships with your framework? No frameworks, just what comes with the language? How high-level can that language be? Are we allowed JavaScript, or does it have to be C? Why stop at C? Let’s do everything in assembler!
Of course not.
What we do is take existing components off the shelf and assemble them in a creative manner to build something new. Why spend the time, effort, and expense to build an aluminum dome when you can buy a perfectly serviceable one right now? The value of the limpet mine is not decreased by the humble origins of some of its components; in fact, this only enhances its value, not to mention underlining the ingenuity of its creator.
There is a lot of value sitting on the shelf in IT Ops rooms, too. Each technology domain, each team, has its own tool to track and monitor one specific aspect of the status and performance of the system. All of this valuable information is sitting there, shown on screens all around the room, not to mention the big status board at the head of the NOC.
The problem is that nobody is putting these potentially valuable ingredients together. When a connection is made between domains, it is usually the result of a laborious and time-consuming process of reassignment, escalation, and blamestorming. Ultimately, it relies on the skill and experience of a small number of long-serving team members who are able to make non-obvious connections between pieces of information that are already there.
IT departments cannot rely on having a Stuart Macrae show up every time he is needed to put candy in a washing-up bowl. If there is only one person (or only a small number of people) who can make that connection, sooner or later something will occur when nobody is available to do so.
Fortunately, it turns out that computers can be taught to spot those patterns as well. First-generation event management systems used hard filters and deterministic rules to identify important pieces of information and put them together into a higher-level view. Today, a new breed of operations tools is emerging, using algorithms to spot those patterns and correlations between existing pieces of information.
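To make the contrast concrete, here is a minimal sketch, not taken from any particular product. It sets a hard first-generation style rule next to a very simple form of correlation: grouping alerts from different tools that arrive close together in time. The Alert fields, the source names, and the 60-second window are all assumptions chosen for illustration; real tools use far richer signals than timestamps alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised it (hypothetical names)
    host: str
    timestamp: float  # seconds, on any shared clock
    message: str


def severity_rule(alert):
    """First-generation style: a hard, deterministic filter on one alert."""
    return "CRITICAL" in alert.message


def correlate_by_time(alerts, window=60.0):
    """Group alerts that follow the previous one within `window` seconds."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if clusters and alert.timestamp - clusters[-1][-1].timestamp <= window:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


def cross_domain_clusters(alerts, window=60.0):
    """Keep only the clusters that involve more than one source tool."""
    return [
        cluster
        for cluster in correlate_by_time(alerts, window)
        if len({a.source for a in cluster}) > 1
    ]


alerts = [
    Alert("network", "sw1", 0.0, "link flap"),
    Alert("database", "db1", 30.0, "CRITICAL: replication lag"),
    Alert("app", "web1", 50.0, "HTTP 500 rate high"),
    Alert("storage", "nas1", 1000.0, "disk 80% full"),
]

for cluster in cross_domain_clusters(alerts):
    print([f"{a.source}/{a.host}" for a in cluster])
```

The hard rule only ever surfaces the database alert, while the time-window grouping puts the network, database, and application alerts side by side, which is the kind of cross-domain connection that otherwise depends on someone in the room noticing it.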
Alert Fatigue
The problem is not that Ops has too little information. Most Ops teams are already drowning in alerts, and alert fatigue is a real issue. So the solution is not to add even more monitoring.
The old rules-based approach is also showing its limits, as the race to keep up with the ever-increasing complexity and accelerating rate of change in IT becomes unwinnable. So, the solution is not to throw good money after bad in that pursuit.
The way forward is to adopt learning systems that can sift existing information for valuable connections and learn on a continuous basis what is expected and what is abnormal. This is how IT Ops can detect, diagnose, and remediate problems faster.
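As a rough illustration of what "learning what is expected" can mean at its simplest, the sketch below keeps a running mean and variance of a single metric (Welford's online algorithm) and flags values that sit far from the baseline learned so far. The class name, the threshold of three standard deviations, and the warm-up length are assumptions made for this example; production systems use considerably more sophisticated models.

```python
import math


class RunningBaseline:
    """Learns a metric's normal range online and flags outliers."""

    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0            # number of values learned so far
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations
        self.threshold = threshold
        self.warmup = warmup  # values to learn before flagging anything

    def is_abnormal(self, value):
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return value != self.mean
        return abs(value - self.mean) / std > self.threshold

    def update(self, value):
        # Welford's update: no history of past values is stored.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def observe(self, value):
        """Return True if the value looks abnormal; otherwise learn from it."""
        abnormal = self.is_abnormal(value)
        if not abnormal:
            self.update(value)
        return abnormal


baseline = RunningBaseline(warmup=5)
for latency_ms in [100, 102, 98, 101, 99, 100, 103, 97, 500, 101]:
    if baseline.observe(latency_ms):
        print(f"abnormal latency: {latency_ms} ms")
```

Skipping the update for abnormal values keeps a single spike from distorting the baseline, at the cost of never adapting to a genuine, lasting shift; that trade-off is a simplification made here to keep the example short.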
This is how you blow up battleships: with washing-up bowls and ingenuity. In IT we have plenty of ingenuity and plenty of commodity components; we just need to use one to assemble the other.