To leverage IT innovations like cloud computing, containers and microservices, and to meet customer experience expectations, IT teams must monitor their applications and services differently.
The reason is that developers are deliberately emitting information from their code in order to understand and manage the complexity of today’s ephemeral and dynamic environments. The goal is to make the services that Site Reliability Engineers (SREs) monitor and operate more observable and, ultimately, more reliable. This typically begins with increasingly rich and structured logs, precise traces and emitted metrics, a practice more formally known as observability.
Observing and tracking normal behaviors, and analyzing the deeper internals of a service without unwanted manual intervention and static rules, yields the key bits of information needed to assemble all pieces of the puzzle.
These deeper internals come from observability’s three pillars: log events, distributed tracing (traces) and metrics. These insights need to be analyzed together and correlated to determine whether a real incident might occur or, worse, is already occurring.
Anomalies, events and alerts on their own will not give you this context. Traditional dashboards are designed to unify your observability data so you can interpret it and derive your own insights and context. The fundamental flaw in this approach is that it doesn’t surface the incident and the root cause you need to restore a service, prevent an outage and protect the customer experience.
Logs
Logging is supposed to help developers and system administrators understand where and when they’ve gone wrong. Ideally, logs would suffice as indicators of what’s happening. However, logs by themselves aren’t enough. Here’s why.
Applications and services generate log data for every request, but, to be understood, those logs must be aggregated. As a result, you end up with an immense accumulation of logs that is expensive to collect, process and store. The deluge of log data has gotten worse in recent years because applications and services no longer run on a standalone box; they are distributed, containerized and ephemeral.
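To make “log data for every request” concrete, here’s a rough sketch of per-request structured logging in Python. The service name, field names and handler setup are illustrative assumptions for the example, not a prescription from this post.

```python
import json
import logging
import time
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so downstream aggregation
    can parse fields instead of free-form text."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(payload):
    # One structured record per request is where the volume comes from.
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    # ... real work would happen here ...
    logger.info("request completed in %.3fs", time.monotonic() - start,
                extra={"request_id": request_id})
```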
Because aggregating log data requires a central datastore, the need for bandwidth, compute processing and storage increases. IT teams often reduce the expense by keeping only some logs and discarding others.
The problem is that to gain a comprehensive understanding of what happened, all logs from all your services must be collected and stored. If you’re only retaining logs from the last n minutes or from certain log levels, you’ll lack the fundamental data to understand a malfunction.
Even when you’re collecting and storing all of your logs, you still face a challenge: finding the information you need, even when you know the time frame in which the log events were generated and what you’re looking for.
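Here’s a hedged sketch of what that search often amounts to in practice: scanning everything you kept for a time window, a level and a pattern. The file path, field names and timezone-aware ISO-8601 timestamps are assumptions for the example.

```python
import json
from datetime import datetime, timezone

def find_events(path, start, end, level="ERROR", needle=""):
    """Scan newline-delimited JSON logs for records in [start, end)
    at a given level whose message contains `needle`. Timestamps are
    assumed to be timezone-aware ISO-8601 strings."""
    hits = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            ts = datetime.fromisoformat(record["timestamp"])
            if start <= ts < end and record["level"] == level and needle in record["message"]:
                hits.append(record)
    return hits

# Example: errors in a ten-minute window around a reported incident.
window_start = datetime(2021, 5, 4, 3, 20, tzinfo=timezone.utc)
window_end = datetime(2021, 5, 4, 3, 30, tzinfo=timezone.utc)
matches = find_events("checkout.log", window_start, window_end, needle="timeout")
```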
Massive amounts of money have been spent on logging platforms that parse, index, process and store terabytes of logs per second. But that doesn’t work. It has just allowed you to write query after query that you hope returns a failure scenario and alerts you.
Put matter-of-factly: I’ve never heard of getting an ROI on a log management solution.
Traces
Troubleshooting where and why something is slow with clear visual connections is powerful. You can see which services are called, which take too long, and which fail. You have the entire path visualized and the ability to correlate the internal calls and the calls between external services. This is very difficult to understand from logs, assuming you can find matching logs, which is another challenge unto itself.
The key to understanding this dynamic lies with traces. Distributed tracing isn’t a new approach. It’s a recycled, semi-automated, albeit deeper, take on age-old dependency mapping with a CMDB, a manually created and updated database of all your assets and their dependencies.
Today, systems are containerized and built on service-oriented architectures, making it impossible to manage them in a CMDB. Distributed tracing is semi-automated in the sense that if you haven’t thought about it from application design through implementation, you must add custom code or an embedded, configured agent. This can be a lengthy and exhausting process, especially if you’ve come in late in the development lifecycle or don’t fully understand the legacy code or the legacy thought process.
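To illustrate what “add custom code” means, here’s a minimal manual-instrumentation sketch using the OpenTelemetry Python SDK. The span names are placeholders, and the console exporter stands in for whatever tracing backend you’d actually export to.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One-time setup: every service in the call path needs something like this,
# which is part of why retrofitting tracing onto existing code takes effort.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def place_order(cart):
    # Each unit of work you care about gets wrapped in a span by hand.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("cart.items", len(cart))
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment service; context propagates to child spans
        with tracer.start_as_current_span("reserve_inventory"):
            pass  # call the inventory service
```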
In summary, distributed tracing by itself also can’t give you the full insights needed.
(Figure: a few widely popular services and the connections between their 500+ microservices.)
Metrics
Metrics, put simply, are numbers measured over intervals of time, or what’s known as time-series data. A big advantage of metrics is the consistency with which they are generated. This makes your bandwidth usage, processing and storage predictable, which isn’t possible with logs. You can predict the increases when you start to collect new metrics and plan for anticipated container and service growth.
While logs provide exact data, and distributed tracing gives you visualization of the service calls and dependencies, metrics allow for deep and efficient analysis, and thus offer the most value. This is especially true when it comes to applying statistical algorithms and detecting anomalies in your metrics.
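As a taste of the kind of statistical analysis metrics make cheap, here’s a simple rolling z-score check. The window size and threshold are arbitrary choices for the sketch, and real anomaly detection is considerably more sophisticated.

```python
from collections import deque
from statistics import mean, stdev

class ZScoreDetector:
    """Flag a sample as anomalous when it sits more than `threshold`
    standard deviations away from the mean of the recent window."""
    def __init__(self, window=60, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        is_anomaly = False
        if len(self.history) >= 10:  # need some history before judging
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.history.append(value)
        return is_anomaly

detector = ZScoreDetector()
for latency_ms in [12, 14, 13, 15, 12, 13, 14, 12, 13, 14, 250]:
    if detector.observe(latency_ms):
        print("anomalous latency:", latency_ms)
```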
Metrics always beg the question: “What do I need to collect?” Answer: Everything.
When it comes to analyzing metrics, an easy place to start is the infrastructure your services run on. The infrastructure will have built-in metrics that can be analyzed to determine the root cause of incidents.
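For example, here’s a small sketch of sampling a few host-level metrics with the psutil library; psutil is one option among many, and the 15-second interval is an arbitrary choice.

```python
import time
import psutil

def sample_host_metrics():
    # A fixed set of numbers per sample is what keeps the data volume predictable.
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

if __name__ == "__main__":
    while True:
        print(int(time.time()), sample_host_metrics())
        time.sleep(15)  # 15-second collection interval
```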
The next step would be to measure how busy your services are by capturing the number and duration of requests to your services. In particular, you should capture the percentage of requests that fail.
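A minimal sketch of capturing those request metrics with the Prometheus Python client follows; the client is an assumed tooling choice, and the metric and label names are illustrative.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration in seconds")

def do_work(request):
    pass  # hypothetical business logic

def handle(request):
    start = time.monotonic()
    status = "200"
    try:
        do_work(request)
    except Exception:
        status = "500"  # failed requests feed the error-rate calculation
        raise
    finally:
        REQUESTS.labels(status=status).inc()
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    while True:
        handle(None)
        time.sleep(1)
```

The failure percentage then falls out as the ratio of the error-status counter to the total, computed wherever the metrics are queried.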
Of course, there will always be specific cases for your own application that you want to measure.
Unified Monitoring & Observability
Unification is not diagnostics. A number of observability solutions let you run ad-hoc and stored queries to stitch some data together across several dashboards, so you can visualize the data, interpret your own insights and understand your environment and its events.
Receiving multiple alerts from disparate systems and then running ad-hoc queries while attempting to look at dashboards is flawed and will not reduce the number of 3:30 am emergency calls you receive.
Unifying your observability and monitoring is critical to managing today’s complex environments. But the only way to make sense of the data overload is to apply multiple layers of AI, from data discovery all the way through the postmortem phase.
Understanding Your Dashboards & Data
Gone are the days of static-only thresholds and manually selected metrics and widgets across a theater of dashboards in a dimly lit operations center, not to mention the data lake full of useless data: the data you never look at because it indicates that everything is normal.
The industry-leading Moogsoft AIOps platform is extending to each of the three pillars of observability so you can accurately detect anomalies, automatically surface important information, understand services’ normal behaviors, lower TCO, improve SLOs and error budgets, and ensure your customer experience. All in a manageable, consumable, automated and scalable way.
Tune into the next blog in this series to read how AIOps applied to metrics eliminates manual thresholds, automates anomaly detection and learns the normal operating behavior of your services.
See the other posts to date in the series here: