Observability at scale


The end goal of observability is to get as close as possible to knowing the cause of issues that impact the performance of systems.

What is observability?

A standard definition of observability is: ‘A measure of how well internal states of a system can be inferred from knowledge of its external outputs.’

But this is a little convoluted, so I came up with the following simplification:

‘The act of exposing the state of the system, and being able to answer three questions:

What is the status of my system?
What is not working?
Why is it not working?’

Let’s inspect that definition closely, starting with ‘The act of exposing the state of the system’. This is a conscious action: instrumenting code so that the system exposes its state and surfaces data that will help you understand it better.

The goal is not just to expose what we already know we want to monitor (known unknowns), but to expose more data and add context so that we can discover new failure modes (unknown unknowns).

The act of exposing the state of our systems will enable us to answer the three questions.

Thus, the end goal of observability is to get as close as possible to knowing the cause of issues that impact the performance of systems. When implemented well, it can help solve a variety of issues, with results such as better response times and a lower MTTR (Mean Time To Recovery).

Simple example

To make it more concrete, let’s look at the following example regarding the ‘Universe’ company and its teams ‘Earth’ and ‘Mars’.

The Universe company uses Graphite and Grafana for its metrics, and the Elasticsearch and Kibana stack for its logs.

Team Earth instrumented their code to expose the necessary metrics. They thought carefully about which metrics were important to their service and how to collect them, and they crafted their logs to ensure they had enough context.
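To give a feel for what Team Earth's instrumentation might look like, here is a minimal sketch in Python. It assumes metrics are sent using Graphite's plaintext protocol (one `path value timestamp` line per sample) and that logs are written as JSON so they can be filtered by field in Kibana. The metric path, event name, and field names are hypothetical and only there for illustration.

```python
import json


def graphite_line(path: str, value: float, timestamp: int) -> str:
    # Graphite's plaintext protocol expects "<path> <value> <timestamp>",
    # terminated by a newline.
    return f"{path} {value} {timestamp}\n"


def context_log(event: str, **context) -> str:
    # One JSON object per log entry: the event name plus whatever context
    # the engineer decides is needed to explain it later.
    return json.dumps({"event": event, **context}, sort_keys=True)


# Hypothetical usage inside a request handler:
sample = graphite_line("earth.checkout.latency_ms", 12.5, 1600000000)
entry = context_log("payment_failed", order_id="o-1", upstream="billing", latency_ms=12.5)
```

The point is less the format than the decision behind it: someone who knows the service chose which number and which context fields would matter during an incident.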

The team also put probes in place that would query their service and report status as perceived by clients, instead of relying only on metrics exposed by the service.
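A probe of this kind can be quite small. The sketch below is one possible shape, not Team Earth's actual implementation: it requests a URL the way a client would, then reports the status code and the latency it observed. The URL, the latency threshold, and the `ProbeResult` type are assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple, Optional


class ProbeResult(NamedTuple):
    ok: bool
    status: Optional[int]  # None when no HTTP response was received at all
    latency_s: float


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code a client would see for a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(url: str,
          fetch: Callable[[str], int] = http_status,
          max_latency_s: float = 1.0) -> ProbeResult:
    """Query the service once and judge it from the client's point of view."""
    start = time.monotonic()
    try:
        status: Optional[int] = fetch(url)
    except OSError:  # connection failures and timeouts are OSError subclasses
        status = None
    latency = time.monotonic() - start
    ok = status is not None and 200 <= status < 300 and latency <= max_latency_s
    return ProbeResult(ok, status, latency)
```

Calling `probe("https://service.example/checkout")` (a made-up URL) on a schedule and recording the result gives a signal that does not depend on the service reporting on itself. The `fetch` parameter exists only so the probe can be exercised without a network.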

On the other hand, Team Mars only had the metrics exposed by the framework they used.

Their logs were verbose, with the intention of making them readable for humans, and they relied on basic health checks, which were based on pinging their homepage.

Both teams used the same tools in an effort to observe their systems, but they had different results.

During an incident, Team Earth were able to see how their service’s performance was perceived by clients, and to follow the metrics/signals through the different components until they identified a certain metric that was not within the thresholds. They would then look at their logs where they would be able to see more details about the anomaly and work to fix it.

Team Mars could look at their metrics, but wouldn't necessarily find one that was out of the norm, so they would go over to the logs and sift through all those blobs of text, scrambling to make sense of them. They’d end up finding a fix, but the effort and frustration would leave them demotivated.

This shows that, fundamentally, observability is about what people do with the tools at their disposal.

Who is observability for? Who will be implementing it?

Observability is best implemented by the same engineers who wrote the code, since they know their own systems the best.

For this reason, it is better not to try to implement observability on behalf of engineers, but to coach and enable them to observe their services, showing them how best to observe, monitor, and understand their systems.

Your engineers are the heart of observability, so trying to implement processes without their involvement and cooperation will lead to failure.

Observability is about people. It comes down to engineers following best practices, understanding what needs to be observed, how it should be observed, and how to use that knowledge to improve the reliability of their services.

How can you implement observability?

Observability is not simply installing a few tools, giving the manual to the engineers, and expecting it will all be fine. The first question you need to ask is ‘Why should we care about observability?’

We should care about observability because it allows us to understand how our systems are behaving so that we can make them better at what they do. As we gain better visibility into our systems, we can better understand how they react to external factors, such as a user's network connectivity or the limited computing resources on a user's device, and how those factors affect the user's experience. Observability leaves a company better equipped to succeed and to provide the best user experience.

If correctly implemented, engineering can be better because of observability. Keeping this in mind helps set the stage for the work involved in the implementation. The mission is to provide the following for engineering teams:

Train engineers in, and advocate for, the principles:
- Service Level Indicators, Service Level Objectives, Service Level Agreements
- Monitoring and alerting
Provide support when engineers start applying this knowledge.
Select the tools best suited to the company, whether self-hosted or Software as a Service (SaaS):
- Understand the tools' strengths and limitations, and explain those to users.
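When introducing Service Level Indicators and Objectives, a small worked example can help. The sketch below uses the common request-based formulation, which is an assumption here rather than something this article prescribes: the SLI is the fraction of good requests, and the error budget is the number of bad requests the SLO allows. The function names and the figures are illustrative.

```python
def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that were served successfully."""
    if total == 0:
        return 1.0  # no traffic means nothing failed
    return good / total


def error_budget_remaining(good: int, total: int, slo: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers: 10,000 requests, 10 failures, 99.5% objective.
sli = availability_sli(9_990, 10_000)
remaining = error_budget_remaining(9_990, 10_000, slo=0.995)
```

With these numbers the service allows 50 failed requests and has used 10 of them, so roughly four fifths of the budget is left. Framing reliability this way gives engineers a concrete number to alert on and to discuss.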

Who will do that work?

Many options are possible: a dedicated team if the business can afford it, or another central team such as the tooling team (some call it platform or infra). It could also be a group of engineers who are enthusiastic about the subject and want to make a difference.

Without consistent and long-term direction from the top, staff will not believe that the change is a priority, and engineers who are pro-change will not feel empowered.

Leadership must also provide clarity about the reasons and criteria for success.

Final word on tools

Vendors will try to sell observability, but what they sell are tools. Some are good, some are bad, and some are average, but no one can sell observability. Observability is more about people and practices. No matter what tools you use, if you don’t know what you’re doing it won’t work.

People are creative, and they will find ingenious ways of using the tools to fit their thinking instead of adapting their thinking to the tools. So although tools are crucial, they are not where the focus should be. Ultimately, you need to choose tools carefully to ensure that they make it easier for your users to follow the best practices of observability.


Posted: 10 November 2020


