Observability for engineering managers

Observability goes beyond logs, metrics, and traces, and it's up to you as a manager of engineers to set out the vision and policies to make that possible.

Observability is a term that has leapt from the obscurity of engineering textbooks into common usage across the technology industry. At its core, observability is an ideal state in which the inner workings of software systems can be monitored and maintained. This goal has become more important as organizations struggle to understand ever more complex, distributed applications.

While certain tools can help developers and site reliability engineers (SREs) strive towards this goal, it's up to you as an engineering leader to lay down the policies and a vision to make observability possible.

What is observability?

Observability as a software principle grew out of control theory, a mathematical discipline developed to explain mechanical engineering processes. In control theory, the formal definition of observability is the ability to infer the internal states of a system based on its external outputs.
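
For readers who want the formal version, one common textbook statement applies to linear time-invariant systems. It is not taken from this article, and the symbols A, C, x, y, and n are introduced here purely for illustration: x is the internal state, y is the output you can measure, and n is the number of state variables.

```latex
\dot{x}(t) = A\,x(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

As a loose analogy only, x corresponds to what your software is doing internally and y to the telemetry it emits: if the outputs don't carry enough information, no amount of analysis will recover the internal state.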

It’s difficult to pin down exactly when this concept made the leap into the software domain, but a 2013 blog post by the Twitter engineering team titled “Observability at Twitter” may be a good place to start. In the decade since, observability has become a topic of keen interest for engineering leaders aiming to build more resilient systems, as well as a useful marketing term for vendors of monitoring and logging software.

Monitoring vs observability

To fully understand observability, let’s compare it to simple monitoring. On the surface, they seem like they might be the same thing – surely monitoring a system is the same thing as observing it?

While monitoring is an important part of observability, achieving full observability goes beyond just throwing a bunch of network packet sniffers and log analyzers at your problem. These tools can help you with the known unknowns – that is, they can monitor the data that you point them at, like central processing unit (CPU) usage or network throughput. A truly observable system, by contrast, is architected to surface the unknown unknowns, or problems you weren’t even aware of yet.

The three pillars of observability

Observability can be said to rest on a foundation of three important system outputs, sometimes called pillars:

Metrics offer a top-line view into system performance. They’re good for telling you what’s happening to your system at a moment in time. Is there an overwhelming amount of network traffic? Is latency acceptably low? Is your disk space filling up? Depending on your application or business, some of these metrics will be key performance indicators (KPIs) that tell you about the immediate health of your system and its ability to respond to user needs.
Logs provide a timestamped record of the events in your system over time. They can help give you context for current problems so you can start diagnosing them. Say there was a mysterious spike in CPU usage this afternoon. Has anything like that happened before? Does it occur on a regular schedule? Does it always coincide with another seemingly unrelated event?
Traces record the path a user or system request takes across multiple components within a system. These are an increasingly important source of information in distributed cloud-based architectures, where requests might be routed across different virtual machine (VM) instances, containers, and components in ways that aren’t immediately obvious. They can tell you why users are seeing poor performance or where bottlenecks are degrading application speed.
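
To make the three pillars concrete, here is a minimal Python sketch in which one request emits all three signals. It uses only the standard library; the handler `handle_checkout`, the metric name, and the in-memory stores are hypothetical stand-ins for whatever telemetry tooling your teams actually use.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()   # metric: a numeric counter aggregated over many requests
log_lines = []        # logs: timestamped, structured records of discrete events
spans = []            # trace: timed steps that share one trace_id per request


def log_event(message, **fields):
    """Append one structured (JSON) log line."""
    log_lines.append(json.dumps({"ts": time.time(), "msg": message, **fields}))


@contextmanager
def span(trace_id, name):
    """Record how long one step of a request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_checkout(user_id):
    """Hypothetical request handler that emits all three signals."""
    trace_id = uuid.uuid4().hex
    metrics["checkout_requests_total"] += 1
    with span(trace_id, "checkout"):
        with span(trace_id, "charge_card"):
            time.sleep(0.01)  # stand-in for a call to a payment service
        log_event("checkout complete", user_id=user_id, trace_id=trace_id)
    return trace_id
```

The shared `trace_id` is what lets someone jump from a log line to the matching trace and see which step of the request was slow.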

Together, these three data sources can drive you toward true observability. There is also a growing set of both proprietary and open-source platforms available to help you derive useful patterns from that information.

As an engineering leader, you are responsible for choosing the tools that are the best fit for your organization’s needs. You’ll also need to decide which of these signals to collect and in how much detail, as each comes at a cost.

For example, if your SREs use a dashboard with an overwhelming number of metrics on it, it’s easy to get lost in the details, so you need to know which are most important. Logs are a rich mine of data, but they can also quickly fill up disk space, so you have to decide how long to keep them and how detailed they can be. And tracing tools can slow application and network performance, so it’s important to understand to what extent that will happen and how sparingly they should be used.
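
Two of these tradeoffs can be put into numbers. The sketch below is illustrative only: `keep_trace` shows one common way to trace sparingly, by recording a fixed fraction of requests, and `log_storage_gb` is back-of-the-envelope arithmetic for log retention. Both function names and all figures are assumptions made for this example.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to record a trace, keeping roughly `sample_rate` of them.

    Hashing the trace ID makes the decision deterministic, so every service
    that sees the same ID makes the same choice.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def log_storage_gb(events_per_second: float, bytes_per_event: float,
                   retention_days: int) -> float:
    """Rough disk needed to retain logs, in decimal gigabytes, uncompressed."""
    seconds_per_day = 86_400
    total_bytes = (events_per_second * bytes_per_event
                   * seconds_per_day * retention_days)
    return total_bytes / 1e9
```

For instance, `log_storage_gb(200, 500, 30)` comes to about 259 GB: a hypothetical service writing 200 log events per second at 500 bytes each needs that much disk to keep 30 days of history before compression.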

Why is observability important?

The key benefit to building your systems with observability in mind is the ability to diagnose underlying problems more effectively and understand the real-time fluctuations that can affect the performance of your digital products – bringing obvious advantages for uptime and customer satisfaction.

Observability isn’t just important to SREs or engineers responsible for the uptime of an application. Developers can use a better understanding of the underlying platform to build more efficient and high-quality software at scale.

This philosophy also helps break down walls between different IT silos, which is of particular interest to the many organizations implementing DevOps. It can help developers and operations specialists better understand the internal state of an application and how their code affects the underlying infrastructure after deployment. That all hopefully adds up to faster troubleshooting and more efficient code development.

Observability can lead to less time spent in meetings as well. An observable platform comes closer to the dream of being “self-documenting”, which means that developers and operations staff can understand how things work at a glance, rather than needing to consult with the people who built it.

Observability challenges

Developing a truly observable platform isn't necessarily an easy task, however. A recent survey from LogDNA and the Harris Poll found that 74% of respondents are struggling in their observability quest. Some of the challenges those companies cited include:

Finding tools that can support multiple use cases and allow different teams to collaborate.
Ingesting the wide variety of data produced by different tools and managing it in standardized formats.
Controlling the costs associated with data storage and management, as well as the various tools and platforms involved.

These are all challenges engineering leaders will have to overcome in their observability journey.

How can you make a system observable?

Implementing observability in practice requires two big steps. First, you need to instrument your application or infrastructure. This means putting software and network tools in place that measure all the things you need to know about your platform – collecting the metrics, logs, and traces we discussed above.
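
As a rough illustration of what instrumenting code can look like, the sketch below wraps a function so that every call records its latency and logs any failure. It assumes a simple in-memory store (`latencies_ms`) and a made-up business function (`lookup_order`); a real system would send this data to whichever telemetry backend you choose.

```python
import functools
import logging
import time
from collections import defaultdict

logger = logging.getLogger("instrumentation_sketch")
latencies_ms = defaultdict(list)  # hypothetical in-memory metric store


def instrumented(func):
    """Wrap a function so each call records its latency and logs failures."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log the failure with its stack trace, then let it propagate.
            logger.exception("call to %s failed", func.__name__)
            raise
        finally:
            # Runs on success and on failure, so every call is measured.
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies_ms[func.__name__].append(elapsed_ms)
    return wrapper


@instrumented
def lookup_order(order_id):
    """Hypothetical business function; the body is a stand-in."""
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"order_id": order_id, "status": "shipped"}
```

Keeping the measurement in a wrapper means the business logic stays untouched, which is part of why instrumenting existing code is often less invasive than teams expect.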

The next step is to find a platform that allows you to manage all that raw data. This can include alert management systems that let you know when metrics are out of safe territory. Dashboards will also help to coordinate data from disparate sources, allowing developers and SREs to visualize patterns of performance and get into those unknown unknowns. There is also a big opportunity in this space for machine learning to be applied to analyze all this data and provide you with insights and automated remediation actions.
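
The alerting idea can be sketched in a few lines. This example assumes a single hypothetical rule, "95th-percentile latency must stay under a threshold", and evaluates it against a batch of latency samples. The rule name and threshold are invented for illustration; real alert managers add scheduling, deduplication, and routing on top.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class AlertRule:
    name: str
    threshold_ms: float  # hypothetical safe limit for 95th-percentile latency


def p95(samples):
    """95th percentile of a list of latency samples (needs at least two)."""
    return quantiles(samples, n=100)[94]


def evaluate(rule, samples):
    """Return an alert message if p95 latency breaches the rule, else None."""
    observed = p95(samples)
    if observed > rule.threshold_ms:
        return (f"ALERT {rule.name}: p95 {observed:.0f} ms "
                f"> {rule.threshold_ms:.0f} ms")
    return None
```

Alerting on a percentile rather than an average is a deliberate choice in this sketch: if one request in ten is very slow, the average may still look healthy while the 95th percentile does not.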

You have many options when it comes to observability tools, from commercial products to open-source offerings such as the fast-emerging OpenTelemetry project. You can also piece together individual observability tools, or go for an integrated platform. You might want to start with Gartner's ratings of top APM and observability tools to assess the vendor landscape.

What comes next?

If your organization is building an application or platform from scratch, you need to ensure that observability is part of the conversation as you plan.

If you need to make your existing infrastructure more observable, start thinking about how you can instrument your current code, what skills your team needs, and what the performance tradeoffs might be.
