How to pay down your monitoring debt

Promoted partner content

Monitoring debt leads to alert fatigue and increased operational risk. Is it time you performed an alert audit and got your monitoring in order?

It is likely a scenario familiar to any engineer who has been asked to take a pager and be on call: the ratio of signal to noise feels completely out of balance, leaving a frustratingly large number of alerts to deal with.

This is what we refer to as “monitoring debt”. It’s the hole you dig by not changing how you monitor while you change your technical system. The longer this drift occurs, the less relevant the alert thresholds can become. The engineers on-call become numb to responding to false positives and can start to treat pages like the boy who cried wolf.

Unfortunately, the solution is not as simple as telling your engineers to just “set better alerts” either.

“Coming up with threshold values is not a trivial task,” Slawek Ligus wrote in Effective Monitoring and Alerting. “The process is often counterintuitive, and it’s simply not feasible to carry out an in-depth analysis for a threshold calculation on every monitored time series.”

This issue is not unique to distributed software systems either. In fact, the bulk of the research on this topic comes from the medical field. Alarm fatigue was cited as a major patient safety concern in a 2019 paper, which noted that health care staff must contend with an average of 700 physiologic monitor alarms per patient each day, in an environment where 80-99% of alerts are false or nuisance alarms.

Ineffective pages are an operational risk, and the deeper into monitoring debt you get, the more likely your on-call engineers are to experience alert fatigue.

By conducting an alert audit, followed by continuously tuning alerts as a part of the software development workflow, you can start to set your organization on the path away from alert fatigue, and even to alert joy.

The telltale signs of monitoring debt

There are a few common symptoms of monitoring debt:

Only one or two people on your team refine alerts, or it’s exclusively the job of site reliability engineers (SREs).
Updating or checking monitor accuracy is not a part of your deployment process.
Alert configuration is not cleanly ordered or hierarchically namespaced.
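As a rough illustration of the last symptom, here is a minimal Python sketch that flags alert names which don't follow a hierarchical naming scheme. The `<team>.<service>.<signal>` convention, the pattern, and the sample names are all assumptions made for this example; substitute whatever scheme your organization actually uses.

```python
import re

# Hypothetical convention: <team>.<service>.<signal>, lowercase and
# dot-separated, with at least three segments.
NAME_PATTERN = re.compile(r"^[a-z0-9_-]+(\.[a-z0-9_-]+){2,}$")


def find_unnamespaced(alert_names):
    """Return the alert names that do not follow the assumed convention."""
    return [name for name in alert_names if not NAME_PATTERN.match(name)]


if __name__ == "__main__":
    # Made-up names, for illustration only.
    names = [
        "payments.checkout-api.error_rate",
        "High CPU!!",
        "disk_full",
    ]
    for name in find_unnamespaced(names):
        print(f"not namespaced: {name}")
```

A long list of flagged names is a hint, not a verdict: it shows where nobody has revisited the configuration for a while.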

Breaking down alert fatigue

Just how risky is it to “set and forget” your monitoring policies?

A high number of false alerts over time trains workers to assume most alerts will be false. Even worse, there is a compounding effect each time the same false alert is triggered. Data for medical clinicians shows the likelihood of acknowledging an alert dropped 30% for each reminder!

This is alert fatigue in action – when a high number of alerts numbs responders and leads to missed or ignored alerts, or delayed responses. The more an engineer is exposed to false alerts, the more they will tolerate, normalize, and eventually ignore them.

The effects of alert fatigue don’t disappear after an on-call shift ends either. Analysts at IDC found that 62% of IT professionals attribute high turnover rates to alert fatigue. Put simply, by not tackling alert fatigue, organizations risk losing tenured engineers to burnout.

The signs to watch out for regarding alert fatigue are:

High number of alerts.
High number of false alerts.
High number of non-actionable alerts.
Ever-present stress about “What important signals am I missing?”
Muting alerts as a first line of defense.
Spending too much time investigating false or non-actionable alarms.

Alert fatigue is serious business, affecting engineers on and off the pager. Now let’s turn to mitigating the harmful effects, starting with an alert audit.

Running an effective alert audit

Start the conversation

Trying to audit your organization’s entire set of alerts is a daunting task. One approach to scoping the area of concern is to start with a particular engineering team or on-call rotation. After the inaugural audit you can focus on developing a repeatable template for other teams to follow.

A key step in building trust with operators is listening to their on-call experience and letting them know that actions will be taken to make things better. Don’t promise the world, but do let folks know that the end goal is to maintain sustainable on-call operations and that this is a shared responsibility with leadership.

If you have a healthy team dynamic, these conversations can be done synchronously over video or audio calls. Otherwise, consider asynchronous options like a survey or asking in a 1:1 before going forward with a group discussion.

Set the baseline

The baseline is a measure of the status quo. At first, the data can seem overwhelming and impossible to manage. Push through those feelings and be honest with your team and wider organization about your findings – the good, the bad, and the ugly.

Gather the facts

Look back across a defined period, be it the last week, month, quarter, or whatever makes sense for your system, and pull the data on pages, warnings, tickets generated, and any other basic signals from your on-call rotations. Now ask:

How often was each team member on-call?
How many pages per shift?
How many warnings per shift?
What is the ratio of alerts from pre-production to production environments?
How many interruptions occurred outside business hours?

Roll this up in a pivot table and share widely with your team. This is what will frame the feelings you gather next.
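If your paging tool can export alert history, the roll-up can be scripted rather than assembled by hand. The following is a minimal Python sketch, not a definitive implementation: the `Alert` fields, the severity and environment labels, and the 09:00 to 17:00 weekday definition of business hours are all assumptions made for illustration, and it treats only pages (not warnings) as interruptions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    fired_at: datetime   # assumed to be in the responder's local time
    responder: str
    severity: str        # "page" or "warning" (assumed labels)
    environment: str     # "production" or "pre-production" (assumed labels)


def is_out_of_hours(ts, start_hour=9, end_hour=17):
    """True on weekends, or outside start_hour..end_hour on weekdays."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def baseline(alerts, shifts_per_responder):
    """Summarize exported alerts into the baseline numbers listed above."""
    pages = Counter(a.responder for a in alerts if a.severity == "page")
    warnings = Counter(a.responder for a in alerts if a.severity == "warning")
    per_responder = {
        who: {
            "shifts": shifts,
            "pages_per_shift": pages[who] / shifts,
            "warnings_per_shift": warnings[who] / shifts,
        }
        for who, shifts in shifts_per_responder.items()
        if shifts  # skip anyone who held no shifts in the window
    }
    return {
        "per_responder": per_responder,
        "alerts_by_environment": dict(Counter(a.environment for a in alerts)),
        "out_of_hours_pages": sum(
            1 for a in alerts
            if a.severity == "page" and is_out_of_hours(a.fired_at)
        ),
    }
```

Calling `baseline(alerts, {"ana": 2, "ben": 1})` with your exported alerts and the number of shifts each person held returns a dictionary that can be pasted into a spreadsheet or shared as-is. The responder names here are placeholders.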

Gather the feelings

Now you should survey your on-call engineers. Specifically check in with engineers earlier in their careers, since they won’t have normalized the noise yet, and with longer-tenured engineers who know the hot spots but have become jaded. Walk through the baseline data with them and listen to how they interpret it and whether they agree with your findings.

Ask probing questions

What percentage of time are engineers working on the sprint while on-call?
Are alerts named confusingly?
How does the team manage their monitoring configurations today?
Is there a team or vendor that is providing “out of the box” alerts?
How many alerts were received as a result of doing planned work?

A simple framework for pulling this together is:

Feeling: “None of the last five pages I got were actionable.”

Fact: “The primary rotation paged five times over the last week.”

Finding: “Team X is getting paged frequently for non-actionable reasons.”

Evaluate alerts

Now you have to evaluate your alerts. All of them. Seriously.

Take one alert and look at its history: have the times it has fired been actionable for responders or not?
Walk through investigating the last time it fired.
Is the warning threshold reasonable?
Is the alert threshold reasonable?
Is there a runbook or are links to monitoring and docs easily accessible?
Decide what action to take:
- Do nothing
- Delete
- Change:
  - Tune the threshold.
  - Demote to a warning.
  - Demote to a daytime ticket.
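To make the first pass less daunting, the checklist above can be turned into a rough triage helper. The following Python sketch is only a conversation starter: the `AlertHistory` fields and the 80% and 10% cut-offs are assumptions chosen for illustration, and whether each firing was actionable still has to come from the people who responded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertHistory:
    name: str
    times_fired: int        # firings in the look-back window
    times_actionable: int   # firings responders judged actionable
    has_runbook: bool


def suggest_action(history, keep_above=0.8, delete_below=0.1):
    """Suggest a starting point for discussion; the cut-offs are illustrative."""
    if history.times_fired == 0:
        return "review: never fired in the look-back window"
    ratio = history.times_actionable / history.times_fired
    if ratio >= keep_above:
        return "do nothing" if history.has_runbook else "keep, add a runbook"
    if ratio < delete_below:
        return "delete"
    return "change: tune the threshold or demote"


if __name__ == "__main__":
    # Made-up history, for illustration only.
    noisy = AlertHistory("payments.checkout-api.latency", 20, 3, True)
    print(noisy.name, "->", suggest_action(noisy))
```

The output is a suggestion for the team to debate, not a decision. An alert that rarely fires but guards against a severe failure may deserve to stay, whatever its ratio.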

Anything that can page a human is fair game. Keep in mind that the goal is to help the human operators at the end of the day. The art of monitoring and alerting isn’t widely taught, so a quick way to bring everyone up to speed is to evaluate the first alert together as a group.

Ask the team: “What business impact should rise to the level of alerting a human?”

Once every team member has a good sense of the process and expectations, there are several options for completing the evaluations. Depending on the number of engineers in the on-call rotation or the sheer number of alerts, the remaining evaluations can be divided up among the engineers, folded into the primary on-call engineer’s weekly tasks, or continued as an all-team paired activity.

Re-baseline

Wait until a full rotation has passed through your team before looking back and holding an on-call retrospective. Review the actions taken to tune alerts and discuss how the experience has or has not changed the feeling of holding the pager.

If progress hasn’t been substantial, or has been lopsided, figure out where to shift investments and continue iterating!
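For the retrospective, it helps to put the before and after numbers side by side. Here is a minimal Python sketch, assuming you have captured the same metrics for both periods as plain dictionaries; the metric names and values are hypothetical.

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    if before == 0:
        return None  # a percentage is not meaningful from a zero baseline
    return (after - before) / before * 100


def compare_baselines(before, after):
    """Compare two metric dictionaries on the keys they share."""
    return {
        metric: percent_change(before[metric], after[metric])
        for metric in before
        if metric in after
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    first_audit = {"pages_per_shift": 10, "out_of_hours_pages": 4}
    one_rotation_later = {"pages_per_shift": 5, "out_of_hours_pages": 4}
    for metric, change in compare_baselines(first_audit, one_rotation_later).items():
        print(f"{metric}: {change:+.0f}%")
```

Numbers like these frame the discussion, but the team's own account of what it feels like to hold the pager is still the measure that matters.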

Celebrate, share, repeat

After the first team has figured out a workflow for auditing alerts, take a beat to recognize the improvements and investments you’ve made! Operations work tends to be unglamorous, but the benefits of tackling alert fatigue are certainly cause for celebration.

Remember to share your findings widely and openly with other teams both informally and formally.

Repeat across teams and fold alert tuning into the software development lifecycle to maintain your results.

Reflections

Why focus on alerts?

There is a deluge of data that can be alerted on and a culture of “set it and forget it” monitoring which often buries operators in low-quality signals. The 2022 Cloud Native Complexity Report from Chronosphere found 59% of respondents reported that half of their incident alerts are not actually helpful or usable. That needs to change.

If you find yourself agreeing with that statistic, it might be time to tackle your monitoring debt with an alert audit. Let us know how it goes!
