What is toil and why is it damaging your engineering org?

Promoted partner content

The tech industry has always had localized expressions for work that was necessary but didn’t move the company forward.

‘Busy work.’ ‘Monkey work.’ ‘Muck work.’ ‘Chores.’ Now, thanks to the SRE movement, there is a word we can all use. That word is ‘toil.’

The concept of toil is a unifying force because it provides a way of identifying – and therefore containing – the work that takes up our time, blocks people from fulfilling their engineering potential, and doesn't move the company forward.

Why does toil matter?

‘Not enough time and too much to do’ describes the default working conditions inside IT Operations. There’s an unlimited supply of planned and unplanned work – new things to roll out, incidents to respond to, support requests to answer, technical debt to pay down, and the list goes on.

With only so many hours in the day, how do we make sure what we’re working on actually makes a difference? How do we make sure our teams and our broader organizations are maximizing the kinds of work that add value, and finding ways to eliminate work that doesn’t?

To maximize both the value of your organization and the human potential of your colleagues, you need a framework to identify and contain the ‘wrong’ kind of work and maximize the ‘right’ kind of work. Understanding what toil is, and keeping the amount of toil contained, provides that framework. It benefits your company economically and improves the working lives of your fellow engineers. That’s a win-win situation.

Why are high levels of toil toxic?

Toil may seem innocuous in small amounts. Concern over individual incidents of toil is often dismissed with a response like ‘nothing wrong with a little busy work.’ However, when left unchecked, toil can quickly accumulate to levels that are toxic to both the individual and the organization.

For the individual, high levels of toil lead to:

Discontent and little sense of accomplishment
Burnout
More errors, leading to time-consuming rework
No time to learn new skills
Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)

For the organization, high levels of toil lead to:

Constant shortages of team capacity
Excessive operational support costs
Inability to make progress on strategic initiatives (the ‘everybody is busy, but nothing is getting done’ syndrome)
Inability to retain top talent (or to attract it, once word gets out about how the organization functions)

One of the most dangerous aspects of toil is that it requires engineering work to eliminate it. Think about the last deluge of manual, repetitive tasks you experienced. Doing those tasks doesn’t prevent the next batch from appearing.

Reducing toil requires engineering time, either to build supporting automation that removes the need for manual intervention or to enhance the system so that the intervention isn’t needed in the first place.

The engineering work needed to reduce toil typically takes one of three forms: creating external automation (i.e., scripts and automation tools outside of the service), creating internal automation (i.e., automation delivered as part of the service), or enhancing the service so that it no longer requires maintenance intervention.

What should we be aiming for?

Working in an organization with a high ratio of engineering work to toil feels like everyone is swimming towards a goal. When there’s a low ratio of engineering work to toil, it feels like you’re treading water, at best, or sinking, at worst.

Instead of your people spending their time on non-value-adding toil, you want them to spend as much of their time as possible on value-adding engineering work.
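As a rough illustration, assuming a team already tags its logged hours as either toil or engineering work, the ratio described above could be tracked with a few lines of Python. The `TimeEntry` type, its `is_toil` flag, and the sample week are all hypothetical; this is a sketch of the arithmetic, not a prescribed way to measure toil.

```python
from dataclasses import dataclass


@dataclass
class TimeEntry:
    description: str
    hours: float
    is_toil: bool  # hypothetical label the team assigns when logging time


def engineering_to_toil_ratio(entries):
    """Hours of engineering work per hour of toil."""
    toil = sum(e.hours for e in entries if e.is_toil)
    engineering = sum(e.hours for e in entries if not e.is_toil)
    if toil == 0:
        return float("inf")  # no toil logged in this period
    return engineering / toil


def toil_share(entries):
    """Fraction of all logged hours that went to toil."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


# Made-up sample week for one engineer.
week = [
    TimeEntry("semi-manual deployment", 6.0, True),
    TimeEntry("DNS changes", 2.0, True),
    TimeEntry("restart and diagnose service", 4.0, True),
    TimeEntry("build deployment automation", 12.0, False),
    TimeEntry("feature work", 16.0, False),
]

ratio = engineering_to_toil_ratio(week)
share = toil_share(week)
print(f"engineering : toil = {ratio:.2f}")  # prints "engineering : toil = 2.33"
print(f"toil share = {share:.0%}")          # prints "toil share = 30%"
```

Tracking either number over several weeks would show whether the team is swimming towards its goal or starting to tread water.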

A goal of ‘no toil’ sounds nice in theory. In reality, however, a ‘no toil’ goal isn’t attainable in an ongoing business. Technology organizations are always in flux, and new developments (expected or unexpected) will almost always cause toil. But just because a task is necessary to deliver value to a customer doesn’t mean that it’s value-adding work. For people who are familiar with Lean manufacturing principles, this is similar to Type 1 Muda (necessary, non-value-adding tasks).

Toil may be necessary at times, but it doesn’t add enduring value (i.e., a change in the perception of value by customers).

Toil comes from sources you already know about but haven’t had the time or budget to automate (e.g., semi-manual deployments, schema updates/rollbacks, changing storage quotas, network changes, user adds, adding capacity, DNS changes, service failover). It also comes from any number of unforeseen conditions that can cause incidents requiring manual intervention (e.g., restarts, diagnostics, performance checks, changing config settings).

Although we can’t get rid of toil altogether, we should learn to be effective at reducing it and keep it at a manageable level.

Reflections

Ironically, toil eats up the time needed to do the engineering work that will prevent future toil. If you aren't careful, the level of toil can increase to a point where your organization doesn’t have the capacity needed to stop it. If we use the technical debt metaphor, this would be ‘engineering bankruptcy.’

The SRE model of working – and all of the benefits that come with it – depends on teams having ample capacity for engineering work. This capacity requirement is why toil is such a central concept for SRE. If toil eats up the capacity to do engineering work, the SRE model doesn’t work. An SRE perpetually buried under toil isn’t an SRE; they are just a traditional long-suffering system administrator with a new title.
