How can you create a healthy balance for SRE teams as your company grows?
Toil is the manual, repetitive work that keeps an engineering team running, but doesn’t deliver any enduring value. In this series, we explored how companies can manage this operational burden when scaling. Here are the highlights!
Episode 01: How to manage toil as you scale
How are SRE teams in other engineering orgs dealing with toil? In this panel, we brought together a small group of leading SREs to discuss how they’ve managed the time they dedicate to toil in a way that lets their services scale.
Featuring Molly Struve (Senior Site Reliability Engineer at Netflix), Dileshni Jayasinghe (Senior Engineering Manager at PagerDuty), Alex Hidalgo (Principal Reliability Advocate at Nobl9), Praise Ogunnowo (DevOps and Site Reliability Engineer at Deimos), and Johnny Boursiquot (Platform Observability Engineer at Heroku), the panel explored:
- How to develop processes that balance toil and project work to support your company’s rate of growth
- How to reduce unnecessary toil to allow SRE teams to contribute more effectively
- How to create fulfillment in your SRE roles to improve productivity and retention
- How the right amount of toil can actually be beneficial to your org
Episode 02: Fixing and preventing high-toil environments
Toil distracts engineers from doing their most meaningful work, takes away time that would otherwise be spent on growth projects, gets in the way of collaboration, and drives away engineers who feel their needs aren’t being met.
But it doesn’t have to be this way. In this article, Lesley Cordero shares practical strategies for fixing and preventing high-toil environments, from using robust feedback loops to build resilience and trust, to recognizing toil elimination as meaningful work.
Episode 03: Three steps for managing toil as you scale
Toil can show up in processes as well as infrastructure and architecture when a company scales. So what’s a growing engineering org to do?
In this article, Adam Shepard shares his experience of managing toil as his company scaled. He outlines their approach to growing every level of the business (first they scaled by adding people, then they standardized with process, then they automated with technology) and walks us through how they reduced toil at each step.
Episode 04: What is toil and why is it damaging your engineering org?
Monkey work, muck work, chores, toil. In this article, Greg Chase introduces the concept of toil and explains why it matters, providing a framework for identifying and containing the ‘wrong’ kind of work and maximizing the ‘right’ kind of work.
After outlining why high levels of toil are toxic (they lead to burnout, get in the way of career progression, and create excessive operational costs), he shares what engineering orgs should be aiming for: you can’t get rid of toil altogether, but you can learn to be effective at reducing it.
A final takeaway
According to Greg Chase, ‘Ironically, toil eats up the time needed to do the engineering work that will prevent future toil. If you aren’t careful, the level of toil can increase to a point where your organization doesn’t have the capacity needed to stop it.’ By listening to and supporting your SREs, planning projects far in advance with toil in mind, and celebrating toil elimination work, you can slowly but surely reduce your operational burden as a team.