How can we eliminate unnecessary toil for site reliability engineering teams?
Psychological safety is crucial for building high-performing teams. These types of teams are not built passively – they require continuous intention and work to build or maintain the trust necessary for producing high-quality, impactful work.
Many factors determine whether a culture is psychologically safe, but one overlooked threat for site reliability engineering teams is toil, particularly when the amount of toil has crossed the line from expected to unreasonable.
Defining a high-toil environment
Toil is the dull, repetitive, automatable engineering work that doesn’t contribute any significant value. Every team should prepare for some degree of toil since it’s a natural consequence of operating services. Site reliability engineering teams should, however, still be primarily focused on the long-term project work that will allow their services to scale or grow in functionality.
When a team reaches the point where it’s chronically spending more time on toil than on project work that helps the business move forward, that team is operating in a high-toil environment. The most worrying kind is toil that grows with service growth, because it takes up more time as the service scales.
As we continue navigating the challenges of scaling and growing our services, we want to make sure that we have the right set of priorities to enable this work. If we’re stuck in a state where our toil is increasingly taking up our time, we not only risk being unable to foster growth, we risk eventually being unable to meet the expectations of our current growth.
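One way to make the definition of a high-toil environment concrete is to track it. The sketch below is only an illustration, not a prescribed method: it assumes the team logs weekly hours for toil and for project work, and it treats "chronically" as "for each of the last four weeks." The `WeekLog` type, the function names, and the four-week window are all hypothetical choices, not something the definition above specifies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WeekLog:
    """Hours one team logged in a single week (hypothetical tracking format)."""
    toil_hours: float
    project_hours: float


def toil_fraction(week: WeekLog) -> float:
    """Share of the week's logged time that went to toil."""
    total = week.toil_hours + week.project_hours
    return week.toil_hours / total if total else 0.0


def is_high_toil(weeks: list[WeekLog], chronic_weeks: int = 4) -> bool:
    """True if toil exceeded project work in each of the most recent weeks.

    `chronic_weeks` is an assumed stand-in for "chronically"; pick a
    window that matches how your team plans its work.
    """
    if len(weeks) < chronic_weeks:
        return False  # not enough history to call it chronic
    recent = weeks[-chronic_weeks:]
    return all(w.toil_hours > w.project_hours for w in recent)
```

A single bad week does not trip the check; only a sustained run of weeks where toil outweighs project work does, which mirrors the distinction between expected toil and a high-toil environment.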
The human consequences of toil
The cost of a high-toil environment isn’t limited to the resources and time it takes away from project work. There’s also a cost specific to the engineers working in these environments. It shows up as distraction, which threatens psychological safety in multiple ways.
Growth stagnates
The worst consequence of toil is that it distracts engineers from doing their most meaningful and best work. By definition, toil takes away time that would otherwise be spent on engineering projects that push the edges of our knowledge and skill sets. Growth opportunities are essential for job satisfaction, so while these toil tasks are necessary, their inability to provide new opportunities to grow as engineers can be detrimental to engineers' relationships with work.
Collaboration becomes harder
The reactive nature of toil often means engineers are context switching frequently to address multiple sources of toil. The isolation that comes from working on these quickly changing, manual tasks makes it harder to find opportunities to collaborate with other engineers. Collaboration is essential for keeping engineers engaged, so the distracting quality of toil erodes yet another area of our work life.
Culture cost
The way lost growth and collaboration opportunities impact our individual relationships with work and our teammates can naturally lead to an overall negative team culture. Teams can become low-energy, resigned to the circumstance of constantly being in fire-fighting or repetitive-task mode. This degrades the ability to think critically about what can be improved over time, leaving engineers unable to operate as their best selves.
Psychological cost
High-toil environments leave little room to address the ongoing challenges of chronically dealing with toil. Issues that are left unaddressed drive people away from the environments that allow them to fester. The psychological impact that comes from not having their needs met or their voices heard can drive engineers to look for new roles or teams. Ultimately, they leave, thinking it doesn’t have to be this way.
Strategies for fixing and preventing high-toil environments
These engineers are right: it doesn’t have to be this way. It’s much easier to prevent a team from becoming a high-toil team than it is to fix one, but the strategies we’ll talk about in this section can be applied to both circumstances.
The important thing to note about ‘fixing’ a high-toil environment is that there’s usually at least some amount of ‘damage control’ that needs to be done to get to a healthier place. The most important thing you can do in these situations is restore your team’s sense of empowerment.
Being in a state of high toil isn’t one person’s challenge to solve on their own. Each engineer on the team should feel empowered to take ownership of their own experience and of the team’s collective experience. As engineering leaders, making sure engineers have the avenues to channel this energy is crucial.
Build robust feedback loops
To restore or build trust with your engineers, they need to be heard. Feedback loops serve the function of communicating pain points throughout the team. What this looks like can vary – every team has its own way of doing things – but we should make it easy to find patterns that can serve as input for short-term and long-term improvements. Coupled with a prioritization framework, like service level objectives, having a robust feedback loop brings your team one step closer towards being resilient to toil.
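As a rough illustration of finding patterns in feedback, the sketch below tallies how often each pain point comes up. It assumes, hypothetically, that each piece of feedback is recorded as a dictionary with a `tags` list naming the sources of toil it mentions; the function name and the record format are inventions for this example, and your team’s feedback may live in a very different shape.

```python
from collections import Counter


def top_pain_points(feedback, n=3):
    """Return the n most frequently mentioned tags with their counts.

    `feedback` is an iterable of dicts, each with a "tags" list
    (a hypothetical format chosen for this example).
    """
    counts = Counter(tag for entry in feedback for tag in entry["tags"])
    return counts.most_common(n)


feedback = [
    {"author": "a", "tags": ["noisy-alerts", "manual-deploys"]},
    {"author": "b", "tags": ["noisy-alerts"]},
    {"author": "c", "tags": ["noisy-alerts", "manual-deploys", "cert-renewal"]},
]

for tag, count in top_pain_points(feedback):
    print(f"{tag}: mentioned {count} time(s)")
```

Even a simple count like this gives the team a shared, visible ranking to bring into planning, rather than relying on whoever raises a concern most loudly.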
Address challenges with short- and long-term work
Acting on feedback is just as important as collecting it. Because managing toil is fundamental to site reliability engineering, having both a short-term and a long-term approach for eliminating toil sets teams up for success. Your long-term approach should involve long-term project planning, so here we’ll focus on changes you can make in the short term.
Toil emerges from many sources, but the largest source of our most interrupt-heavy toil is often on-call shifts. Given this, instead of forcing on-call engineers to balance both on-call and long-term project work, empower them to take ownership of how they spend their on-call time outside of incidents.
This type of on-call shift, ‘dedicated on-call,’ not only relieves the stress of needing to manage incidents and project work at the same time, but it also communicates a trust in teammates to help make on-call better over time. It empowers developers to solve the problems that bother them most.
This doesn’t mean you can rely only on dedicated on-call to mitigate toil on your team. Not every issue or improvement will fit into an on-call shift. Dedicated on-call doesn’t work for improvements that require a lot of time or more intense collaboration, but it is an additional layer of reassurance. It’s a way of keeping your team accountable to itself.
Celebrate success
Lastly, find ways to celebrate progress and success. Eliminating or automating away toil is meaningful work. It’s work that allows your services to scale and keeps your team happier in their jobs, and it should be treated as such.