As an engineering leader, I’m often faced with the same question: how can we keep delivery speed up while protecting the health of our engineering org?
In order to build reliable systems while continually delivering value, we need to balance engineering health (our team’s ability to function successfully and maintain quality) with delivery speed. But this balance can be difficult to achieve, and there are a few common patterns that teams can fall into.
The first is firefighting, where teams are inundated with issues as a result of instability. Delivering value to customers becomes fraught with risk, and teams often pause projects to fix issues. The second pattern is when teams become bogged down, either by processes or by not taking enough risks, and customer value is delivered slowly.
Unfortunately, there isn’t a one-size-fits-all approach for balancing engineering health with delivery speed. You need to apply multiple methods to solve both sides of the problem depending on your team’s needs. Here I’m going to share nine strategies that I’ve used to balance engineering health with delivery (four for health, four for delivery, and one for both), in the hope that they’ll help you too.
Strategies for improving engineering health
Dedicated bug days
Try dedicating a day on a weekly or fortnightly basis for your team to address bugs. This technique is helpful when bugs accumulate in the backlog faster than the team can resolve them, because it gives the team a regular slot to address them in a timely manner.
Bugs can be groomed, prioritized, and marked to be addressed within a week or fortnight. For example, bugs can be given a priority rating ranging from P1 to P3. If a bug is deemed a P1, it should be addressed immediately.
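As a rough illustration, the triage step could be sketched like this. Only the rule that a P1 is addressed immediately comes from the approach above; the targets for P2 and P3, and the names `Bug`, `TARGET_BY_PRIORITY`, and `triage`, are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping: only "P1 = immediately" comes from the article.
# The P2 and P3 targets are assumptions chosen for illustration.
TARGET_BY_PRIORITY = {
    "P1": "immediately",
    "P2": "next weekly bug day",
    "P3": "bug day within a fortnight",
}


@dataclass
class Bug:
    title: str
    priority: str  # "P1", "P2", or "P3"


def triage(bugs):
    """Return (bug, target) pairs, most urgent first."""
    ordered = sorted(bugs, key=lambda bug: bug.priority)
    return [(bug, TARGET_BY_PRIORITY[bug.priority]) for bug in ordered]


# Made-up backlog for the example.
backlog = [
    Bug("Typo on settings page", "P3"),
    Bug("Checkout fails for all users", "P1"),
    Bug("Slow search results", "P2"),
]

for bug, target in triage(backlog):
    print(f"{bug.priority} {bug.title}: {target}")
```

Your own priority scale and targets will differ; the point is simply that every groomed bug leaves the session with a priority and a date attached.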
Be aware that teams will lose context when bugs require more than one day to fix. If fixing is allowed to continue beyond bug day, it can impact team capacity and delivery.
Engineering health sprints
Blocking out periods of time for your team to wind down after delivering a feature is a great way to protect team health and avoid burnout. Opt for this technique when you need to focus on the delivery of a big milestone, then carve out time afterwards for cleanup and recuperation. The advantage is that engineers are able to focus entirely on a project without distraction and then take some dedicated recovery time to breathe after a big push.
A warning: bugs and improvement tickets can accumulate during ‘focus time’ and have a negative impact on quality over time, so think carefully before you use this technique.
Incidents and support processes
I’ve worked in an environment where there wasn’t a clear incident or support process, and it created a lot of confusion and a lack of ownership. When issues occurred, folks would post in the #developers channel or directly message one person. There was no official owner, and one person was deemed the problem solver by default. They spent most of their time resolving problems, not the work they should’ve been doing. It was unclear when issues were resolved, and it set a negative example to others in the organization.
To avoid this, try setting up a system where one person is allocated to a support rotation. Create guidelines on when and how to respond to issues, escalation points, and how to communicate effectively.
For incidents, assign a severity to the incident, and find a quick fix to stop the bleeding. Post-incident, work on the long-term fix to avoid the issue occurring again. Walk through the incident with the team to review what happened (the cause and response), and identify improvement actions.
Back to the example above: after we implemented the support process, the team learned how to troubleshoot issues and respond in a timely manner, which freed the default problem solver to focus on their own work.
Operational reviews based on quality metrics
Make sure that you involve the team in monitoring the quality of the systems. The best way to do this is by teaching them to use and review these metrics:
- Bug rate: measure the rate at which bugs are raised versus resolved. When bugs are raised faster than they are resolved, the user experience suffers, which is an indication that you need to focus on quality.
- Support calls: this is a direct measure of customer experience. Tracking, prioritizing, and resolving customer issues promptly is important to address quality and churn concerns.
- Tech maintenance: if you leave this bucket unattended for a long time, you’ll be opening a hole for problems to accumulate. By the time issues arise, the team will find themselves in firefighting mode, impacting their ability to deliver value.
- Incidents and alerts: measure the number of incidents, time between incidents, or time spent to resolve them. Tracking when things go wrong is a great way to understand system health and identify underlying issues.
The goal is to understand system health, spot trends, and identify strategies to address the problems. Improvement suggestions can then be added to the tech maintenance bucket, prioritized, and allocated to upcoming sprints.
By introducing your engineers to these metrics, you help them grow their knowledge of troubleshooting, monitoring, tools, and data, and develop new skills in collaborating and proactively addressing concerns.
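To make two of these quality metrics concrete, here is a minimal sketch of how the bug rate and the incident measures could be computed. The function names and all the sample numbers are hypothetical; in practice the data would come from your own issue tracker and incident log.

```python
from datetime import datetime, timedelta


def bug_trend(raised, resolved):
    """Net backlog change per period: positive means bugs are piling up."""
    return [r - d for r, d in zip(raised, resolved)]


def mean_time_to_resolve(incidents):
    """Average duration of (started, resolved) incident pairs."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations)


def mean_time_between(incidents):
    """Average gap between consecutive incident start times."""
    starts = sorted(start for start, _ in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up sample data: bugs raised and resolved per week.
raised_per_week = [5, 7, 9, 8]
resolved_per_week = [6, 5, 4, 6]

# Made-up incident log: (started, resolved).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 0)),
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 12, 0)),
    (datetime(2024, 1, 24, 9, 0), datetime(2024, 1, 24, 11, 0)),
]

print(bug_trend(raised_per_week, resolved_per_week))
print(mean_time_to_resolve(incidents))
print(mean_time_between(incidents))
```

A run of positive numbers in the bug trend is the signal described above: bugs are being raised faster than they are resolved.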
Strategies for improving delivery
Timely reviews
When tasks are stuck in review, it negatively impacts delivery time. I learned this working with a previous team where reviews were not addressed for over 24 hours. The team was working hard, and despite callouts from other engineers during standups that their tickets needed review, everyone continued working on their own tickets. The team just wasn’t collaborating. And while tasks were stuck in review, it blocked the next available task, and therefore the whole project.
A way to combat this scenario is to prompt for timely reviews. Try using the simple Slack + GitHub integration to address the issue. When team members receive notifications that PRs are ready for review, they address reviews sooner, and the faster feedback cycle reduces the time to merge PRs.
You could also encourage pair reviews immediately after your standups, where engineers carve out time and pair up to review each other's PRs.
‘Walking the board’
This method provides a way to connect task updates during daily standups. Use the board as a visual aid to map everyone’s tasks, allowing the team to see any that are blocked or need review, as well as an overview of the feature they’re delivering.
Since using this technique, I’ve seen teams become more proactive in collaborating, identifying who has been blocked, and pairing to resolve issues, give reviews, and unblock each other.
Not only does this technique help to resolve more short-term tasks, but it also provides a visual reference for the wider project, helping the team to feel more connected to overall progress.
Milestone check-ins
Do you feel like your team is working hard but progress is slow? In this case, identify milestones upfront that define which pieces of functionality should be done and by when, and incorporate milestone check-ins into the project.
At a check-in, discuss the progress of milestones. For example, you might discover the team has hit 90% of most items but hasn’t yet completed any milestones. In this scenario, use the discussion to allow the team to reorganize themselves to complete priority items first and get the project back on track.
Delivery metrics review
Metrics allow us to track our average workflow and be proactive when anomalies appear. By teaching your team to use these delivery metrics, you can empower them to spot bottlenecks and develop approaches to combat them:
- Cycle time: use this to track how long tasks take to complete. When anomalies arise, they usually fall into common scenarios, e.g. long-running PRs, expanding refactors, slow review times, or high churn.
- PR review time: measure if reviews are being provided promptly, or whether anyone is holding back progress on a feature.
- Velocity per sprint: measure the speed at which your team is working. Each team will measure their velocity differently, using story points, time, or the number of tasks completed per sprint. The key is to be consistent and break down work appropriately.
Walk your team through these definitions and start sharing the metrics on a regular basis. Over time, you can identify patterns, spot anomalies, and implement potential solutions as a group (for example, breaking tasks into smaller chunks, clarifying the scope and definition of done, getting feedback earlier, rubber ducking, and agreeing on how much refactoring is needed).
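One simple way to spot cycle time anomalies is to compare each task against the team's typical cycle time. The sketch below flags tasks that took more than twice the median; that threshold, the function name, and the task IDs are all assumptions for illustration, not a standard rule.

```python
from statistics import median


def cycle_time_outliers(cycle_days, factor=2.0):
    """Return tasks whose cycle time exceeds `factor` times the median."""
    typical = median(cycle_days.values())
    return {
        task: days
        for task, days in cycle_days.items()
        if days > factor * typical
    }


# Made-up cycle times in days, keyed by hypothetical task IDs.
sample = {"TASK-1": 2, "TASK-2": 3, "TASK-3": 2.5, "TASK-4": 9, "TASK-5": 3}

print(cycle_time_outliers(sample))
```

The flagged tasks are a starting point for the group conversation: was it a long-running PR, an expanding refactor, or a slow review?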
…and a strategy for improving engineering health and delivery
60–30–10
In this method, an amount of time is allocated per sprint to balance delivery (60), maintenance (30), and improvements (10). This established technique allows you to balance competing priorities between product and engineering.
These numbers are an example; the ratio could also be something like 70–20–10 or 50–30–20, depending on your needs.
By allocating this time, you can address problems iteratively while addressing product goals, allowing for innovation, and maintaining a healthy system.
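The arithmetic is simple, but a small sketch shows how a sprint's capacity might be split. The function name and the capacity of 40 are hypothetical, and the unit can be whatever your team plans in (story points, hours, or days).

```python
def split_capacity(total_capacity, ratios=(60, 30, 10)):
    """Split sprint capacity into delivery, maintenance, and improvements."""
    if sum(ratios) != 100:
        raise ValueError("ratios must add up to 100")
    delivery, maintenance, improvements = (
        total_capacity * ratio / 100 for ratio in ratios
    )
    return {
        "delivery": delivery,
        "maintenance": maintenance,
        "improvements": improvements,
    }


# A made-up sprint capacity of 40 units, split 60-30-10 and then 70-20-10.
print(split_capacity(40))
print(split_capacity(40, ratios=(70, 20, 10)))
```

With a capacity of 40, the default split gives 24 units to delivery, 12 to maintenance, and 4 to improvements.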
I’ve had great results with this technique. Within nine months, my team went from spending no time working on improvements or bugs to tackling 50% of improvement tickets.
Reflections
Achieving balance can be difficult and takes time. But there are many great strategies you can implement to help you on your way. Whether it’s using data, tools, or processes, encouraging collaboration and ownership, or driving a culture of learning, try a few of these strategies and see which work for you. And remember, the goal is to empower your teams to own and make decisions on achieving balance and adjust as needed. Good luck!