DORA The Explorer: A Guide to Engineering Performance

November 2025

TL;DR: In under 7 months, I introduced DORA metrics and investment balance to a newly formed team at Unity, improving cycle time by 33%, review time by 70%, and deployment confidence across seven live services, all while building a stronger culture of trust and ownership.

I took over a new team in February with limited data and little insight into its software development productivity. The engineering team had been assembled as the result of a reorganisation and was without a leader, so I stepped in. I had little idea of what I was stepping into, and no experience running a backend cloud services engineering team, but I do not back down from challenges easily, so here goes.

A Brief History Of What Technology Has Been Assembled

Multiplayer Backend Services include Lobby, Relay, Matchmaker v2/3, Common Multiplayer Backend (the state sync service powering the Distributed Authority topology in Netcode for GameObjects), Quality of Service (QoS), Wire, Friends, and a delisted service called Realm.

The services were each engineered in near-complete isolation by their own teams. Each service has its own production environment and tech stack; some source code lives on GitLab, while other repositories are on GitHub. There are similarities, but it is infrastructure spaghetti, and that presents a real challenge to engineering velocity.

What Was Measured First

When you start measuring DORA, the temptation is to measure everything at once; it is exciting as well as overwhelming, so stick to your original intent for adopting DORA and ignore the rest. You will be curious, so discipline is key here. I resisted the urge entirely and started with investment balance, which isn't DORA per se.

With only 6 engineers covering a wide scope of services, knowing where the team was spending most of their engineering time was the challenge. I needed to know how much of their work was growth feature development, versus fixing customer bugs and incidents, versus sustainability practices and technical debt reduction. The team's size and scope are significant factors in how tightly I manage priorities, build an engineering roadmap, and improve developer productivity.

My hunch was that there was an unhealthy skew towards sustainability and support, with minimal to no time for growth feature work. In an ideal world, you want the team to have a healthy balance where prioritising customers comes first, but in a sustainable way, freeing up time for the exciting feature work that customers actually want for their production environments.

Doing all of this without some kind of platform to gather those insights felt like a substantial undertaking. I went with a tool called Swarmia to provide the insights required to validate my hunch and, as a next step, to build a strategy and deliver customer value faster.

Sustainability engineering was indeed higher, as predicted, but over Q2 and Q3 we shifted to a prudent and intentional strategy, targeting specific sustainability efforts. This allowed us to plan for future growth work in Q4 and Q1 while significantly reducing bottlenecks within our infrastructure and processes. Without it, we would risk piling growth on top of fragile foundations, increasing the support burden on an already stretched team.

We categorise based on Jira issues. Epics in Jira have a required category field (e.g. Growth, Support or Sustainability); this syncs nicely with GitHub, and Swarmia presents the data elegantly as a result. A rough sketch of how that roll-up works is below.
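
For illustration, here is a minimal sketch of the kind of roll-up Swarmia does for us: group completed work by its epic's category and turn it into a percentage split. The record fields below are assumptions made up for the example, not Jira's or Swarmia's actual schema.

```python
# Minimal sketch: roll completed Jira issues up into an investment-balance split.
# The record shape (epic_category, time_spent_days) is a made-up illustration,
# not Jira's or Swarmia's actual export format.
from collections import defaultdict

CATEGORIES = ("Growth", "Support", "Sustainability")

def investment_balance(issues):
    """Return the percentage of effort spent per category."""
    totals = defaultdict(float)
    for issue in issues:
        if issue["epic_category"] in CATEGORIES:
            totals[issue["epic_category"]] += issue["time_spent_days"]
    grand_total = sum(totals.values()) or 1.0  # avoid division by zero
    return {cat: round(100 * totals[cat] / grand_total, 1) for cat in CATEGORIES}

# Example with made-up numbers:
issues = [
    {"epic_category": "Sustainability", "time_spent_days": 14},
    {"epic_category": "Support", "time_spent_days": 8},
    {"epic_category": "Growth", "time_spent_days": 5},
]
print(investment_balance(issues))
# {'Growth': 18.5, 'Support': 29.6, 'Sustainability': 51.9}
```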

Moving in Parallel: Adopting DORA

In parallel, I began introducing DORA metrics as a more structured way to measure how work actually flowed through the team. Investment balance told me where we were spending our time, but DORA helped me understand how effectively that time was being converted into customer value.

I deliberately started small. Rather than chasing every DORA metric immediately, we focused on two, with another three firmly in the back of my mind (a rough sketch of how these two can be derived from raw pull request data follows the list):

Cycle time - How long it really took from first commit to production.
Time to first review - How long a pull request waited for its first review.
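
To make the definitions concrete, here is a minimal sketch of how both metrics can be derived from raw pull request timestamps. The field names are assumptions for the example, not Swarmia's model or the GitHub API schema.

```python
# Minimal sketch: derive cycle time and time to first review from PR timestamps.
# Field names (first_commit_at, opened_at, first_review_at, deployed_at) are
# assumptions for illustration, not Swarmia's model or the GitHub API schema.
from datetime import datetime, timedelta
from statistics import mean

def cycle_time(pr) -> timedelta:
    """First commit to production deployment."""
    return pr["deployed_at"] - pr["first_commit_at"]

def time_to_first_review(pr) -> timedelta:
    """PR opened until its first review lands."""
    return pr["first_review_at"] - pr["opened_at"]

def average_hours(deltas) -> float:
    return round(mean(d.total_seconds() for d in deltas) / 3600, 1)

# Example with made-up timestamps:
prs = [{
    "first_commit_at": datetime(2025, 6, 2, 9, 0),
    "opened_at": datetime(2025, 6, 2, 15, 0),
    "first_review_at": datetime(2025, 6, 3, 10, 0),
    "deployed_at": datetime(2025, 6, 4, 1, 0),
}]
print(average_hours(cycle_time(pr) for pr in prs))            # 40.0
print(average_hours(time_to_first_review(pr) for pr in prs))  # 19.0
```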

A Snapshot: Cycle Time in June

To make this real, here’s one example from June 2025. At that point, our average cycle time (from first commit to production) was around 40 hours per pull request. That represented a 33% improvement compared to earlier in the year.

What’s interesting here isn’t just the number itself, but what it revealed. The data showed that most work was moving through the system in under two days, with only occasional spikes. For a team still wrestling with fragmented infrastructure and heavy sustainability work, that was a positive sign: we were making meaningful progress on predictability.

The team could now see the trendline and acknowledge that improvements were compounding. Instead of foggy "things feel slow" impressions, we had a measurable signal that told us where we stood and whether we were trending in the right direction.

In Swarmia, teams can set working agreements. We agreed a cycle time target of under three days; conservative, I admit, but a target we can work from and adjust as productivity matures over time.
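
Swarmia tracks working agreements natively, but as an illustration of the idea, here is a standalone sketch that flags pull requests falling outside a three-day cycle-time agreement; the PR fields are again assumptions made up for the example.

```python
# Sketch: flag PRs that breach the cycle-time working agreement (here, under 3 days).
# Swarmia tracks working agreements itself; this standalone check is purely
# illustrative, and the PR fields are assumptions.
from datetime import datetime, timedelta

CYCLE_TIME_TARGET = timedelta(days=3)

def breaches(prs, target=CYCLE_TIME_TARGET):
    """Yield (pr_number, cycle_time) for PRs slower than the agreed target."""
    for pr in prs:
        elapsed = pr["deployed_at"] - pr["first_commit_at"]
        if elapsed > target:
            yield pr["number"], elapsed

prs = [
    {"number": 101, "first_commit_at": datetime(2025, 6, 2, 9, 0),
     "deployed_at": datetime(2025, 6, 4, 1, 0)},   # 40 hours - within target
    {"number": 102, "first_commit_at": datetime(2025, 6, 2, 9, 0),
     "deployed_at": datetime(2025, 6, 9, 9, 0)},   # 7 days - breach
]
for number, elapsed in breaches(prs):
    print(f"PR #{number} took {elapsed}, outside the <3d agreement")
```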

Can you review my PR? Can you review my PR? Everybody review my PR!!

PR review time is a metric I want to understand as quickly as possible when leading a new team. I want to know where bottlenecks occur, and if reviews are problematic, that is relatively quick to determine and act on immediately.

At the beginning, the time to first review was inconsistent, ranging from a few hours to over a day. That gave us an opportunity to improve, which we did: by June, the average had come down to 8.3 hours, a 70% improvement. The odd spike still occurs, so we have to stay prudent, but the overall trend is impressive.

The team's target is a first review within two days. Setting a target helps drive culture and accountability across the group. Reviewing PRs is very much about teamwork and collaboration, and about suppressing the ego. We discussed our experiences as a team, and many good outcomes came out of it.

Measuring Production

Cycle time and time to first review gave us clarity on how quickly code moved through development, but I also wanted to know what happened once that code reached production. Three more metrics cover that (a rough sketch of how they can be computed from deployment records follows the list):

Deployment frequency - How often changes are safely landing in production.
Change failure rate - How often a deployment required a hotfix, rollback, or patch.
Mean time to recovery - The average time between a failed deployment and its fix.
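
As a rough illustration, here is a sketch of how all three can be computed from a flat list of deployment records; the record shape is an assumption for the example, not how Swarmia models deployments.

```python
# Sketch: compute deployment frequency, change failure rate, and MTTR from a flat
# list of deployment records. The record shape is an assumption for illustration,
# not Swarmia's data model.
from datetime import datetime

def production_metrics(deployments, weeks):
    failures = [d for d in deployments if d["failed"]]
    frequency = len(deployments) / weeks
    cfr = 100 * len(failures) / len(deployments) if deployments else 0.0
    recovery_hours = [
        (d["recovered_at"] - d["deployed_at"]).total_seconds() / 3600
        for d in failures
    ]
    mttr = sum(recovery_hours) / len(recovery_hours) if recovery_hours else 0.0
    return round(frequency, 1), round(cfr, 1), round(mttr, 1)

# Example with made-up records covering one week:
deployments = [
    {"deployed_at": datetime(2025, 6, 2, 10), "failed": False, "recovered_at": None},
    {"deployed_at": datetime(2025, 6, 3, 14), "failed": True,
     "recovered_at": datetime(2025, 6, 3, 20)},
    {"deployed_at": datetime(2025, 6, 5, 9), "failed": False, "recovered_at": None},
]
print(production_metrics(deployments, weeks=1))  # (3.0, 33.3, 6.0)
```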

We started measuring deployment frequency in June. So far, we’ve made 61 deployments year-to-date, averaging around 3.4 per week across our services.

At first glance, that sounded healthy for a team of our size and scope, but the data revealed some interesting patterns. One unexpected bottleneck was Dependabot changes. Unlike incidents or feature work, these updates often sat idle for weeks after being merged. In some cases, they weren’t deployed for up to 90 days, which inflated the numbers and hid the real delivery rhythm.

Once we separated those automated dependency updates from customer-facing deploys, a clearer picture emerged. We realised we were shipping product changes at a much steadier cadence than it first appeared, but our process for lower-priority technical updates was dragging behind.
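
As an illustration of that split, here is a minimal sketch that partitions merged pull requests by author, which is one simple heuristic for spotting automated dependency updates; the record fields and usernames are made up for the example.

```python
# Sketch: separate automated dependency updates from customer-facing changes before
# computing delivery metrics. Matching on the PR author is one simple heuristic;
# the record fields and usernames below are made up for illustration.
BOT_AUTHORS = {"dependabot[bot]", "renovate[bot]"}

def split_deploys(merged_prs):
    """Partition merged PRs into (customer_facing, dependency_updates)."""
    product, dependencies = [], []
    for pr in merged_prs:
        (dependencies if pr["author"] in BOT_AUTHORS else product).append(pr)
    return product, dependencies

merged_prs = [
    {"number": 210, "author": "dependabot[bot]"},
    {"number": 211, "author": "example-engineer"},
    {"number": 212, "author": "dependabot[bot]"},
]
product, dependencies = split_deploys(merged_prs)
print(len(product), len(dependencies))  # 1 2
```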

The insight led to a simple but meaningful change in our approach: it reminded us that smaller, more frequent deployments are safer and less risky. That helped restore predictability, and longer term it has us rethinking the time-consuming manual process of reviewing each Dependabot PR.

Just as importantly, the visibility changed behaviour. Engineers could now see the flow of releases week by week, and it encouraged a cultural shift towards “ship it small and often” rather than batching changes.

Change Failure Rate and Recovery

Now we were beginning to form a more complete picture, and naturally change failure rate (CFR) and mean time to recovery (MTTR) were the missing puzzle pieces on the production side. These metrics weren't measured in isolation; together they gave us a continuous, real-time view of delivery.

The CFR averaged 3.3% from June 1st to the end of September. That's a solid indicator of stability, given the scope of services the team supports and the volume of deployments running. The occasional spikes weren't symptoms of widespread issues but rather concentrated incidents in specific areas; a GCP outage is one example that immediately springs to mind.

Each one became an opportunity to deepen our understanding of service reliability and to strengthen testing and monitoring practices. A few examples: improving runbooks, and adding automation that restarts a service when a certain condition is met, rather than an engineer doing it manually at 3am.
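
To illustrate the "restart automatically when a condition is met" idea, here is a hedged sketch of a simple watchdog. In practice this often lives in a Kubernetes liveness probe or alert-driven automation rather than a script, and the health endpoint and restart command below are placeholders, not our actual setup.

```python
# Sketch of the "restart automatically when a condition is met" idea. In practice
# this is often a Kubernetes liveness probe or alert-driven automation; the health
# endpoint and restart command here are placeholders, not our actual setup.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"          # placeholder endpoint
RESTART_CMD = ["systemctl", "restart", "my-service"]  # placeholder command
FAILURE_THRESHOLD = 3                                 # consecutive failures before restarting

def healthy(url=HEALTH_URL, timeout=5) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def watch():
    failures = 0
    while True:
        failures = 0 if healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            subprocess.run(RESTART_CMD, check=False)  # no 3am page for a human
            failures = 0
        time.sleep(30)
```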

The MTTR averaged 9.6 hours. In most instances, recovery was much quicker; the longer incidents skewed the average, but the main point was visibility. We could finally see, through data, how rapidly we responded to issues and where recovery depended on specific individuals rather than automation.

Results & Reflections Year To Date

What I Set Out to Do

When I stepped into this team, my goal was simple but ambitious: to bring visibility, predictability, and balance to how the team delivers value. I didn't want to measure for measurement's sake; I wanted meaningful insights that would help us prioritise, reduce bottlenecks, and build an elite engineering culture founded on trust and growth.

What We Did

To achieve this, I introduced a framework grounded in investment balance and DORA metrics. I led the adoption of Swarmia to create visibility, but the success came from how the team engaged with the data and made it part of the culture.

We approached improvement on multiple fronts, all running in parallel:

  • Investment balance gave us a view of where the time was going: growth, sustainability, or support.

  • Cycle time and review time helped us understand how efficiently code flowed through development.

  • Deployment frequency, change failure rate, and mean time to recovery revealed what was happening once work reached production.

By tracking all of these concurrently, we built a full feedback loop. Improvements in one area influenced the others. Faster reviews shortened cycle times; smaller, safer deploys reduced failure rates; and increased deployment confidence made recovery processes more consistent.

What We Achieved

Across the year to date, the results tell a clear story:

  • Cycle time reduced by ~33%, showing faster, more predictable delivery.

  • Time to first review dropped by 70%, improving flow and responsiveness.

  • Deployment frequency stabilised at 3–4 deploys per week with high confidence.

  • Change failure rate remained low at around 3%, with fast, visible recoveries when issues did occur.

Those metrics reflect measurable progress, but the real impact was cultural, and that’s what I’m most proud of. Culture should be at the heart of any elite performing team: one built on trust, shared ownership, and the freedom to learn and improve without fear.

I am also pleased to say that more Unity Engine engineering teams are now trialling DORA through Swarmia, and I am very proud to have potentially started something bigger.

© Christopher Pope
