
Software Engineering Metrics: The Advanced Guide

A list of the 22 most-used software engineering metrics and when to use them.

Introduction

Software engineering is one of the pillars that allows an organization to become a top performer in its industry. That’s why engineering efficiency has become one of the biggest challenges facing enterprises, ranking even above access to capital, according to a Stripe study. As a result, a lot of investment has gone into new tools and methodologies that help engineering teams build better software faster.

Effective engineering teams see themselves as complex, interdependent systems with a relentless focus on improving their process and output. The best teams keep track of their improvements through a set of chosen indicators, which we call software engineering metrics. The teams that don’t will see their productivity plateau quickly, and even deteriorate, as the organization scales.

This guide is for developers and for engineering and product leaders who are trying to improve their product development process and scale their organization.

We’ll keep this ebook updated with new best practices, as our goal is for it to become your reference whenever you need to rethink the software metrics that your team tracks.

Let’s begin.
01.

What are software engineering metrics?

In software, there are two categories of metrics, and they go by different names:
  1. Software engineering metrics, also known as software development metrics or software delivery performance; every team seems to have its own name for them. What matters is that these indicators measure how the software is being built and how productive the engineering team is.
  2. Software or application performance metrics are metrics of the delivered software itself: application response time, and so on. New Relic is typically one of the main providers of such metrics. You could also include customer satisfaction metrics in this category (typically Net Promoter Score).

In this guide, we’re focusing on the first set of metrics. Those are the ones that help an organization scale, and they will ultimately impact the second set, which is one of the end results of the team’s work. To understand software engineering metrics, we need to understand the main goals behind tracking and analyzing them:
  • Determine the quality and productivity of the current software delivery process, and identify areas of improvement;
  • Predict the quality and progress of the software development projects;
  • Better manage workloads and priorities between teams and team members.

Overall, engineering metrics will help teams increase the engineering return on investment, by offering an assessment of the impact of every decision on projects, processes and performance goals.
02.

The rules for tracking software engineering metrics efficiently

Unfortunately, you cannot use just any software metric that might make sense to you at first glance. Some metrics simply don’t make sense and aren’t representative of anything useful for improving your team’s productivity. Others are outright toxic, driving talent to leave the team without ever improving performance. Every team needs to be careful about which metrics to track and how to track them. We’ll go over the full list of metrics later; in this section, we cover a few rules for using them efficiently, so they have the right impact on your team and organization.
02.1

Software metrics should be easily understandable.

They should be:
  • Simple and computable
  • Consistent and unambiguous (objective)
  • Applied with consistent units of measurement
  • Independent of programming languages
  • Easy to calibrate and adaptable

This is why software development platforms that automatically measure and track metrics are important. 
02.2

Link software metrics to business priorities.

You can measure almost anything, but you can’t pay attention to everything. The more metrics you track at the same level of importance, the less importance you give them. 

It’s important to select the few that are most relevant to your current business priorities. Measure only what matters now. You could, for instance, have P1 metrics and P2 metrics: your team focuses on improving the P1s while simply maintaining the P2s. It all depends on your current business goals. This also implies that some metrics may only be tracked temporarily, until a goal is achieved. For instance:
  • Reducing the number of bugs reported;
  • Increasing the number of software iterations;
  • Speeding up the completion of tasks.

However, management needs to involve the software development teams in establishing these goals, so everyone can help choose the most relevant metrics and then align behind them. You want the team’s buy-in on the metrics you track.

One last point: business success metrics drive software improvements, not the other way around.
02.3

Track trends, not just numbers.

When a software metric target is met, some teams will declare success. But a single data point doesn’t offer much information on how the metric is trending. It is the trend that will show you what effect any process change has on progress.

02.4

Set shorter measurement time frames.

Most sprint retrospectives in agile teams happen on a weekly or bi-weekly basis. So your measurement windows need to be shorter than these time frames, so that your team has enough data to iterate at each sprint, and even within each sprint.

02.5

Stop using software metrics that do not lead to change.

Comparing snowflakes is waste. Make sure the metrics you track can be acted upon, or the team will grow bored and lose interest in them. Measuring them will just dilute the importance of the other metrics.

If you can’t move a metric you thought was actionable, reconsider whether the issue lies in the solution you’re iterating on, or in the metric itself, which might need readjusting.
02.6

Metrics alone cannot tell you the story.

Metrics don’t tell you why a trend goes one way or another; only the team can. Metrics should only ever be a discussion starter.

Many average managers make the mistake of evaluating contributors based on a set of metrics, without further discussion. Always strive to understand the ‘why’ by discussing with the people involved; only then will you know what the metrics actually mean.
02.7

Don’t use any metrics to judge individual performance.

Resist the temptation to make any judgement about individual performance based on metrics. Metrics help you ask questions to understand what really happened, and therefore to better understand the intricacies of the project and of managing a team.

The reality is that management is hard and always contextual. You need to dig deeper to understand the root causes of issues. Sometimes you will find that a developer is indeed a poor performer, but it’s because you made the effort to know your team and understand how they work together that you can identify whether one member is dragging it down. This is how you become a better manager who works towards better productivity and retention within your team.
03.

Productivity metrics

These metrics are the most controversial, because so many people have learned to hate agile story points. We list the most widely used ones in the market, but we’ll also spend time explaining how and when to use them.

03.1

Project or Sprint Burndown

This metric is more about project status than productivity per se, but it is related to the team’s output and therefore belongs here. It is the metric all teams already track. Let’s dive a bit deeper into it so you can see an alternative way to look at it that may make more sense for your team and projects.

Some teams only consider the number of tasks to be done. But that assumes all tasks are equal, which is simply not the case. Some teams consider story points instead, which should indeed be more precise; you would need to assign points to all kinds of tasks, including bugs. But story points are still rough estimations: a 5-point story might take longer to implement than 20 one-point stories!

The best way to do this is to look at your team’s history and compare the progress on the current release to previous ones. That gives you a better indication of whether your team is on time or not.
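As a rough illustration, here is a minimal Python sketch of that comparison; all numbers and release data are hypothetical, and in practice they would come from your issue tracker:

```python
# Hypothetical remaining story points per working day, for a previous release
# of similar scope and for the current (still in-progress) release.
previous_release = [80, 74, 70, 61, 55, 48, 40, 30, 18, 6, 0]
current_release = [78, 75, 68, 64, 60]

for day, remaining in enumerate(current_release):
    delta = remaining - previous_release[day]
    status = "behind" if delta > 0 else "ahead of"
    print(f"Day {day}: {remaining} points remaining, "
          f"{abs(delta)} points {status} the previous release")
```
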
When to use it?

The two most important questions for any manager are whether the team will be on time for the next milestone or sprint, and what the risks are of being late or of delivering poor quality. So this metric is essential, even for the team itself. It is also a great way to understand how your teams work, which is why being able to compare with previous releases will significantly help you get to know your team.

03.2

Ticket close rate - Beware!

Ticket close rate is the number of stories or story points that your team, or any contributor, closed during a certain period of time (most commonly a sprint).

If you don’t include story points or some equivalent, this might be the most misleading metric you could use. Using this metric assumes that all tickets require approximately the same amount of work, and that is just not true. You should never use this metric to evaluate the individual performance of developers. A developer could fix one bug that nobody else managed to solve, one that is impacting every aspect of your product’s performance, and it could take him or her a full week. In the meantime, another developer could fix 20 small, low-impact bugs. Which one had the most impact on your team and company?

Even with story points, in the scenario above the hard bug might be worth 5 points while the 20 small bugs are worth 1 point each. And that is without considering that most teams don’t assign points to bugs, only to features.
When to use it?

You could use this metric to identify issues, like a developer being stuck on a specific task. The point is NOT to use it as a way to evaluate the developer’s performance, or your team will just game the metric without producing any meaningful work. Use this metric only as a way to understand how you, as the manager, can better help your team and initiate meaningful conversations. This metric also enables you to assess the “normal” speed of your team: across time and team members, the discrepancies between story points and actual complexity should iron themselves out.

03.3

Lines of Code (LoC) Produced or Impact - Beware!

Using the same example as for ticket close rate: the huge bug fix could be a one-line change. How can you compare that to a developer who imported a library or changed every header of every file? You cannot (or you should not). And similarly, you should never use this metric to evaluate developers’ individual performance.

You can use LoC in the same way – to understand when your team is having difficulties, or perhaps importing too many libraries, to the detriment of the project’s quality!

Some people compute an “Impact” metric based on the amount of code in the changes, the severity of those changes, and the number of files affected. The overall goal is to offer an improvement over LoC. The issue is that you still don’t know the actual content of those lines of code, so the “Impact” metric should be used in the same way as LoC. Indeed, it still fails on the one-line bug fix example mentioned above, as it does on many other real-life cases.

When to use them?

This is a hard question. Anything related to lines of code can’t be linked with actual developer productivity. You could use it as a secondary way to check if somebody is stuck and then to initiate conversations to help those people, but that’s it. You should NOT use it to measure velocity, even across time.

03.4

Code Churn

Code churn is typically measured as the percentage of a developer’s code that represents an edit to their own recent work. To compute it, measure the lines of code (LoC) that were modified, added and deleted over a short period of time, such as a few weeks, and divide by the total number of lines of code added.

Engineers often test, rework, and explore various solutions to a problem – especially towards the beginning of a project, when the problem doesn’t have a clear solution yet. Some people consider code churn to be non-productive work, and this is where the danger lies. Common causes of high churn include unclear requirements, indecisive stakeholders, a difficult problem to solve, and prototyping or polishing for a release. Churn may also indicate that a developer is optimizing part of the code for better performance. It’s completely normal for churn to evolve over the course of a project.

Churn levels will vary between teams, individuals, types of projects, and where those projects are in the development lifecycle. It is helpful to get a feeling for what your team’s “normal” looks like so you can identify when something is amiss. A sudden increase in churn rate may indicate that a developer is experiencing difficulty solving a particular problem.
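As a minimal sketch of the computation, assuming you can already extract per-commit line counts from your version control tooling (the numbers below are hypothetical):

```python
# Hypothetical per-commit line counts over a few weeks.
# "churned" = lines that modified or deleted the author's own recent code.
commits = [
    {"added": 120, "churned": 15},
    {"added": 40, "churned": 30},
    {"added": 200, "churned": 10},
]

total_added = sum(c["added"] for c in commits)
total_churned = sum(c["churned"] for c in commits)

# Churn expressed as a percentage of the lines added in the same window.
churn_rate = 100 * total_churned / total_added if total_added else 0
print(f"Code churn: {churn_rate:.1f}%")
```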

A tool like GitPrime could provide an industry average of what “normal” looks like for efficient teams and for less efficient ones.
When to use it?

Code Churn is really useful only when its level unexpectedly moves materially above or below the individual’s or team’s ‘normal’. In that case, it may show a problem you should concern yourself with, especially nearing a deadline as the project may be at risk.

03.5

Refactoring Rate

A common question for CTOs is ‘How much of your software engineering investment is spent on refactoring legacy code?’. 

There are many ways you could try to measure refactoring. One is through commits, if you consider refactoring to be the replacement of old code – for instance, code older than 3 weeks. In that case, you could define the refactoring effort as the percentage of changed lines of code that replace old code, out of the total number of lines changed.

The issue is that every approach you can think of will eventually fall short of truthfully measuring refactoring. That doesn’t mean the metric described above is not useful: consider it an indicator and track its trend.

As codebases age, some percentage of developer attention is required to maintain the code and keep things current. The challenge is that teams need to properly balance this kind of work with creating new features. Keeping note of the trend and the team’s ‘normal’ will help you do that.
When to use it?

Even though it’s very hard to compute actual refactoring, having some indicator and tracking its trend is very helpful for understanding the team’s ‘normal’ and ensuring that you put enough effort into refactoring, which is essential for any software.

03.6

New Work

New work is defined as the lines of code added to the code base that do not replace any existing code. This metric can be computed as the percentage of new code out of the total number of lines of code changed. In that sense, it is complementary to code churn and refactoring.
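To make that complementarity concrete, here is a hedged sketch of the three-way breakdown; how a changed line gets attributed to a bucket (for instance, “replaces code older than 3 weeks” counting as refactoring) is left to your own git tooling, and the counts below are hypothetical:

```python
# Hypothetical classification of every changed line during a sprint.
changed_lines = {"churn": 180, "refactoring": 320, "new_work": 1500}

total = sum(changed_lines.values())
for bucket, count in changed_lines.items():
    print(f"{bucket}: {100 * count / total:.1f}% of changed lines")
```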

It’s bad to have high technical debt, but it’s even worse to have a stagnant product.
When to use it?

One way to understand your team’s coding effort is to measure code churn, refactoring and new work together. Keeping an eye on this trend over time will help you understand where your team’s coding effort actually goes. Depending on the stage of your project, the breakdown between those three metrics will or will not make sense.

04.

Process metrics

These metrics show the performance of your team’s processes and software development workflow. They are not the output of the engineering team, but indicators of the health of your team’s collaboration, which will directly impact that output.

04.1

Lead Time and Cycle Time

Lead time is the time period between the beginning of a project’s development and its delivery to the customer. Your software development team’s lead time history can help you predict with a higher degree of accuracy when an item might be ready. This data is useful even if your team doesn’t provide estimates, since the predictions can be based on the lead times of similar projects.

If you want to be more responsive to your customers, work to reduce your lead time, typically by simplifying decision-making and reducing wait time. Lead time includes cycle time.

Cycle time describes how long it takes to change the software system and implement that change in production. Teams using continuous delivery can have cycle times measured in minutes or even seconds instead of months.
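As a minimal sketch, assuming you record timestamps for when an item was requested, when implementation started, and when the change reached production (all dates below are hypothetical):

```python
from datetime import datetime

# Hypothetical timestamps for a single work item.
requested = datetime(2019, 3, 1, 9, 0)      # item requested / enters the backlog
work_started = datetime(2019, 3, 4, 14, 0)  # implementation begins
deployed = datetime(2019, 3, 6, 17, 30)     # change is live in production

lead_time = deployed - requested       # what the customer experiences
cycle_time = deployed - work_started   # the engineering part, included in lead time

print(f"Lead time:  {lead_time}")
print(f"Cycle time: {cycle_time}")
```
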
When to use them?

If your priority is to implement continuous delivery, or to make your process leaner and deploy smaller batches to production more frequently, these two metrics will be very useful. Within lead time, you can also dig a bit deeper to understand where most of the time is spent.

04.2

Deployment Frequency 

Tracking how often you deploy is a good DevOps metric. Ultimately, the goal is to make smaller deployments as often as possible. Reducing the size of deployments makes them easier to test and release.

How often you deploy to QA or pre-production environments is also important. You need to deploy early and often in QA to ensure time for testing. Finding bugs in QA is important to keep your defect escape rate down. But you might want to count production and non-production deployments separately.
When to use it?

This metric is a good complement to lead and cycle times, in the sense that it shows their results.

04.3

Commit Frequency or Active Days

Commit frequency and active days serve the same purpose. An active day is a day in which an engineer contributed code to the project, which includes specific tasks such as writing and reviewing code.

These two alternative metrics are interesting if you want to introduce committing every day as a best practice. They are also a great way to see the hidden costs of interruptions. Non-coding tasks such as planning, meetings, and chasing down specs are inevitable; teams often lose at least one day each week to these activities. Monitoring commit frequency enables you to see which meetings affect your team’s ability to push code. It’s important to keep in mind that pushing code is the primary way your team provides value to your company.

Managers should strive to protect their team’s attention and ensure process-overhead does not become a burden.
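As an illustration, assuming you can list commit dates per engineer from your version control system (the dates below are hypothetical), active days per week can be counted like this:

```python
from collections import defaultdict
from datetime import date

# Hypothetical commit dates for one engineer over two weeks.
commit_dates = [
    date(2019, 4, 1), date(2019, 4, 1), date(2019, 4, 3),
    date(2019, 4, 8), date(2019, 4, 9), date(2019, 4, 11), date(2019, 4, 12),
]

active_days = defaultdict(set)
for d in commit_dates:
    year, week, _ = d.isocalendar()
    active_days[(year, week)].add(d)  # a day counts once, however many commits

for (year, week), days in sorted(active_days.items()):
    print(f"Week {week} of {year}: {len(days)} active days")
```
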
When to use them?

Have you heard “Commit Often, Perfect Later, Publish Once”? If you fail to commit and then do something poorly thought out, you can run into trouble. Commits are the common denominator for collaboration within your team. So if you are pushing your workflow towards committing more often than the team currently does, it might be useful to track this metric. And, as mentioned above, if you want to understand the impact of interruptions, this metric can be a good starting point.

04.4

Pull Request-Related Velocity

There are several pull request metrics that could be interesting to you:

  • the number of pull requests opened per week
  • the number of pull requests merged per week
  • the average time to merge. An alternative is the percentage of pull requests merged under a certain time. This is roughly equivalent to cycle time (the time it takes for code to go from commit to deploy; in between, it may go through testing, QA, and staging, depending on your organization). It’s a very interesting metric that shows you which roadblocks you’re encountering in your workflow. A minimal sketch of how to compute it follows this list.
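Below is a small sketch of computing the average time to merge and the share of pull requests merged within a target window; the timestamps are hypothetical and would in practice come from your code hosting platform:

```python
from datetime import datetime, timedelta

# Hypothetical (opened_at, merged_at) pairs for last week's merged pull requests.
pull_requests = [
    (datetime(2019, 5, 6, 9, 0), datetime(2019, 5, 6, 15, 0)),
    (datetime(2019, 5, 7, 10, 0), datetime(2019, 5, 9, 11, 0)),
    (datetime(2019, 5, 8, 14, 0), datetime(2019, 5, 8, 18, 30)),
]

durations = [merged - opened for opened, merged in pull_requests]
average = sum(durations, timedelta()) / len(durations)
within_24h = sum(1 for d in durations if d <= timedelta(hours=24))

print(f"Average time to merge: {average}")
print(f"Merged within 24 hours: {100 * within_24h / len(durations):.0f}%")
```
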
When to use it?

These metrics can give you a sense of your engineering team’s steady throughput. For instance, if that number doesn’t grow when you hire more people, there might be a problem related to a new process in place or technical debt that needs to be addressed. However, if it increases too quickly, you might have a quality issue.

04.5

Work In Progress (WIP)

Work in progress is the total number of tickets that your team has open and is currently working on. It is an objective measure of a team’s speed, similar to throughput, but as a real-time indicator (rather than a lagging one).

This metric is helpful for understanding a team’s current workload as a trend. Ideally, the number will stay stable over time, as an increase in WIP means that your team is facing blockers/bottlenecks that aren’t getting addressed (unless you added team members, of course). WIP is also a method for identifying inefficient processes.

You might also consider dividing WIP by the number of contributors to get an average WIP per developer. Ideally, this number will be close to a one-to-one ratio.
When to use it?

This metric helps to avoid burnout and increase efficiency, as working on one thing at a time has been shown to improve focus.

04.6

Commit or Pull Request Risks

You can determine the risks in a commit or pull request by:

  • The amount of code in the change
  • What percentage of the work is edits to old code
  • The surface area of the change (think ‘number of edit locations’)
  • The number of files affected
  • The severity of changes when old code is modified
  • How this change compares to others from the project history

Together, these factors show how much reflection and work went into a commit, and therefore its potential impact on the product if it were deployed without code review.
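There is no standard formula here; purely as an illustration, the factors above could be folded into a simple weighted score. All weights, normalization constants and the function name below are hypothetical:

```python
# Purely illustrative heuristic: the weights and caps are hypothetical,
# not an industry standard.
def change_risk(lines_changed, old_code_edit_ratio, files_affected, edit_locations):
    score = 0.0
    score += min(lines_changed / 500, 1.0) * 0.4     # amount of code in the change
    score += old_code_edit_ratio * 0.3               # share of edits to old code
    score += min(files_affected / 20, 1.0) * 0.2     # number of files affected
    score += min(edit_locations / 50, 1.0) * 0.1     # surface area of the change
    return score  # 0 = low risk, 1 = high risk

print(change_risk(lines_changed=320, old_code_edit_ratio=0.6,
                  files_affected=12, edit_locations=25))
```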

When to use it?

Monitoring an average commit or pull request risk helps you understand how your team works, and whether you should strive for more frequent and simpler code changes.

05.

Quality metrics

05.1

Number of Bugs - possibly by priority or severity

The number of bugs will, in general, start increasing in the middle of a project’s lifecycle. A few days or weeks before the deadline (depending on the size of the project), the team will focus on reducing the number of bugs, until that number reaches a kind of asymptote. This asymptote ends up representing the overall quality of the product. So tracking the overall number of bugs (distinguishing their priorities) is a good indicator.

However, not all bugs are equal. That’s why most teams assign a priority and/or severity to bugs. It could be interesting to track P1 bugs and P2 ones for instance, and only those. This depends on the maturity level of your product. For a new product, you will want to stay focused on P1s. Indeed, a product with lots of P1s will be perceived as just not working. 

If you don’t have any P1s anymore but still have P2s, users will still experience those bugs, and that might negatively impact their perception of the product.

If you want your product to be perceived as high-quality – at the level of Apple, for example – that only happens once you have also tackled a lot of the P3 and P4 bugs. This is the level you need to reach. So focus on tracking only what matters to you now.
When to use them?

This metric is very helpful if product quality is important to your business, and if so, you should track it constantly. However, once all your P1s and P2s are resolved, you might want to aim for a higher quality standard by tracking P3s, for instance.

05.2

Change Failure Percentage

Accelerate defines a failure as a change that “results in degraded service or subsequently requires remediation (e.g., leads to service impairment or outage, requires a hotfix, a rollback, a fix-forward, or a patch)”. So this rate is the number of deploys resulting in a failure, divided by the total number of deploys. Note that this definition does not include changes that failed to deploy. That information is useful, but it is not this KPI’s focus.
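A minimal sketch of the computation, assuming you log whether each production deploy later needed remediation (the log below is hypothetical):

```python
# Hypothetical deploy log: True means the deploy degraded service
# or later needed remediation (hotfix, rollback, fix-forward, patch).
deploy_needed_remediation = [False, False, True, False, False, False, True, False]

failures = sum(deploy_needed_remediation)
change_failure_rate = 100 * failures / len(deploy_needed_remediation)
print(f"Change failure percentage: {change_failure_rate:.1f}%")  # 25.0%
```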

When to use it?

If you focus on turning frequent deployments into an everyday habit, you need to keep the failure rate low for this to have value. In fact, this rate should decrease over time, as the experience and capabilities of the DevOps teams increase. An increasing failure rate, or one that is high and does not go down over time, is an indication of problems in the overall DevOps process. It is a good proxy metric for quality throughout the process.

05.3

Pull Request Quality

Pull requests can give you great visibility on the overall complexity of the code base. The more complex the code base is, the higher the chances the following metrics will be high:

  • the percentage of times pull requests break the build or fail to pass the test suite; 
  • the percentage of merged vs rejected pull requests; 
  • the number of comments per pull request – you don’t want a number that’s too low, but you also don’t want one that is too high. These metrics show how your team collaborates and whether your pull requests draw enough attention. They can be an indirect indication of the quality of the code pushed to production.
When to use them?

Unlike change failure percentage, this metric is not about measuring the quality of the DevOps process, but about how your team works and collaborates. How are code reviews used, and are they useful? Measuring the evolution of merged versus rejected pull requests will help you understand whether your team is improving over time. You could also drill down by team member to see if individuals are improving too.

05.4

Test Coverage Ratio

This metric is simply the ratio between the number of lines of code that all test cases currently execute and the total number of lines of code in the piece of software you are testing.

What is the generally accepted ‘sufficient’ test coverage when measured by number of lines of code executed? The consensus hovers around 80% – higher for critical systems (definition of critical may vary by industry, geography, user base etc.).
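The arithmetic itself is simple; as a small sketch, with line counts that would in practice come from your coverage tool:

```python
# Hypothetical numbers reported by a coverage tool.
executed_lines = 8_200
total_lines = 10_000

coverage = 100 * executed_lines / total_lines
print(f"Test coverage: {coverage:.1f}%")  # 82.0%
print("At or above the ~80% consensus" if coverage >= 80 else "Below the ~80% consensus")
```
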
When to use it?

You certainly don’t need 100% test coverage. However, knowing where you stand and keeping track of it helps you see whether you are trading quality for velocity. Keep in mind that “a high-quality product built on bad requirements is a poor-quality product”, especially where test coverage is concerned.

05.5

Mean Time Between Failures (MTBF) and Mean Time To Recover (MTTR)

Both metrics measure how the software performs in the production environment. Since software failures are almost unavoidable, these software metrics attempt to quantify how well the software recovers and preserves data. 

If the MTTR value grows smaller over time, that means developers are becoming more effective in understanding issues, such as bugs, and how to fix them.
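One common way to derive both, assuming you record when each production incident started and when service was recovered (the incidents below are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical production incidents: (failure_start, recovered_at).
incidents = [
    (datetime(2019, 6, 1, 2, 0), datetime(2019, 6, 1, 3, 30)),
    (datetime(2019, 6, 9, 14, 0), datetime(2019, 6, 9, 14, 45)),
    (datetime(2019, 6, 20, 8, 0), datetime(2019, 6, 20, 9, 0)),
]
observation_window = timedelta(days=30)

total_downtime = sum((end - start for start, end in incidents), timedelta())
mttr = total_downtime / len(incidents)                         # mean time to recover
mtbf = (observation_window - total_downtime) / len(incidents)  # mean uptime between failures

print(f"MTTR: {mttr}")
print(f"MTBF: {mtbf}")
```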

When to use them?

These metrics are very interesting if used by the team with a specific goal: “We need to achieve this level of MTBF or MTTR on our product.” That will foster responsiveness from your team on important issues raised by customers, and will help you keep a high standard for your product as well as for your team. To improve on these metrics, the team may realize it needs to address the root cause of issues instead of applying quick patches.

05.6

Service-Level Agreement (SLA)

Every team has its own definition of an SLA, but here is the one that Airbnb uses, and you could find it very interesting. The SLA is the percentage of blocker and critical bugs that your team fixed and deployed within a certain time (e.g., 24 hours for blocker bugs and 5 days for critical bugs). What you might really like about this metric is that it gives you a great understanding of your product quality from a user’s standpoint.
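A minimal sketch of that definition; the time limits per priority mirror the example above, while the bug records themselves are hypothetical:

```python
from datetime import timedelta

# Hypothetical resolved bugs: priority and time from report to deployed fix.
bugs = [
    {"priority": "blocker", "time_to_fix": timedelta(hours=20)},
    {"priority": "blocker", "time_to_fix": timedelta(hours=30)},
    {"priority": "critical", "time_to_fix": timedelta(days=3)},
]
limits = {"blocker": timedelta(hours=24), "critical": timedelta(days=5)}

for priority, limit in limits.items():
    matching = [b for b in bugs if b["priority"] == priority]
    on_time = sum(1 for b in matching if b["time_to_fix"] <= limit)
    print(f"{priority} SLA: {100 * on_time / len(matching):.0f}% fixed within the limit")
```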

When to use it?

This metric is very close to MTTR, but it is not limited to software failures: it extends to any type of bug. Similarly, it is most interesting when used by the team with a specific goal: “We need to achieve this SLA on our product.” This metric fosters product quality ownership and responsiveness from your team. That’s why Airbnb uses it.

05.7

Defect Removal Efficiency (DRE)

The Defect Removal Efficiency is used to quantify how many defects were found by the end user after product delivery (D) in relation to the errors found before product delivery (E). The formula is: DRE = E / (E+D)

The closer to 1 DRE is, the fewer defects found after product delivery. An average DRE score is usually around 85% across a full testing programme. However, with a thorough and comprehensive requirements and design inspection process, this can be expected to lift to around 95%.
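A quick worked example of the formula, with hypothetical defect counts:

```python
# Hypothetical defect counts for one release.
E = 170  # defects found before delivery (testing, reviews)
D = 30   # defects reported by end users after delivery

dre = E / (E + D)
print(f"DRE: {dre:.2f}")  # 0.85, i.e. 85% of defects were removed before delivery
```
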
When to use it?

This metric serves a similar purpose as keeping track of the evolution for your number of bugs. It might be redundant to track both of them. We have a preference for the number of bugs as you can differentiate which bug priority matters to you now and still have a notion of the overall amount (not just the trend).

05.8

Application Crash Rate (ACR)

Application crash rate is calculated by dividing how many times an application fails (F) by how many times it is used (U). But there are actually several ways you can compute it.

  • App crashes per user: This number shows how many users have ever faced a crash. An acceptable range for this metric would be < 1%. This number should be lower for mature apps, as their functionality is more stable. However, when calculating the numbers for the whole app, faulty updates that ended up being rolled back can be excluded for a more accurate representation.
  • App crashes per session: This number shows how many times the app crashed compared to the number of sessions. An acceptable range for this metric would be < 0.1%. It can also be categorized by session type and app flow for a better understanding of the issue.
  • App crashes per screen view: This number compares the number of crashes to the total number of screen views the app has received. An acceptable range for this metric would be < 0.01%. It should be categorized by screen to understand the impact of crashes on the delivery of functionality.
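A minimal sketch of the three variants; the counts are hypothetical and would in practice come from your crash-reporting tool:

```python
# Hypothetical counts from a crash-reporting tool, over one release.
users_who_crashed = 60
total_users = 9_000
crashes = 45
sessions = 60_000
screen_views = 500_000

print(f"Crashes per user:        {100 * users_who_crashed / total_users:.2f}%  (target < 1%)")
print(f"Crashes per session:     {100 * crashes / sessions:.3f}%  (target < 0.1%)")
print(f"Crashes per screen view: {100 * crashes / screen_views:.4f}%  (target < 0.01%)")
```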

When to use it?

This metric is interesting when you have a specific goal in mind. For instance, you could strive to reach an ACR that is less than 0.25% in your software’s most critical user flows.

05.9

Defect Density

There are 2 different ways to look at defect density:

Size-oriented metrics focus on the size of the software and are usually expressed as kilo lines of code (KLOC). It is a fairly easy software metric to collect once decisions are made about what constitutes a line of code. Unfortunately, it is not useful for comparing software projects written in different languages. Some examples include errors per KLOC or defects per KLOC. 

Function-oriented metrics focus on how much functionality software offers. But functionality cannot be measured directly, so function-oriented software metrics rely on calculating the function point (FP) – a unit of measurement that quantifies the business functionality provided by the product. Function points are also useful for comparing software projects written in different languages. This metric would look at errors per FP or defects per FP.

Function points are not an easy concept to master and methods vary. This is why many software development managers and teams skip function points altogether. They do not perceive function points as worth the time.
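For illustration, here is a quick sketch of both variants, with hypothetical defect counts, code size and function points:

```python
# Hypothetical counts for one project.
defects = 24
lines_of_code = 60_000
function_points = 320  # from a (hypothetical) function point analysis

print(f"Defects per KLOC: {defects / (lines_of_code / 1000):.2f}")  # 0.40
print(f"Defects per FP:   {defects / function_points:.3f}")         # 0.075
```
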
When to use them?

Size-oriented metrics rely on lines of code, which makes them of limited use on their own: you shouldn’t compare two different software projects with them, which is why you might not be a big fan of using them. Function-oriented metrics, meanwhile, are difficult to compute and agree on. You might want to introduce control measures with them, but there are probably better indicators for that in this list.

05.10

Age of Dependencies

Another indicator of technical debt is how outdated the dependencies used in your code base are. It can be interesting to track this as an average across all dependencies, possibly with a variant that flags when a single dependency is very old and requires your attention.
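As an illustrative sketch, the release dates would come from your package manager’s metadata; the dependency names and dates below are hypothetical:

```python
from datetime import date

# Hypothetical release dates of the dependency versions currently pinned.
pinned_release_dates = {
    "web-framework": date(2017, 5, 10),
    "http-client": date(2019, 1, 22),
    "logging-lib": date(2018, 8, 3),
}

today = date(2019, 7, 1)
ages = {name: (today - released).days for name, released in pinned_release_dates.items()}

average_age = sum(ages.values()) / len(ages)
oldest = max(ages, key=ages.get)
print(f"Average dependency age: {average_age / 365:.1f} years")
print(f"Oldest dependency: {oldest} ({ages[oldest] / 365:.1f} years old)")
```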

When to use it?

This metric is mostly of interest to technical leads; it is too operational and too tied to the code base to be used by a manager. If your projects have many dependencies, keeping track of dependency age is definitely worth considering.

Please note that some metrics didn’t make the list because they are not popular enough or are too controversial. For instance, evaluating code complexity from the visual shape of the code – a language-agnostic measure based on the depth of indentation of source code lines, also called whitespace complexity – is too controversial and not really actionable per se. What business value would you get from it, when a high value doesn’t necessarily mean the code quality is bad?

About Anaxi

Anaxi is the system of record for software engineering organizations that need to facilitate decision processes. Anaxi opens a new data-driven era for engineering organizations, empowering them to better allocate their development time and resources, and iterate on their workflows. 

We’re in the process of building a platform for you to track those metrics and collaborate with your team. Contact us if you’re interested in learning more for your own organization.

Need visibility into your software development lifecycle?