What is MTTR (Mean Time to Resolution)?
Your checkout service went down at 2:47 AM on a Tuesday. The on-call engineer got paged, pulled in two teammates, and spent the next four hours tracing the problem through logs, metrics, and deployment history. By 6:52 AM the service was back, the root cause was documented, and the ticket was closed. That roughly four-hour window is the resolution time for that incident, and it's the raw input to your mean time to resolution.
What is MTTR?
Mean time to resolution (MTTR) is one of the most widely tracked reliability metrics in software engineering. It measures the average time between when an incident is detected and when it’s fully resolved, including root cause identification and any follow-up fixes. MTTR gives engineering teams a concrete number to benchmark against, track over time, and use as a signal for how well their incident response process is actually working.
This article breaks down what MTTR really measures, how to calculate it, the common mistakes teams make when tracking it, and how modern AI-driven approaches are changing what’s possible.
Understanding Mean Time to Resolution
Before diving into the formula, it’s worth addressing a common source of confusion. The acronym “MTTR” gets used to mean at least four different things depending on who you ask:
- Mean Time to Resolution measures the full lifecycle from detection to complete resolution, including root cause analysis and any corrective actions. This is the broadest definition and the one most SRE teams use today.
- Mean Time to Recovery measures how long it takes for a system to return to normal operation after a failure. DORA research uses this variant as one of its four key software delivery metrics.
- Mean Time to Repair focuses specifically on the active repair work, excluding detection and diagnosis time. It’s more common in hardware and manufacturing contexts.
- Mean Time to Respond (sometimes called Mean Time to Acknowledge) measures only how quickly someone begins working on an incident after it’s detected.
These distinctions matter because comparing your team’s MTTR to an industry benchmark is meaningless if you’re measuring different things. When someone reports an “MTTR of 45 minutes,” the first question should always be: which MTTR?
The Formula
The calculation itself is straightforward:
MTTR = Total resolution time across all incidents / Number of incidents
For example, if your team resolved five incidents this month with resolution times of 1 hour, 3 hours, 30 minutes, 6 hours, and 2 hours, your mean time to resolution would be:
(1 + 3 + 0.5 + 6 + 2) / 5 = 2.5 hours
Most teams calculate MTTR on a rolling window (weekly or monthly) and segment by severity level. A P1 MTTR of 2.5 hours tells a very different story than a P4 MTTR of 2.5 hours.
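The formula above takes only a few lines to implement. A minimal sketch in Python, using the five incidents from the example; the severity labels are hypothetical, added to show per-severity segmentation:

```python
from statistics import mean

# Resolution times in hours for the five incidents in the example above
incidents = [
    {"severity": "P1", "hours": 1.0},
    {"severity": "P2", "hours": 3.0},
    {"severity": "P4", "hours": 0.5},
    {"severity": "P1", "hours": 6.0},
    {"severity": "P3", "hours": 2.0},
]

def mttr(items):
    """Total resolution time divided by number of incidents."""
    return mean(i["hours"] for i in items)

print(f"Overall MTTR: {mttr(incidents):.1f} h")  # 2.5 h
p1_only = [i for i in incidents if i["severity"] == "P1"]
print(f"P1-only MTTR: {mttr(p1_only):.1f} h")    # 3.5 h
```

Note how the P1-only figure (3.5 hours) differs from the blended average (2.5 hours): segmenting by severity changes the story the number tells.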
Calculate your MTTR
Use this calculator to find your team’s MTTR from real incident data. Quick mode takes your total incident count and total downtime. Detailed mode lets you enter each incident’s resolution time individually for a more precise average.
The calculator compares your result against DORA performance benchmarks. Elite teams recover in under an hour. Most organizations land between 1 and 24 hours. If your number puts you in the Medium or Low range, the bottleneck is almost always the investigation phase, not the fix itself.
Example: 5 incidents with 225 total minutes of downtime = 45 min MTTR.
Compare your MTTR against DORA-style incident response tiers.
| Level | MTTR |
|---|---|
| Elite | < 1 hour |
| High | 1 hour to 24 hours |
| Medium | 1 day to 1 week |
| Low | > 1 week |
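The tiers in the table above are easy to encode if you want to label your own incident data. A small sketch (the function name and thresholds simply mirror the table):

```python
def dora_tier(mttr_hours: float) -> str:
    """Map an MTTR value in hours to the DORA-style tiers in the table above."""
    if mttr_hours < 1:
        return "Elite"
    if mttr_hours <= 24:
        return "High"
    if mttr_hours <= 24 * 7:
        return "Medium"
    return "Low"

print(dora_tier(0.75))  # Elite
print(dora_tier(2.5))   # High
print(dora_tier(72))    # Medium
print(dora_tier(200))   # Low
```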
How Mean Time to Resolution Works in Practice
MTTR becomes most valuable when you track it over time and break it down into phases. A typical incident resolution timeline has four distinct stages:
- Detection (time to notice something is wrong)
- Triage (time to assess severity and assign the right people)
- Diagnosis (time to identify the root cause)
- Fix and verification (time to implement and validate the repair)
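Tracking the four phases separately is what makes the bottleneck visible. A sketch with hypothetical phase durations for a single incident, in minutes:

```python
# Hypothetical phase durations (minutes) for one incident
phases = {
    "detection": 4,
    "triage": 7,
    "diagnosis": 95,
    "fix_and_verification": 22,
}

total = sum(phases.values())
bottleneck = max(phases, key=phases.get)

for name, minutes in phases.items():
    print(f"{name:>22}: {minutes:4d} min ({minutes / total:.0%})")
print(f"Bottleneck phase: {bottleneck}")
```

In this made-up incident, diagnosis alone accounts for roughly three quarters of the resolution time, which is the pattern the example below illustrates at scale.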
Consider a mid-size e-commerce company tracking MTTR across a month of 12 incidents. Looking at the raw MTTR numbers alone, you'd see an average of about 2 hours and 23 minutes. That's useful as a trend line, but the real insight comes from the phase breakdown. In incident after incident, diagnosis took the longest. The team's detection and triage were fast. Their fix implementation was fast. But figuring out what was actually wrong consumed 60-80% of total resolution time.
This is a common pattern. According to DORA’s State of DevOps research, elite-performing teams restore service in under one hour, while low performers can take anywhere from one week to one month. The gap almost always comes down to how quickly teams can diagnose problems, not how quickly they can deploy fixes.
MTTR as a DORA Metric
DORA (DevOps Research and Assessment, now part of Google Cloud) established four key metrics for measuring software delivery performance: deployment frequency, lead time for changes, change failure rate, and time to restore service. The last of these, DORA's variant of MTTR, captures how long it takes to recover from a production failure.
The Accelerate State of DevOps Report performance benchmarks break down roughly as:
- Elite: Less than one hour
- High: Less than one day
- Medium: Between one day and one week
- Low: More than one week
These benchmarks provide useful context, but they’re industry-wide averages. Your target MTTR should reflect your system’s specific reliability requirements and your users’ tolerance for downtime.
Common Pitfalls When Tracking Mean Time to Resolution
MTTR seems simple to measure, but teams consistently run into the same problems.
Inconsistent clock starts. Does your MTTR timer begin when the incident occurs, when monitoring detects it, or when a human acknowledges the alert? If different teams or even different incident types use different starting points, the aggregate number is meaningless. Pick one definition and enforce it across the organization.
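One way to enforce a single clock start is to make it structural: every incident record carries the same timestamps, and resolution time is derived from exactly one definition rather than left to each team's judgment. A hypothetical sketch:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    detected_at: datetime      # when monitoring fired: the org-wide clock start
    acknowledged_at: datetime  # when a human picked it up
    resolved_at: datetime      # when the root-cause fix was verified

    @property
    def resolution_time(self) -> timedelta:
        # One definition, everywhere: detection to full resolution
        return self.resolved_at - self.detected_at

inc = Incident(
    detected_at=datetime(2024, 5, 7, 2, 47),
    acknowledged_at=datetime(2024, 5, 7, 2, 52),
    resolved_at=datetime(2024, 5, 7, 6, 52),
)
print(inc.resolution_time)  # 4:05:00
```

Because `resolution_time` is computed, not entered by hand, there's no room for one team to start the clock at acknowledgment while another starts it at detection.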
Mixing severity levels. Averaging P1 and P4 incidents together produces a number that’s useful for neither. A team with one 8-hour P1 and twenty 15-minute P4s might report a “great” MTTR that completely hides a serious problem. Always segment by severity.
Gaming the metric. When MTTR becomes a target tied to performance reviews or SLAs, teams find creative ways to hit the number: closing tickets before the root cause is actually identified, splitting one incident into multiple smaller ones, or reclassifying severity after the fact. Goodhart’s Law applies: when a measure becomes a target, it ceases to be a good measure.
Ignoring outliers. A single 72-hour outage can destroy an otherwise healthy monthly MTTR. Consider tracking both mean and median, or using percentiles (P50, P90, P95) to get a more accurate picture.
Not tracking the phases. A single MTTR number tells you the “what” but not the “why.” If you don’t break resolution time into detection, triage, diagnosis, and fix phases, you can’t identify which part of your process needs improvement.
How AI is Reducing Mean Time to Resolution
The diagnosis phase is where most resolution time goes. An engineer pulls up dashboards, queries log aggregators, checks recent deployments, traces requests across services, and tries to build a mental model of what changed and why. In complex distributed systems with dozens of services, this can take hours.
This is where AI-driven investigation is making the biggest impact. Instead of a human manually correlating signals across five different tools, an AI agent can ingest alerts, pull relevant logs and metrics, cross-reference recent code changes, trace dependencies, and produce a root cause analysis in minutes rather than hours.
Platforms like NeuBird AI use context engineering to dynamically assemble the right information at query time, rather than requiring engineers to know which dashboard to check or which log group to search. The result is a direct compression of the diagnosis phase, which is typically the longest phase in the resolution timeline. NeuBird reports 94% accuracy in root cause identification, which means fewer false starts and less time spent chasing the wrong signals.
The shift isn’t about replacing engineers. It’s about eliminating the hours spent on manual correlation so that humans can focus on the decision-making that actually requires judgment: approving a fix, deciding whether to roll back, or assessing blast radius.
Key Takeaways
- MTTR (mean time to resolution) measures the average time from incident detection to full resolution, but clarify which “MTTR” you’re tracking since the acronym covers four different metrics.
- The formula is simple (total resolution time / number of incidents), but segment by severity and track over rolling windows for actionable data.
- Break MTTR into phases (detection, triage, diagnosis, fix) to identify where your process is slowest. For most teams, diagnosis dominates.
- Avoid common traps: inconsistent clock starts, mixing severity levels, gaming the metric, and ignoring outliers.
- AI-driven investigation tools are compressing the diagnosis phase from hours to minutes, which directly reduces overall MTTR.
Related Reading
- What is Incident Management? – The end-to-end process that MTTR helps measure and improve.
- What is Alert Fatigue? – How alert noise delays detection and inflates MTTR.
- What is an AI SRE? – How AI agents are changing incident response and reliability engineering.
- Tackling Observability Scale with Context Engineering – How context engineering enables faster diagnosis in complex environments.
Frequently Asked Questions
What does MTTR stand for?
MTTR can stand for Mean Time to Resolution, Mean Time to Recovery, Mean Time to Repair, or Mean Time to Respond. The most common usage in SRE and DevOps is Mean Time to Resolution, which measures the full lifecycle from incident detection to complete resolution including root cause fix.
How is MTTR calculated?
MTTR is calculated as total resolution time across all incidents divided by the number of incidents. For example, if your team resolved 5 incidents totaling 12.5 hours, your MTTR is 2.5 hours. Most teams calculate MTTR on a rolling weekly or monthly window.
What is a good MTTR benchmark?
According to DORA research, elite-performing teams restore service in under one hour. High performers do it in less than a day. Medium performers take between one day and one week. Low performers take more than a week. Your target should reflect your specific reliability requirements, not just industry averages.
What's the difference between MTTR and MTTM?
MTTR (Mean Time to Resolution) measures the full incident lifecycle including root cause fix. MTTM (Mean Time to Mitigation) measures only how long it takes to stop user impact, which usually happens before root cause is found. A team can mitigate in 5 minutes and still take hours to fully resolve.
Why is my MTTR getting worse?
Common causes include growing system complexity, alert fatigue delaying detection, observability gaps that slow diagnosis, increased on-call burden, and accumulating technical debt. Break MTTR into phases (detection, triage, diagnosis, repair) to identify which phase is degrading.
Can AI actually reduce MTTR?
Yes, significantly. AI directly compresses the diagnosis phase, which typically consumes 40-60% of total MTTR. AI agents can simultaneously query metrics, logs, traces, and deployment history, then correlate findings to identify root causes in minutes instead of hours. NeuBird AI reports 94% accuracy in automated root cause identification, which translates directly to lower MTTR and fewer 3 AM investigations.
Should MTTR be tied to engineer performance reviews?
Generally no. When MTTR becomes a personal performance metric, engineers find ways to game it: closing tickets early, reclassifying severity, or splitting incidents. Track MTTR as a team and system metric, not an individual one.
What is the difference between MTTR and MTBF?
MTBF (Mean Time Between Failures) measures reliability: how long a system runs before failing. MTTR measures recoverability: how long it takes to restore service after a failure. Together they describe your system’s overall availability profile. MTBF is about preventing failures; MTTR is about recovering from them.
Is MTTR a KPI?
Yes, MTTR is one of the most widely tracked operational KPIs in software engineering. It’s a key indicator of incident response effectiveness and is included as one of the four DORA metrics that measure software delivery performance.
How do you calculate MTTR in hours?
Add up the total resolution time for all incidents in your measurement period (in hours), then divide by the number of incidents. For example, 5 incidents totaling 12.5 hours of resolution time gives you an MTTR of 2.5 hours.
What is a good MTTR percentage?
MTTR isn’t measured as a percentage; it’s measured in time units (minutes or hours). You may be thinking of availability, which is often expressed as a percentage (99.9%, 99.99%). MTTR contributes to availability: lower MTTR means higher availability for the same number of incidents.
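That relationship can be made concrete. Steady-state availability is commonly approximated as MTBF / (MTBF + MTTR); a sketch with hypothetical numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Roughly 30 days between failures, 2.5-hour MTTR
print(f"{availability(720, 2.5):.4%}")
# Halving MTTR raises availability without changing the failure rate
print(f"{availability(720, 1.25):.4%}")
```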