What is MTTM (Mean Time to Mitigation)?

Definition

Mean time to mitigation (MTTM) measures how long it takes to stop the user impact of an incident, regardless of whether the root cause has been identified. It’s the time between “we know something is wrong” and “users are no longer affected.” This is different from mean time to resolution (MTTR), which tracks how long it takes to fully resolve the underlying problem.

A database connection pool is exhausted. Your API is returning errors to thousands of users. The on-call engineer has two choices: spend the next hour tracing why connections are leaking, or restart the affected pods right now and investigate the leak afterward. The first option is thorough. The second option stops users from seeing errors in under two minutes.

The distinction between mitigation and resolution is one of the most important concepts in incident response, and one of the most commonly confused. This article explains what MTTM measures, why it matters, how to calculate it, and how to get better at it.

Why Mean Time to Mitigation Matters

The Google SRE Book makes this point directly: during an active incident, the top priority is restoring service, not finding the root cause. The book’s incident management framework explicitly separates mitigation from debugging. Your first job is to stop the bleeding. Your second job is to understand why the patient was bleeding in the first place.

This philosophy exists for a practical reason. Root cause analysis can take hours. Meanwhile, users are experiencing errors, revenue is being lost, and the blast radius may be expanding. A team that mitigates in 5 minutes and resolves in 4 hours delivers a dramatically better user experience than a team that resolves in 2 hours but doesn’t mitigate until they’ve found the root cause.

MTTM captures this distinction. It answers the question: how fast can your team restore service when something breaks?

The Formula

MTTM = Total mitigation time across all incidents / Number of incidents

Mitigation time starts when the incident is detected (or when the first alert fires) and ends when user impact has been eliminated or reduced to acceptable levels. Note that the root cause does not need to be identified for the timer to stop.

For example, if your team handled four incidents this month with mitigation times of 3 minutes, 15 minutes, 8 minutes, and 45 minutes, your MTTM would be:

(3 + 15 + 8 + 45) / 4 = 17.75 minutes
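
The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation; the function name `mean_time_to_mitigation` and the variable names are assumptions made for the example.

```python
from statistics import mean


def mean_time_to_mitigation(mitigation_minutes):
    """Return the arithmetic mean of per-incident mitigation times, in minutes."""
    if not mitigation_minutes:
        raise ValueError("need at least one incident to compute MTTM")
    return mean(mitigation_minutes)


# Mitigation times from the example above, in minutes.
monthly_mitigation_minutes = [3, 15, 8, 45]
mttm = mean_time_to_mitigation(monthly_mitigation_minutes)
print(f"MTTM: {mttm} minutes")  # prints "MTTM: 17.75 minutes"
```

Raising on an empty list is a deliberate choice here: a month with no incidents has no MTTM, and reporting zero would be misleading.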

Mean Time to Mitigation vs. Mean Time to Resolution

The relationship between MTTM and MTTR is sequential. Mitigation comes first, resolution comes after. They measure different phases of the same incident.

Here’s a concrete example that illustrates the difference:

Incident: An e-commerce site’s product search is returning stale results.

  • Detection: Monitoring catches that search index freshness has dropped below the SLO threshold. (T+0)
  • Mitigation: The team fails over to a backup search index that’s slightly older but functional. Users can search again. (T+8 minutes)
  • Root cause investigation: Engineers discover that a new data pipeline job is holding a lock on the primary index, preventing updates. (T+2 hours)
  • Resolution: The pipeline job is fixed, the lock behavior is corrected, the primary index catches up, and the failover is reversed. (T+3 hours)

In this case:

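  • Time to mitigation = 8 minutes (time to restore user-facing functionality). This is the value the incident contributes to MTTM.
  • Time to resolution = 3 hours (time to fully resolve the underlying issue). This is the value the incident contributes to MTTR.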

Both numbers are useful. MTTM tells you how well your team protects users during incidents. MTTR tells you how quickly you eliminate the underlying risk.
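
One way to keep the two numbers apart is to record three timestamps per incident instead of one. The sketch below assumes a hypothetical `IncidentTimeline` record with `detected_at`, `mitigated_at`, and `resolved_at` fields (these names are illustrative, not taken from any particular incident tool) and replays the search-index example. The calendar date is arbitrary; only the offsets matter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    detected_at: datetime   # monitoring or first alert (T+0)
    mitigated_at: datetime  # user impact stopped
    resolved_at: datetime   # underlying issue fixed

    def time_to_mitigation(self) -> timedelta:
        return self.mitigated_at - self.detected_at

    def time_to_resolution(self) -> timedelta:
        return self.resolved_at - self.detected_at


t0 = datetime(2024, 1, 1, 9, 0)  # arbitrary illustrative T+0
search_incident = IncidentTimeline(
    detected_at=t0,
    mitigated_at=t0 + timedelta(minutes=8),  # failover to backup index
    resolved_at=t0 + timedelta(hours=3),     # pipeline fixed, failover reversed
)

print(search_incident.time_to_mitigation())  # 0:08:00
print(search_incident.time_to_resolution())  # 3:00:00
```

Both durations are measured from the same detection timestamp, so time to resolution always includes time to mitigation.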

Common Mitigation Strategies

Effective mitigation doesn’t require understanding the root cause. It requires having reliable levers you can pull quickly:

  • Rollback: Revert the most recent deployment. If the incident correlates with a deploy, this is often the fastest path to mitigation.
  • Failover: Switch traffic to a healthy replica, region, or backup system. This works for stateless services and databases with read replicas.
  • Traffic shifting: Route traffic away from the affected component using load balancer rules or feature flags.
  • Scaling: If the problem is capacity-related, adding instances or increasing resource limits can buy time.
  • Feature flags: Disable the specific feature causing problems without taking down the entire service.
  • Restart: Sometimes the simplest approach works. Restarting affected pods or processes clears corrupted state and restores functionality.

The Google SRE Book emphasizes that teams should build and practice these mitigation capabilities proactively, not figure them out during an active incident. Runbooks should include mitigation steps as the first section, before root cause investigation procedures.
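
As a rough illustration of what "mitigation steps first" can look like in a runbook, the sketch below maps a few observable signals to a first lever to try. Everything in it is an assumption made for the example: the `IncidentSignals` fields, the `suggest_first_lever` function, and especially the order of the checks, which a real team would tune to its own systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentSignals:
    recent_deploy: bool = False              # incident correlates with a deploy
    flagged_feature: Optional[str] = None    # failing code path sits behind this flag
    healthy_replica_available: bool = False  # a replica, region, or backup can take traffic
    capacity_saturated: bool = False         # the problem looks capacity-related


def suggest_first_lever(signals: IncidentSignals) -> str:
    """Return a first mitigation to try; the ordering is illustrative only."""
    if signals.recent_deploy:
        return "rollback"
    if signals.flagged_feature is not None:
        return f"disable feature flag: {signals.flagged_feature}"
    if signals.healthy_replica_available:
        return "failover"
    if signals.capacity_saturated:
        return "scale up"
    return "restart affected pods or processes"


print(suggest_first_lever(IncidentSignals(recent_deploy=True)))  # rollback
print(suggest_first_lever(IncidentSignals()))  # restart affected pods or processes
```

No branch asks why the incident happened. Each check relies only on signals that are available in the first minutes of the response.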

Setting MTTM Targets

Setting realistic MTTM targets depends on your system’s architecture and your team’s operational maturity. Some guidelines:

  • For SEV1 incidents (major outages): Target mitigation within 15 minutes. If your team can’t stop user impact within 15 minutes, examine whether mitigation capabilities (rollback, failover, feature flags) are in place and practiced.
  • For SEV2 incidents (significant degradation): Target mitigation within 30 minutes. These incidents are serious but typically don’t require the same all-hands urgency as SEV1s.
  • For SEV3 incidents (minor impact): Mitigation within a few hours is often acceptable, since user impact is limited.

The important thing is to measure and trend MTTM separately from MTTR. If your MTTM is consistently low but your MTTR is high, your team is good at stopping the bleeding but slow at finding and fixing root causes. If your MTTM is high, the priority should be investing in better mitigation capabilities and faster detection.
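
Trending MTTM against these targets is easier when it is computed per severity. The following is a minimal sketch that assumes incidents are available as `(severity, mitigation_minutes)` pairs; the function names and sample data are invented for illustration. SEV3 is left out of the target table because the guideline above gives no fixed number for it.

```python
from collections import defaultdict
from statistics import mean

# Targets from the guidelines above, in minutes. SEV3 ("a few hours") has no
# fixed number in the text, so it is omitted rather than guessed.
TARGET_MINUTES = {"SEV1": 15, "SEV2": 30}


def mttm_by_severity(incidents):
    """Group (severity, mitigation_minutes) pairs and average each group."""
    grouped = defaultdict(list)
    for severity, minutes in incidents:
        grouped[severity].append(minutes)
    return {severity: mean(times) for severity, times in grouped.items()}


def severities_over_target(incidents):
    """Return the severities whose MTTM exceeds the target, sorted by name."""
    mttm = mttm_by_severity(incidents)
    return sorted(
        severity
        for severity, value in mttm.items()
        if severity in TARGET_MINUTES and value > TARGET_MINUTES[severity]
    )


# Invented sample data for illustration only.
sample = [("SEV1", 10), ("SEV1", 26), ("SEV2", 20), ("SEV2", 25), ("SEV3", 90)]
print(severities_over_target(sample))  # ['SEV1']
```

In this sample the SEV1 average is 18 minutes, which misses the 15-minute target even though one of the two incidents was mitigated in 10.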

Challenges with Tracking Mean Time to Mitigation

Despite its importance, MTTM is less commonly tracked than MTTR. Several factors contribute to this.

Confusing mitigation with resolution. Many teams don’t distinguish between the two. Their incident tracking system has a single “resolved” timestamp, with no separate field for “user impact mitigated.” Without this data point, you can’t calculate MTTM.

Pressure to find the real fix. Engineering culture sometimes treats mitigation as a lesser achievement. “We just restarted the pods” feels less satisfying than “we found and fixed the bug.” This cultural pressure can lead teams to delay mitigation in favor of pursuing root cause, which is exactly backward during an active incident.

Unclear mitigation criteria. When is impact “mitigated”? If error rates drop from 50% to 2%, is that mitigated? Teams need clear, measurable criteria for what constitutes successful mitigation, ideally tied to SLO thresholds.
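
A measurable criterion can be as simple as "the error rate has stayed at or below the SLO threshold for N consecutive readings." The sketch below is one possible formulation under that assumption. The function name `is_mitigated`, the `sustained_samples` parameter, and the threshold values are hypothetical, chosen only to replay the 50%-to-2% question above.

```python
def is_mitigated(error_rates, slo_threshold, sustained_samples=3):
    """Return True when the most recent readings are all at or below the threshold.

    error_rates: chronological error-rate readings as fractions (0.02 == 2%).
    sustained_samples: how many consecutive recent readings must pass.
    """
    if sustained_samples <= 0:
        raise ValueError("sustained_samples must be positive")
    if len(error_rates) < sustained_samples:
        return False  # not enough data yet to declare mitigation
    recent = error_rates[-sustained_samples:]
    return all(rate <= slo_threshold for rate in recent)


# Error rate falls from 50% to 2% and holds there.
readings = [0.50, 0.31, 0.02, 0.02, 0.02]
print(is_mitigated(readings, slo_threshold=0.01))  # False: 2% still breaches a 1% SLO
print(is_mitigated(readings, slo_threshold=0.05))  # True: 2% is within a 5% SLO
```

The same readings give opposite answers under the two thresholds, which is why the criterion needs to be agreed on before the incident rather than during it.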

Not tracking mitigation separately in tooling. Most incident management platforms let you track incident lifecycle stages, but teams often don’t configure them to capture mitigation as a distinct event. Without tooling support, the data doesn’t get recorded consistently.

Over-reliance on a single strategy. Teams that only know how to roll back struggle when the incident isn’t deploy-related. Building multiple mitigation capabilities across different failure modes is essential.

How AI is Improving Mean Time to Mitigation

AI-driven tools are improving MTTM in two specific ways.

First, faster pattern matching. AI agents can instantly correlate a current incident with similar past incidents and surface the mitigation that worked before. If your team restarted the cache layer the last three times this alert pattern appeared, the AI can suggest that action within seconds of the alert firing, rather than waiting for an engineer to remember or search through past incident records.
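
The core idea, looking up what worked last time for the same alert pattern, can be shown with a deliberately simple stand-in. The sketch below uses exact matching on a hypothetical `alert_signature` string and a frequency count. It does not describe how any particular AI product works; real tools would use far richer correlation than string equality.

```python
from collections import Counter


def suggest_mitigation(alert_signature, past_incidents):
    """Return (action, times_used) for the most frequent past mitigation, or None."""
    actions = Counter(
        incident["mitigation"]
        for incident in past_incidents
        if incident["alert_signature"] == alert_signature
    )
    if not actions:
        return None  # no history for this pattern; a human has to decide
    return actions.most_common(1)[0]


# Invented incident history for illustration only.
history = [
    {"alert_signature": "cache-latency-high", "mitigation": "restart cache layer"},
    {"alert_signature": "cache-latency-high", "mitigation": "restart cache layer"},
    {"alert_signature": "cache-latency-high", "mitigation": "restart cache layer"},
    {"alert_signature": "db-pool-exhausted", "mitigation": "restart API pods"},
]

print(suggest_mitigation("cache-latency-high", history))  # ('restart cache layer', 3)
print(suggest_mitigation("disk-full", history))           # None
```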

Second, automated mitigation execution. For well-understood incident types with proven mitigation playbooks, AI agents can execute the mitigation automatically. Neubird AI SRE connects to existing infrastructure tooling and can trigger pre-approved mitigation actions (scaling, restarts, traffic shifts) without waiting for human intervention. The key constraint is that automated mitigation should be limited to low-risk, well-tested actions. Novel failure modes still require human judgment.
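
The constraint described above amounts to an allowlist check in front of any automated action. The sketch below is a hypothetical guard, not the API of Neubird or any other tool. The action names, the `route_mitigation` function, and the `matches_known_playbook` flag are all assumptions made for the example.

```python
# Hypothetical allowlist; the action names are illustrative, not a real tool's API.
PRE_APPROVED_ACTIONS = frozenset({"scale_up", "restart", "traffic_shift"})


def route_mitigation(proposed_action: str, matches_known_playbook: bool) -> str:
    """Decide whether a proposed mitigation may run without a human."""
    if matches_known_playbook and proposed_action in PRE_APPROVED_ACTIONS:
        return "auto_execute"
    # Novel failure modes and unlisted actions go to the on-call engineer.
    return "page_human"


print(route_mitigation("restart", matches_known_playbook=True))     # auto_execute
print(route_mitigation("restart", matches_known_playbook=False))    # page_human
print(route_mitigation("drop_table", matches_known_playbook=True))  # page_human
```

Both conditions must hold before anything runs automatically: the action must be on the list, and the incident must match a playbook that has already been proven.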

The combination of institutional memory (learning from every past incident) and context engineering (assembling the right information in real time) means the system gets faster at mitigation over time. Every resolved incident teaches the AI what works for your specific environment.

Key Takeaways

  • MTTM measures the average time from incident detection to user impact being stopped, which is distinct from MTTR (full resolution including root cause).
  • The Google SRE Book’s incident management philosophy is clear: mitigate first, investigate second. Stop the bleeding before diagnosing the disease.
  • Track MTTM separately from MTTR in your incident management tooling. A single “resolved” timestamp hides critical information.
  • Build multiple mitigation capabilities (rollback, failover, scaling, feature flags) and practice them regularly. Don’t figure out your mitigation strategy during an active incident.
  • AI tools are improving MTTM through instant pattern matching with past incidents and automated execution of proven mitigation playbooks.

Frequently Asked Questions

What does MTTM stand for?

MTTM stands for Mean Time to Mitigation. It measures the average time from incident detection to user impact being stopped, regardless of whether the root cause has been identified. It’s distinct from MTTR (Mean Time to Resolution), which includes the full root cause fix.

What's the difference between mitigation and resolution?

Mitigation stops the user impact. Resolution fixes the underlying problem. A team might restart a service in 3 minutes (mitigation) and then spend 2 hours finding and fixing the memory leak that caused the crash (resolution).

How is MTTM calculated?

MTTM equals total mitigation time across all incidents divided by the number of incidents. The clock starts at detection and stops when user-facing impact has been eliminated, even if the root cause is still being investigated.

Why does the Google SRE Book prioritize mitigation over root cause?

Because users care about service being restored, not about understanding why it broke. Investigating root cause while users are still impacted prolongs the outage. The SRE Book's guidance, in short, is to stop the bleeding and restore service first, then diagnose the cause.

What are common mitigation strategies?

The most common are rollback (revert the recent deployment), failover (switch to a healthy replica), traffic shifting (route around the problem), scaling (add capacity), feature flags (disable the problematic feature), and restart (clear corrupted state).

Should I track MTTM separately from MTTR?

Yes. They measure different things. A team with low MTTM but high MTTR is good at stopping user impact but slow at finding root causes. A team with high MTTM has insufficient mitigation capabilities. Tracking them separately tells you where to invest.

What's a realistic MTTM target?

For SEV1 incidents, target mitigation within 15 minutes. For SEV2, target 30 minutes. SEV3 can tolerate longer mitigation times since user impact is limited. These targets require having proven mitigation capabilities ready before incidents occur, not building them during the response.

What's the difference between MTTA and MTTM?

MTTA (Mean Time to Acknowledge) measures how long it takes for someone to respond to an alert after it fires. MTTM measures how long it takes to actually stop user impact. When both clocks start at the alert, the time to acknowledge is one component of the time to mitigate: you have to acknowledge before you can mitigate. A team with low MTTA but high MTTM is responsive but slow at applying fixes.

Is mitigation the same in cybersecurity and SRE?

The general concept is similar (reducing impact), but the specifics differ. In cybersecurity, mitigation might mean blocking an attacker, isolating a compromised system, or applying a patch. In SRE, mitigation typically means restoring user-facing service through rollbacks, failovers, or scaling. Both prioritize stopping harm before fully understanding the cause.

How do you calculate mean time to mitigate?

MTTM equals the total time across all incidents from detection to user impact being stopped, divided by the number of incidents. The clock starts at detection (or alert firing) and stops when monitoring confirms user-facing impact has been eliminated, regardless of whether root cause is known.

What is mitigation vs containment?

In incident response, mitigation reduces or eliminates user impact (rollback a deployment, fail over to a healthy region). Containment prevents the issue from spreading further (isolate affected services, block traffic to compromised systems). For most production incidents, mitigation is the primary concern; containment matters more in security incidents.
