What is MTTM (Mean Time to Mitigation)

Mean time to mitigation (MTTM) measures how long it takes to stop the user impact of an incident, regardless of whether the root cause has been identified. The Google SRE Book emphasizes that during active incidents, restoring service takes priority over root cause analysis.

Why Mean Time to Mitigation Matters

Root cause analysis can extend for hours while users experience errors and revenue declines. A team that mitigates in 5 minutes and resolves in 4 hours delivers a better user experience than a team that resolves in 2 hours but leaves users impacted for those 2 hours because it waited for the root cause before acting.

The formula:

MTTM = Total mitigation time across all incidents / Number of incidents

Mitigation time begins at incident detection (the first alert) and ends when user impact is eliminated or reduced to an acceptable level.
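As a minimal sketch of the formula, assuming each incident is recorded as a pair of detection and mitigation timestamps (the function name and the sample data are illustrative, not taken from any particular tool):

```python
from datetime import datetime, timedelta

def mean_time_to_mitigation(incidents):
    """Return MTTM: total mitigation time divided by the number of incidents.

    `incidents` is a list of (detected_at, mitigated_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("MTTM is undefined for zero incidents")
    total = sum(
        (mitigated - detected for detected, mitigated in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)

# Hypothetical sample data: three incidents taking 5, 12, and 13 minutes to mitigate.
sample_incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 12)),
    (datetime(2024, 1, 20, 3, 30), datetime(2024, 1, 20, 3, 43)),
]

sample_mttm = mean_time_to_mitigation(sample_incidents)
print(sample_mttm)  # 0:10:00
```

The three sample incidents total 30 minutes of mitigation time, so the mean is 10 minutes.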

MTTM vs. MTTR

MTTM and MTTR measure different phases of the same incident, and those phases are sequential: mitigation comes first, resolution second.

Example:

T+0: Detection (monitoring alerts on a drop in search index freshness)
T+8 minutes: Mitigation (failover to the backup search index)
T+2 hours: Root cause discovery (a pipeline job holding a lock)
T+3 hours: Resolution (pipeline corrected, primary index restored)

Result: for this incident, time to mitigation is 8 minutes and time to resolution is 3 hours. MTTM indicates how effectively users are protected; MTTR reveals how quickly the underlying risk is eliminated.
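The arithmetic in this example can be sketched as follows, with each event stored as an offset from detection. The dictionary keys are illustrative names, not fields from a real incident tool.

```python
from datetime import timedelta

# Offsets from detection (T+0), taken from the example above.
timeline = {
    "detected": timedelta(0),
    "mitigated": timedelta(minutes=8),
    "root_cause_found": timedelta(hours=2),
    "resolved": timedelta(hours=3),
}

# Both intervals start at detection; they differ only in where they stop.
time_to_mitigate = timeline["mitigated"] - timeline["detected"]
time_to_resolve = timeline["resolved"] - timeline["detected"]

print(time_to_mitigate)  # 0:08:00
print(time_to_resolve)   # 3:00:00
```

Averaging these per-incident values across many incidents gives MTTM and MTTR respectively.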

Common Mitigation Strategies

Effective mitigation doesn't require understanding the root cause, but it does require reliable levers that can be pulled quickly:

Rollback: revert a recent deployment; the fastest option for deploy-correlated incidents.
Failover: switch traffic to a healthy replica, region, or backup system.
Traffic shifting: route traffic away from the affected component via load balancer rules or feature flags.
Scaling: add instances or increase resource limits for capacity issues.
Feature flags: disable a problematic feature without shutting down the whole service.
Restart: clear corrupted state and restore functionality.

Build and practice these capabilities ahead of time, with runbooks that put mitigation steps before root cause investigation.
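One way a runbook might encode these levers is as a small decision helper that maps observable signals to candidate mitigations. This is a sketch under stated assumptions: the signal names, the `IncidentSignals` class, and the ordering of the levers are hypothetical, and a real runbook would reflect a team's own systems.

```python
from dataclasses import dataclass

@dataclass
class IncidentSignals:
    """Hypothetical yes/no signals an on-call engineer can check quickly."""
    recent_deploy: bool = False
    feature_flag_guarded: bool = False
    healthy_replica_available: bool = False
    impact_isolated_to_subset: bool = False
    capacity_saturated: bool = False

def suggest_mitigations(signals):
    """Return candidate levers in the order this sketch would try them.

    None of the checks needs the root cause; each only asks whether a
    lever is available and plausibly relevant.
    """
    candidates = []
    if signals.recent_deploy:
        candidates.append("rollback")
    if signals.feature_flag_guarded:
        candidates.append("disable feature flag")
    if signals.healthy_replica_available:
        candidates.append("failover")
    if signals.impact_isolated_to_subset:
        candidates.append("shift traffic")
    if signals.capacity_saturated:
        candidates.append("scale out")
    # Restart is kept as the last resort in this sketch (an assumption).
    candidates.append("restart")
    return candidates

print(suggest_mitigations(IncidentSignals(recent_deploy=True, capacity_saturated=True)))
# ['rollback', 'scale out', 'restart']
```

Keeping the checks as simple booleans reflects the point above: the helper chooses a lever from what is observable during the incident, not from a diagnosis.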

Challenges with Tracking MTTM

Many teams record only a single "resolved" timestamp, and incident management platforms rarely capture mitigation as a distinct event. Without a separate "user impact mitigated" timestamp, MTTM cannot be calculated. Engineering culture sometimes treats mitigation as a lesser achievement than finding the root cause, so teams delay it while they investigate, which is exactly backward during an active incident. Teams also need measurable criteria for what counts as successful mitigation, ideally tied to SLO thresholds.

AI agents can help by correlating the current incident with similar past incidents and surfacing mitigations that worked before. For well-understood incident types, they can also execute the mitigation automatically.
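A minimal sketch of an incident record that keeps mitigation and resolution as separate timestamps, and stamps mitigation from a measurable criterion. The `Incident` class, the `record_error_rate` helper, and the 1% threshold are all hypothetical; a production check would more likely require several consecutive healthy samples before declaring mitigation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Incident:
    detected_at: datetime
    mitigated_at: Optional[datetime] = None  # user impact stopped
    resolved_at: Optional[datetime] = None   # underlying problem fixed

def record_error_rate(incident, at, error_rate, slo_threshold):
    """Stamp mitigated_at the first time the error rate is within the threshold.

    Later samples never overwrite the first mitigation timestamp.
    """
    if incident.mitigated_at is None and error_rate <= slo_threshold:
        incident.mitigated_at = at
    return incident

SLO_ERROR_RATE = 0.01  # assumed threshold: at most 1% of requests failing

incident = Incident(detected_at=datetime(2024, 3, 5, 9, 0))
record_error_rate(incident, datetime(2024, 3, 5, 9, 4), 0.20, SLO_ERROR_RATE)   # still impacted
record_error_rate(incident, datetime(2024, 3, 5, 9, 8), 0.005, SLO_ERROR_RATE)  # back within SLO
record_error_rate(incident, datetime(2024, 3, 5, 9, 12), 0.002, SLO_ERROR_RATE) # does not overwrite

print(incident.mitigated_at - incident.detected_at)  # 0:08:00
```

With `mitigated_at` stored separately, `resolved_at` can be filled in later without losing the information MTTM depends on.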

Key Takeaways

What to remember

1. MTTM measures the time from detection to the point where user impact stops; it is distinct from MTTR, which covers full resolution including the root cause.
2. The Google SRE Book's philosophy is to mitigate first and investigate second: stop the bleeding before diagnosing.
3. Track MTTM separately from MTTR in incident tooling; a single "resolved" timestamp hides critical information.
4. Build multiple mitigation capabilities (rollback, failover, scaling, feature flags) and practice them regularly.
5. AI tools can improve MTTM by matching incidents against past patterns and automating mitigations that have worked before.

FAQ

Frequently asked questions

What does MTTM stand for?

MTTM stands for Mean Time to Mitigation. It measures the average time from incident detection until user impact stops, regardless of whether the root cause has been identified. It is distinct from MTTR.

What's the difference between mitigation and resolution?

Mitigation stops user impact (e.g., 3-minute restart); resolution fixes underlying problems (e.g., 2-hour memory leak fix).

How is MTTM calculated?

Divide the total mitigation time across all incidents by the number of incidents. The timer starts at detection and stops when user-facing impact is eliminated or reduced to an acceptable level.

Why does the Google SRE Book prioritize mitigation over root cause?

Users care about having service restored, not about why it failed. Investigating the root cause while users are still experiencing an outage prolongs the impact.

What are common mitigation strategies?

Rollback, failover, traffic shifting, scaling, feature flags, and restart are the most common approaches.

Should I track MTTM separately from MTTR?

Yes. They reveal different things. A low MTTM with a high MTTR indicates that the team stops user impact quickly but is slow to find and fix root causes; a high MTTM suggests insufficient mitigation capabilities.

What's a realistic MTTM target?

Targets depend on severity: for example, 15 minutes for SEV1 and 30 minutes for SEV2, with longer times tolerated for SEV3. Meeting these targets requires preparing mitigation capabilities before incidents occur.
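A sketch of how such targets might be checked per incident, using the SEV1 and SEV2 figures above. The names `MTTM_TARGETS` and `meets_target` are illustrative; SEV3 is deliberately left without a number because none is given here.

```python
from datetime import timedelta

# Targets from the answer above; SEV3 has no stated figure, so it is omitted.
MTTM_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
}

def meets_target(severity, time_to_mitigate):
    """Return True/False against the target, or None if no target is defined."""
    target = MTTM_TARGETS.get(severity)
    if target is None:
        return None
    return time_to_mitigate <= target

print(meets_target("SEV1", timedelta(minutes=8)))   # True
print(meets_target("SEV2", timedelta(minutes=45)))  # False
print(meets_target("SEV3", timedelta(hours=2)))     # None
```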

What's the difference between MTTA and MTTM?

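MTTA (Mean Time to Acknowledge) measures how long it takes a responder to acknowledge an alert; MTTM measures how long it takes until user impact actually stops. Because acknowledgment happens before mitigation, the interval MTTA measures is contained within the interval MTTM measures.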
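To illustrate the relationship, the sketch below computes both metrics from the same incidents, assuming both clocks start at the alert. The field names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def mean_interval(pairs):
    """Average of (end - start) over an iterable of (start, end) datetime pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("mean is undefined for zero incidents")
    return sum((end - start for start, end in pairs), timedelta()) / len(pairs)

# Hypothetical incidents with three timestamps each.
alert_log = [
    {"alerted": datetime(2024, 5, 1, 9, 0),
     "acknowledged": datetime(2024, 5, 1, 9, 2),
     "mitigated": datetime(2024, 5, 1, 9, 10)},
    {"alerted": datetime(2024, 5, 9, 15, 0),
     "acknowledged": datetime(2024, 5, 9, 15, 4),
     "mitigated": datetime(2024, 5, 9, 15, 20)},
]

mtta = mean_interval((i["alerted"], i["acknowledged"]) for i in alert_log)
mttm_from_alert = mean_interval((i["alerted"], i["mitigated"]) for i in alert_log)

print(mtta)             # 0:03:00
print(mttm_from_alert)  # 0:15:00
```

Since every acknowledgment precedes its mitigation, the mean acknowledgment time cannot exceed the mean mitigation time on the same data.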
