
MTTR & Reliability Glossary

A comprehensive guide to the metrics, concepts, and terminology used in Site Reliability Engineering, incident management, and production operations.

AI Operations

AIOps

AIOps (Artificial Intelligence for IT Operations) applies machine learning and data analytics to IT operations tasks. These platforms ingest telemetry and use ML to automate alert correlation, anomaly detection, event grouping, and noise reduction, transforming hundreds of alerts into manageable, prioritized clusters.

Related: Alert Fatigue, AI SRE, Observability

AI SRE

An AI SRE is an autonomous system that analyzes telemetry across IT environments to identify and investigate issues without human intervention. It is distinguished from AIOps by reasoning depth: an AI SRE investigates incidents end-to-end, traces root cause, and proposes or executes remediation.

Related: AIOps, Autonomous IT Operations, On-call

Autonomous IT Operations

An operating model in which AI agents carry out routine operations work (detection, diagnosis, remediation) without step-by-step human instruction, inside defined policy guardrails. Humans set the guardrails, approve high-risk actions, and review outcomes.

Related: AI SRE, AIOps, Toil

Context Engineering

The discipline of dynamically assembling the right information for an AI agent at query time, rather than pre-indexing everything. Effective context engineering selects the relevant objects, tools, skills, and knowledge for each specific reasoning task.

Related: AI SRE, AIOps

Vibe Debugging

The use of AI to investigate and diagnose production issues by describing symptoms in natural language, rather than manually querying logs, metrics, and traces across multiple tools. The AI agent translates the description into targeted investigations across your observability stack.

Example: Describing "checkout is slow for users in the EU" to an AI agent that then queries metrics, logs, and traces to identify the cause
Related: AI SRE, Root Cause Analysis (RCA), Observability

Alert Fatigue

The desensitization that occurs when engineers are exposed to a high volume of alerts, most of which are non-actionable. The signal-to-noise ratio deteriorates until critical alerts are treated the same as background noise. According to NeuBird AI's 2026 research, 83% of organizations report their teams are ignoring alerts.

Related: Alert Noise, Pager Fatigue, On-call

2026 AI SRE Terminology Guide

A comprehensive practitioner reference defining the vocabulary of production operations in the agentic era: AI SRE, AIOps, autonomous operations, context engineering, production ops agents, and related architectural patterns.

Related: AI SRE, AIOps, Autonomous IT Operations

Key Metrics

MTTR (Mean Time to Recovery)

The average time it takes to recover from a failure or incident, measured from the moment the failure occurs until the system is fully restored to normal operation.

MTTR = Total Downtime / Number of Incidents
Related: MTTA, MTTD, MTTF
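
The MTTR formula can be sketched in a few lines of Python. The incident records below are invented for illustration, and the function name mean_time_to_recovery is a hypothetical helper, not part of any tool mentioned in this glossary. Each record is assumed to hold the time the failure began and the time the service was fully restored.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (failure started, service fully restored).
incidents = [
    (datetime(2026, 1, 3, 9, 0), datetime(2026, 1, 3, 9, 45)),
    (datetime(2026, 1, 12, 22, 10), datetime(2026, 1, 12, 23, 40)),
    (datetime(2026, 1, 20, 14, 0), datetime(2026, 1, 20, 14, 15)),
]

def mean_time_to_recovery(records):
    """Return total downtime divided by the number of incidents."""
    if not records:
        raise ValueError("at least one incident is required")
    total_downtime = sum(
        (restored - started for started, restored in records), timedelta()
    )
    return total_downtime / len(records)

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:50:00 (150 minutes of downtime across 3 incidents)
```

Because the result is a plain average, one long outage can dominate it; reporting the median alongside the mean is a common complement.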

MTTA (Mean Time to Acknowledge)

The average time between when an alert is triggered and when an engineer acknowledges they are working on the issue. A key metric for measuring on-call responsiveness.

MTTA = Total Time to Acknowledge / Number of Alerts
Related: MTTR, MTTD

MTTD (Mean Time to Detect)

The average time it takes to detect that an incident or failure has occurred. Lower MTTD indicates better monitoring and alerting systems.

MTTD = Total Detection Time / Number of Incidents
Related: MTTR, MTTA

MTTF (Mean Time to Failure)

The average time a system or component operates before experiencing a failure. Used primarily for non-repairable systems or to measure reliability between failures.

MTTF = Total Operating Time / Number of Failures
Related: MTBF, MTTR

MTBF (Mean Time Between Failures)

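The average time a repairable system operates between one failure and the next. With the formula below, repair time is excluded; some definitions instead measure from failure to failure, which adds repair time (MTBF = MTTF + MTTR). A key reliability metric for repairable systems.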

MTBF = Total Operating Time / Number of Failures
Related: MTTF, MTTR

MTTM (Mean Time to Mitigation)

Measures how long it takes to stop the user impact of an incident, regardless of whether the root cause has been identified. Distinct from MTTR: mitigation stops the bleeding (e.g. rollback, failover) while resolution fixes the underlying cause.

MTTM = Sum of (Time of Mitigation - Time of Detection) / Number of Incidents
Related: MTTR, MTTD, Incident
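
To show how the time-based metrics relate to one another, here is a minimal sketch that derives MTTD, MTTA, MTTM, and MTTR from the same incident timelines. The Timeline class and its values are hypothetical, and the sketch assumes the alert fires at the moment of detection, which will not hold in every setup.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Timeline:
    # Minutes since the failure began; all values are made up for illustration.
    detected: float
    acknowledged: float
    mitigated: float
    restored: float

timelines = [
    Timeline(detected=4, acknowledged=7, mitigated=24, restored=90),
    Timeline(detected=10, acknowledged=12, mitigated=40, restored=60),
]

# Each metric is a mean over incidents of one interval on the timeline.
mttd = mean(t.detected for t in timelines)                   # failure -> detection
mtta = mean(t.acknowledged - t.detected for t in timelines)  # alert -> acknowledgement
mttm = mean(t.mitigated - t.detected for t in timelines)     # detection -> mitigation
mttr = mean(t.restored for t in timelines)                   # failure -> fully restored

print(mttd, mtta, mttm, mttr)
```

In this made-up data, user impact stops well before full restoration, which is the gap the MTTM entry describes.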

DORA Metrics

Four key indicators measuring software delivery performance and operational reliability, based on research from thousands of engineering organizations. The four metrics are: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR).

Example: Elite performers deploy multiple times per day with MTTR under one hour
Related: MTTR, Change Management

Service Levels

SLI (Service Level Indicator)

A quantitative measure of some aspect of the level of service being provided. Common SLIs include latency, throughput, availability, and error rate.

Example: 99.5% of requests completed in under 200ms
Related: SLO, SLA
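
A latency SLI like the one in the example can be computed as the fraction of requests that finish under a threshold. The sketch below uses invented latencies and a hypothetical helper named latency_sli; a real system would read these values from its metrics store.

```python
# Hypothetical request latencies in milliseconds.
latencies_ms = [120, 95, 180, 210, 150, 199, 450, 130, 175, 160]
THRESHOLD_MS = 200  # assumed latency threshold, matching the example above

def latency_sli(latencies, threshold_ms):
    """Fraction of requests that completed in under threshold_ms."""
    if not latencies:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies if ms < threshold_ms)
    return good / len(latencies)

sli = latency_sli(latencies_ms, THRESHOLD_MS)
print(f"{sli:.1%} of requests completed in under {THRESHOLD_MS}ms")
```

With this sample, 8 of 10 requests are under the threshold, so the SLI is 80%.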

SLO (Service Level Objective)

A target value or range for a service level measured by an SLI. SLOs are internal goals that teams set for their services.

Example: Target: 99.9% availability per month
Related: SLI, SLA, Error Budget

SLA (Service Level Agreement)

A contract between a service provider and customer that defines the expected level of service, including consequences for missing targets.

Example: 99.95% uptime guaranteed, with service credits for violations
Related: SLI, SLO

Error Budget

The maximum amount of time or percentage of requests that a service can fail while still meeting its SLO. Calculated as 100% minus the SLO target.

Example: With a 99.9% SLO, the error budget is 0.1% (about 43 minutes per month)
Related: SLO, SLI
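
The arithmetic behind the 43-minute figure can be checked with a short sketch. It assumes a time-based availability SLO and a 30-day month; the function name error_budget_minutes is hypothetical.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    budget_fraction = (100 - slo_percent) / 100  # e.g. 99.9% -> 0.001
    return budget_fraction * window_days * 24 * 60

budget = error_budget_minutes(99.9)
print(round(budget, 1))  # 43.2
```

A longer window or a stricter target changes the budget proportionally: under the same assumptions, 99.99% over 30 days leaves roughly 4.3 minutes.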

Reliability

Availability

The percentage of time a system is operational and accessible. Often expressed in "nines" (e.g., five nines = 99.999%).

Availability = Uptime / (Uptime + Downtime)
Related: Uptime, Nines

Nines (of Availability)

A shorthand way to express availability percentages. Each "nine" represents another 9 in the percentage.

Example: Two nines (99%) = 3.65 days downtime/year. Five nines (99.999%) = 5.26 minutes downtime/year
Related: Availability, SLO
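
The downtime figures in the example follow directly from the number of nines. This sketch assumes a 365-day year and uses a hypothetical helper, downtime_minutes_per_year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years

def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR

for n in (2, 3, 4, 5):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: {minutes:.2f} minutes/year ({minutes / 1440:.4f} days)")
```

Each additional nine cuts the allowed downtime by a factor of ten, from 5,256 minutes (3.65 days) at two nines to about 5.26 minutes at five.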

Uptime

The total time a system or service is operational and available to users. The complement of downtime: over any period, uptime plus downtime equals the total elapsed time.

Related: Downtime, Availability

Downtime

Any period when a system or service is unavailable or not functioning correctly. Includes both planned (maintenance) and unplanned (incidents) periods.

Related: Uptime, Availability, MTTR

Incident Management

Incident

An unplanned interruption to a service or reduction in the quality of a service. Incidents are events that require immediate attention and response.

Related: Severity, Postmortem

Severity

A classification of incident impact, typically on a scale (e.g., SEV1-SEV5). Higher severity indicates greater business impact and urgency; on most scales the lowest number (SEV1) denotes the highest severity.

Example: SEV1: Complete service outage affecting all users. SEV3: Degraded performance affecting some users.
Related: Incident, Escalation

Escalation

The process of involving additional resources, expertise, or management when an incident cannot be resolved at the current level or requires more authority.

Related: On-call, Incident

On-call

A rotation system where engineers are designated to respond to incidents outside normal working hours. On-call engineers are the first responders to production issues.

Related: Escalation, MTTA, Pager Fatigue

Postmortem

A blameless analysis conducted after an incident to understand what happened, why it happened, and how to prevent similar incidents in the future. Also called a retrospective or incident review.

Related: Root Cause Analysis, Incident

Root Cause Analysis (RCA)

A systematic process for identifying the underlying causes of an incident or problem, rather than just addressing symptoms.

Related: Postmortem, Five Whys

Runbook

A documented set of procedures for handling specific types of incidents or operational tasks. Runbooks enable consistent, repeatable responses to known issues.

Related: Playbook, Automation

Automated Incident Response

The use of software systems to detect, investigate, and resolve production incidents with minimal or no human intervention. Ranges from automated runbook execution to fully autonomous AI agents that handle the complete incident lifecycle.

Related: Runbook, Root Cause Analysis (RCA), AI SRE

Proactive Incident Management

The practice of identifying and addressing the conditions that cause incidents before they impact users, rather than waiting for failures and responding. Involves continuous telemetry analysis, anomaly detection, and risk surfacing before thresholds are breached.

Related: Observability, Alert Fatigue, SLO (Service Level Objective)

SRE Concepts

Toil

Manual, repetitive, automatable work that scales linearly with service growth. Reducing toil is a core SRE objective to free up time for engineering work.

Example: Manually restarting services, processing tickets, running reports
Related: Automation, SRE

Chaos Engineering

The practice of intentionally introducing controlled failures into a system to test its resilience and identify weaknesses before they cause real incidents.

Example: Netflix Chaos Monkey randomly terminates instances in production
Related: Resilience, Fault Injection

Observability

The ability to understand the internal state of a system by examining its external outputs (logs, metrics, traces). A system is observable if you can understand why it behaves the way it does.

Related: MELT, Monitoring

MELT (Metrics, Events, Logs, Traces)

The four pillars of observability data that together provide a complete picture of system behavior and health.

Related: Observability, Telemetry

Golden Signals

Four key metrics recommended by Google SRE for monitoring user-facing systems: Latency, Traffic, Errors, and Saturation.

Related: SLI, Monitoring

Blast Radius

The scope or extent of impact when a failure occurs. Reducing blast radius through isolation and redundancy is a key reliability strategy.

Related: Fault Isolation, Resilience

Circuit Breaker

A design pattern that prevents cascading failures by stopping requests to a failing service after a threshold of failures is reached, allowing the service time to recover.

Related: Resilience, Fault Tolerance
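
The pattern can be illustrated with a minimal, single-threaded sketch. The CircuitBreaker class, its parameter names, and the default thresholds are assumptions chosen for illustration, not the API of any particular library. After the recovery timeout elapses, the sketch lets one trial call through (often called the half-open state) and closes the circuit only if that call succeeds.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; request not sent")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example breaker.call(fetch_profile, user_id) with a hypothetical fetch_profile function, and treat CircuitOpenError as a signal to fail fast or serve a fallback.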

Graceful Degradation

A design approach where a system continues to operate with reduced functionality when some components fail, rather than failing completely.

Example: Showing cached content when the database is unavailable
Related: Resilience, Fault Tolerance
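
The cached-content example can be sketched as a fallback path around a failing dependency. Everything here is hypothetical: the in-memory cache, the DatabaseUnavailable exception, and a fetch_from_database stub that always fails in order to simulate an outage.

```python
cache = {"homepage": "cached homepage content"}  # hypothetical last-known-good copy

class DatabaseUnavailable(Exception):
    """Raised by the data layer when the database cannot be reached."""

def fetch_from_database(key):
    # Stand-in for a real query; it always fails here to simulate an outage.
    raise DatabaseUnavailable("database is unreachable")

def get_content(key, fetch=fetch_from_database):
    """Return fresh content when possible, cached content when the database is down."""
    try:
        content = fetch(key)
        cache[key] = content  # refresh the fallback copy on success
        return content, "fresh"
    except DatabaseUnavailable:
        if key in cache:
            return cache[key], "stale"
        raise  # nothing to fall back to, so the failure surfaces

print(get_content("homepage"))  # ('cached homepage content', 'stale')
```

Returning a freshness marker alongside the content lets the caller decide whether to show a "data may be out of date" notice.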

Production Operations

Change Management

The process of controlling changes to production systems to minimize risk and ensure stability. Includes change review, approval, and rollback procedures.

Related: Deployment, Rollback

Rollback

The process of reverting a system to a previous known-good state, typically after a failed deployment or to mitigate an incident.

Related: Deployment, Change Management

Canary Deployment

A deployment strategy where changes are rolled out to a small subset of users or servers first, allowing issues to be detected before full rollout.

Related: Deployment, Blue-Green Deployment
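
One common way to pick the "small subset of users" is to hash a stable identifier into a bucket, so that each user consistently sees the same version. The sketch below is one possible approach under that assumption, not a description of any specific deployment tool; the function name routes_to_canary and the user ID format are hypothetical.

```python
import hashlib

def routes_to_canary(user_id, canary_percent):
    """Deterministically assign a user to the canary based on a hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent

# Roughly 5% of a hypothetical user population lands on the canary.
users = [f"user-{i}" for i in range(10_000)]
canary_users = [u for u in users if routes_to_canary(u, 5)]
print(len(canary_users))
```

Because the bucket depends only on the user ID, raising canary_percent from 5 to 25 keeps the original canary users on the new version and adds more, rather than reshuffling everyone.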

Blue-Green Deployment

A deployment strategy using two identical production environments. Traffic is switched from the current (blue) environment to the new (green) environment after validation.

Related: Canary Deployment, Rollback

Pager Fatigue

The exhaustion and reduced effectiveness that results from too many alerts or on-call incidents. A leading cause of burnout in operations teams.

Related: On-call, Alert Noise

Alert Noise

Non-actionable or low-value alerts that distract from real issues and contribute to pager fatigue. Reducing noise improves incident response effectiveness.

Related: Pager Fatigue, Alert Correlation

Day 2 Operations

Everything that happens to a production system after it is first deployed: patching, scaling, tuning, debugging, upgrading, monitoring, and securing. Day 2 operations account for approximately 90% of total system lifecycle costs.

Related: Toil, Runbook, Change Management

Production Readiness

The practice of verifying that a service meets a defined set of reliability, observability, and operational standards before it is deployed to production. Typically assessed via a production readiness review (PRR) checklist.

Example: Checklist items: runbooks documented, alerts configured, load tested, rollback procedure validated
Related: SLO (Service Level Objective), Observability, Runbook

