MTTR & Reliability Glossary
A comprehensive guide to the metrics, concepts, and terminology used in Site Reliability Engineering, incident management, and production operations.
AI Operations
AIOps
AIOps (Artificial Intelligence for IT Operations) applies machine learning and data analytics to IT operations tasks. These platforms ingest telemetry and use ML to automate alert correlation, anomaly detection, event grouping, and noise reduction, transforming hundreds of alerts into manageable, prioritized clusters.
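As a rough illustration of event grouping, the sketch below clusters alerts that share a service and fall in the same fixed time window. It is a deliberately simple rule-based stand-in, not the ML correlation an AIOps platform would use; the tuple layout and the `group_alerts` name are assumptions made for this example.

```python
from collections import defaultdict

def group_alerts(alerts, window_seconds=300):
    """Group alerts that share a service and fire within the same time window.

    `alerts` is a list of (timestamp_seconds, service, message) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    clusters = defaultdict(list)
    for timestamp, service, message in alerts:
        # Alerts in the same fixed-size window get the same window index.
        window = int(timestamp // window_seconds)
        clusters[(service, window)].append(message)
    return dict(clusters)

alerts = [
    (10, "db", "high latency"),
    (20, "db", "connection pool exhausted"),
    (400, "db", "replica lag"),
    (15, "api", "5xx spike"),
]
clusters = group_alerts(alerts)
```

Here four raw alerts collapse into three clusters, because the two early `db` alerts land in the same five-minute window.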
AI SRE
An AI SRE is an autonomous system that analyzes telemetry across IT environments to identify and investigate issues without human intervention. It differs from AIOps in reasoning depth: an AI SRE investigates incidents end-to-end, traces root cause, and proposes or executes remediation.
Autonomous IT Operations
An operating model in which AI agents carry out routine operations work (detection, diagnosis, remediation) without step-by-step human instruction, inside defined policy guardrails. Humans set the guardrails, approve high-risk actions, and review outcomes.
Context Engineering
The discipline of dynamically assembling the right information for an AI agent at query time, rather than pre-indexing everything. Effective context engineering selects the relevant objects, tools, skills, and knowledge for each specific reasoning task.
Vibe Debugging
The use of AI to investigate and diagnose production issues by describing symptoms in natural language, rather than manually querying logs, metrics, and traces across multiple tools. The AI agent translates the description into targeted investigations across your observability stack.
Alert Fatigue
The desensitization that occurs when engineers are exposed to a high volume of alerts, most of which are non-actionable. Signal-to-noise ratio deteriorates until critical alerts are treated identically to background noise. According to NeuBird AI's 2026 research, 83% of organizations report their teams are ignoring alerts.
2026 AI SRE Terminology Guide
A comprehensive practitioner reference defining the vocabulary of production operations in the agentic era: AI SRE, AIOps, autonomous operations, context engineering, production ops agents, and related architectural patterns.
Key Metrics
MTTR (Mean Time to Recovery)
The average time it takes to recover from a failure or incident, measured from the moment the failure occurs until the system is fully restored to normal operation.
MTTA (Mean Time to Acknowledge)
The average time between when an alert is triggered and when an engineer acknowledges they are working on the issue. A key metric for measuring on-call responsiveness.
MTTD (Mean Time to Detect)
The average time it takes to detect that an incident or failure has occurred. Lower MTTD indicates better monitoring and alerting systems.
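The three timing metrics above can be computed from incident timestamps. The following is a minimal sketch, assuming each incident record carries four timestamps; the field names and sample data are hypothetical, and the detection time is used here as the moment the alert fired.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative assumptions.
incidents = [
    {
        "started": datetime(2026, 1, 5, 10, 0),
        "detected": datetime(2026, 1, 5, 10, 4),
        "acknowledged": datetime(2026, 1, 5, 10, 9),
        "recovered": datetime(2026, 1, 5, 11, 0),
    },
    {
        "started": datetime(2026, 1, 9, 22, 30),
        "detected": datetime(2026, 1, 9, 22, 32),
        "acknowledged": datetime(2026, 1, 9, 22, 35),
        "recovered": datetime(2026, 1, 9, 22, 50),
    },
]

def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )

mttd = mean_minutes(incidents, "started", "detected")       # failure -> detection
mtta = mean_minutes(incidents, "detected", "acknowledged")  # alert -> acknowledgement
mttr = mean_minutes(incidents, "started", "recovered")      # failure -> recovery
```

With this sample data, MTTD is 3 minutes, MTTA is 4 minutes, and MTTR is 40 minutes.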
MTTF (Mean Time to Failure)
The average time a system or component operates before experiencing a failure. Used primarily for non-repairable systems or to measure reliability between failures.
MTBF (Mean Time Between Failures)
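The average time between consecutive failures of a repairable system. It is commonly computed as total operating time divided by the number of failures; some definitions also add repair time, giving MTBF = MTTF + MTTR. State which convention you use when reporting it.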
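Conventions for MTBF vary, so the sketch below reports both the operating-time-only figure and the figure that adds repair time, along with the steady-state availability they imply. The function name and its inputs are assumptions made for this example.

```python
def reliability_summary(total_uptime_hours, total_repair_hours, failures):
    """Return MTBF under two conventions, plus steady-state availability.

    All three inputs are hypothetical totals over one observation window.
    """
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute a mean")
    mean_uptime = total_uptime_hours / failures   # operating time per failure
    mean_repair = total_repair_hours / failures   # repair time per failure
    return {
        "mtbf_operating_only": mean_uptime,
        "mtbf_including_repair": mean_uptime + mean_repair,
        "availability": mean_uptime / (mean_uptime + mean_repair),
    }

summary = reliability_summary(
    total_uptime_hours=990, total_repair_hours=10, failures=5
)
```

For 990 hours of uptime, 10 hours of repair, and 5 failures, the two MTBF figures are 198 and 200 hours, and availability is 99%.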
MTTM (Mean Time to Mitigation)
Measures how long it takes to stop the user impact of an incident, regardless of whether the root cause has been identified. Distinct from MTTR when the R is read as resolution: mitigation stops the bleeding (e.g., rollback, failover), while resolution fixes the underlying cause.
DORA Metrics
Four key indicators measuring software delivery performance and operational reliability, based on research from thousands of engineering organizations. The four metrics are: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR).
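Two of these metrics reduce to simple ratios over a deployment log. The sketch below assumes each deployment is a record with a boolean `caused_failure` field; that field and both function names are hypothetical.

```python
def change_failure_rate(deployments):
    """Share of deployments that caused a failure in production."""
    if not deployments:
        raise ValueError("no deployments recorded")
    failed = sum(1 for d in deployments if d["caused_failure"])
    return failed / len(deployments)

def deployment_frequency(deployments, window_days):
    """Average number of deployments per day over the observation window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deployments) / window_days

# Hypothetical log: 20 deployments in 10 days, 3 of which caused a failure.
deployments = [{"caused_failure": i < 3} for i in range(20)]
cfr = change_failure_rate(deployments)
freq = deployment_frequency(deployments, window_days=10)
```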
Service Levels
SLI (Service Level Indicator)
A quantitative measure of some aspect of the level of service being provided. Common SLIs include latency, throughput, availability, and error rate.
SLO (Service Level Objective)
A target value or range for a service level measured by an SLI. SLOs are internal goals that teams set for their services.
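To make the SLI and SLO relationship concrete, here is a minimal sketch of a request-based availability SLI checked against an SLO target. Treating every non-5xx response as "good" is an assumption for this example; teams define good events differently.

```python
def availability_sli(status_codes):
    """Fraction of requests that did not return a 5xx server error."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)

def meets_slo(sli_value, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli_value >= slo_target

# Hypothetical sample: 1,000 requests, two of which are server errors.
codes = [200] * 997 + [500, 503, 404]
sli = availability_sli(codes)
```

In this sample the SLI is 0.998, which satisfies a 99.5% SLO but misses a 99.9% one.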
SLA (Service Level Agreement)
A contract between a service provider and customer that defines the expected level of service, including consequences for missing targets.
Error Budget
The maximum amount of time or percentage of requests that a service can fail while still meeting its SLO. Calculated as 100% minus the SLO target; for example, a 99.9% SLO leaves a 0.1% error budget.
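The calculation can be sketched as follows, for both a time-based and a request-based budget. The function names and the 30-day default window are assumptions made for this example.

```python
def error_budget(slo_target, window_days=30):
    """Return the error budget as a fraction and as allowed downtime minutes."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    budget_fraction = 1 - slo_target
    window_minutes = window_days * 24 * 60
    return budget_fraction, budget_fraction * window_minutes

def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures

fraction, minutes = error_budget(0.999)
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. With 250 failures out of one million requests, about three quarters of the request budget is left.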
Reliability
Availability
The percentage of time a system is operational and accessible. Often expressed in "nines" (e.g., five nines = 99.999%).
Nines (of Availability)
A shorthand way to express availability percentages by counting the 9s: two nines is 99%, three nines is 99.9%, and so on. Each additional nine cuts the allowed downtime by a factor of ten.
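The relationship between nines and allowed downtime is simple arithmetic. A minimal sketch, assuming a 365-day year (both function names are illustrative):

```python
def availability_from_nines(nines):
    """Convert a count of nines to an availability fraction (3 -> 0.999)."""
    return 1 - 10 ** (-nines)

def downtime_minutes_per_year(nines, days_per_year=365):
    """Maximum downtime per year, in minutes, permitted by that many nines."""
    return 10 ** (-nines) * days_per_year * 24 * 60

for n in range(2, 6):
    print(
        f"{n} nines: {availability_from_nines(n):.5%} available, "
        f"{downtime_minutes_per_year(n):.2f} min/year of downtime"
    )
```

Five nines, for instance, leaves a little over five minutes of downtime per year.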
Uptime
The total time a system or service is operational and available to users. The complement of downtime: over any period, uptime plus downtime equals the total elapsed time.
Downtime
Any period when a system or service is unavailable or not functioning correctly. Includes both planned (maintenance) and unplanned (incidents) periods.
Incident Management
Incident
An unplanned interruption to a service or reduction in the quality of a service. Incidents are events that require immediate attention and response.
Severity
A classification of incident impact, typically on a numbered scale (e.g., SEV1-SEV5). By common convention the lowest number is the most severe, so SEV1 indicates the greatest business impact and urgency.
Escalation
The process of involving additional resources, expertise, or management when an incident cannot be resolved at the current level or requires more authority.
On-call
A rotation system where engineers are designated to respond to incidents, including outside normal working hours. On-call engineers are the first responders to production issues.
Postmortem
A blameless analysis conducted after an incident to understand what happened, why it happened, and how to prevent similar incidents in the future. Also called a retrospective or incident review.
Root Cause Analysis (RCA)
A systematic process for identifying the underlying causes of an incident or problem, rather than just addressing symptoms.
Runbook
A documented set of procedures for handling specific types of incidents or operational tasks. Runbooks enable consistent, repeatable responses to known issues.
Automated Incident Response
The use of software systems to detect, investigate, and resolve production incidents with minimal or no human intervention. Ranges from automated runbook execution to fully autonomous AI agents that handle the complete incident lifecycle.
Proactive Incident Management
The practice of identifying and addressing the conditions that cause incidents before they impact users, rather than waiting for failures and responding. Involves continuous telemetry analysis, anomaly detection, and risk surfacing before thresholds are breached.
SRE Concepts
Toil
Manual, repetitive, automatable work that scales linearly with service growth. Reducing toil is a core SRE objective to free up time for engineering work.
Chaos Engineering
The practice of intentionally introducing controlled failures into a system to test its resilience and identify weaknesses before they cause real incidents.
Observability
The ability to understand the internal state of a system by examining its external outputs (logs, metrics, traces). A system is observable if you can understand why it behaves the way it does.
MELT (Metrics, Events, Logs, Traces)
The four pillars of observability data that together provide a complete picture of system behavior and health.
Golden Signals
Four key metrics recommended by Google SRE for monitoring user-facing systems: Latency, Traffic, Errors, and Saturation.
Blast Radius
The scope or extent of impact when a failure occurs. Reducing blast radius through isolation and redundancy is a key reliability strategy.
Circuit Breaker
A design pattern that prevents cascading failures by stopping requests to a failing service after a threshold of failures is reached, allowing the service time to recover.
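The pattern can be sketched as a small wrapper that counts consecutive failures, rejects calls while the circuit is open, and lets one trial call through after a timeout. This is an illustrative, single-threaded sketch; the class, its parameters, and the default values are assumptions, not a specific library's API.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; request not sent")
            # Timeout elapsed: allow one trial call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the downstream service, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function, and treat `CircuitOpenError` as a signal to fail fast or fall back.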
Graceful Degradation
A design approach where a system continues to operate with reduced functionality when some components fail, rather than failing completely.
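A common form of graceful degradation is a fallback: if an optional dependency fails, serve a simpler default instead of an error. The sketch below assumes a hypothetical recommendations feature; the function name, the `fetch_personalized` callable, and the fallback items are illustrative only.

```python
def get_recommendations(user_id, fetch_personalized, fallback=("bestsellers",)):
    """Return personalized items, or a static fallback if the dependency fails."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except Exception:
        # The page still renders, with generic content and a degraded flag
        # that callers can log or surface in monitoring.
        return {"items": list(fallback), "degraded": True}
```

Returning an explicit `degraded` flag keeps the reduced functionality visible to operators rather than hiding the failure.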
Production Operations
Change Management
The process of controlling changes to production systems to minimize risk and ensure stability. Includes change review, approval, and rollback procedures.
Rollback
The process of reverting a system to a previous known-good state, typically after a failed deployment or to mitigate an incident.
Canary Deployment
A deployment strategy where changes are rolled out to a small subset of users or servers first, allowing issues to be detected before full rollout.
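One way to pick the "small subset of users" is deterministic hash bucketing, so that a given user consistently sees the same version as the rollout percentage grows. This is a sketch of one possible approach, not the only way to route canary traffic; the function name and bucket scheme are assumptions.

```python
import hashlib

def in_canary(user_id, canary_percent):
    """Deterministically assign a user to the canary based on a hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < canary_percent
```

Because the bucket depends only on the user ID, raising `canary_percent` from 5 to 50 keeps every user who was already in the canary and adds new ones.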
Blue-Green Deployment
A deployment strategy using two identical production environments. Traffic is switched from the current (blue) environment to the new (green) environment after validation.
Pager Fatigue
The exhaustion and reduced effectiveness that results from too many alerts or on-call incidents. A leading cause of burnout in operations teams.
Alert Noise
Non-actionable or low-value alerts that distract from real issues and contribute to pager fatigue. Reducing noise improves incident response effectiveness.
Day 2 Operations
Everything that happens to a production system after it is first deployed: patching, scaling, tuning, debugging, upgrading, monitoring, and securing. Day 2 operations are often cited as accounting for roughly 90% of total system lifecycle costs.
Production Readiness
The practice of verifying that a service meets a defined set of reliability, observability, and operational standards before it is deployed to production. Typically assessed via a production readiness review (PRR) checklist.