What are SLOs, SLAs, and SLIs
Definition
Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) form the reliability measurement framework that underpins modern site reliability engineering. An SLI is a quantitative measurement of some aspect of a service’s behavior: the raw metric that tells you how the service is performing from the user’s perspective. An SLO is a target value for an SLI: the line that separates “good enough” from “not good enough.” An SLA is a contractual commitment to customers about service performance, with consequences (usually financial) for missing it. SLAs are business documents, not engineering metrics.
Your users don’t care about your uptime percentage. They care that the checkout page loads, that their payment goes through, and that their order confirmation shows up. SLOs, SLAs, and SLIs are the framework for translating “users are happy” into measurable targets that engineering teams can track, alert on, and make decisions against.
These three acronyms were formalized by Google’s SRE team and have become standard practice across the industry. Understanding how they relate to each other is essential for any team responsible for keeping production systems running.
Service Level Indicators (SLIs)
An SLI is a quantitative measurement of some aspect of the service’s behavior. It’s the raw metric that tells you how your service is performing from the user’s perspective.
Good SLIs measure things users actually experience:
- Availability: The proportion of requests that succeed (e.g., 99.95% of requests return a non-error response)
- Latency: How long requests take to complete (e.g., 95th percentile response time is under 200ms)
- Throughput: How many requests the system handles per unit of time
- Error rate: The proportion of requests that fail
- Correctness: The proportion of requests that return the right answer (relevant for data processing pipelines)
The key principle is that SLIs should reflect the user’s experience, not internal system metrics. CPU usage is not an SLI; it might correlate with latency, but latency is what users feel. The Google SRE Book emphasizes choosing SLIs that map to user happiness.
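To make this concrete, here is a minimal Python sketch of the two most common SLIs, computed from server-side request records (the `Request` record and its fields are hypothetical stand-ins for whatever your logs actually contain):

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status_code: int

def availability_sli(requests: list[Request]) -> float:
    """Proportion of requests that returned a non-5xx response."""
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)

def p95_latency_sli(requests: list[Request]) -> float:
    """95th-percentile request latency (nearest-rank method)."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.95 * len(latencies))  # 1-indexed nearest rank
    return latencies[rank - 1]
```

Both functions count what the user saw (a response, and how quickly), not what the server was doing internally.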
How to Measure SLIs
SLIs are typically measured at the boundary between your system and its users:
- Server-side: Measured from load balancer or application logs. Fast to implement, but doesn’t capture client-side rendering time or network latency.
- Client-side: Measured from the user’s browser or mobile app. More accurate reflection of user experience, but harder to collect and noisier.
- Synthetic monitoring: Automated tests that simulate user interactions from external locations. Good for catching availability issues, less useful for latency percentiles.
Most teams start with server-side SLIs and add client-side measurement as their observability practice matures.
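As a rough illustration of the synthetic approach, a probe can be as simple as a timed request using only the Python standard library (the URL is a hypothetical health-check endpoint; real probes run on a schedule from multiple external locations):

```python
import time
import urllib.error
import urllib.request

def synthetic_probe(url: str, timeout_s: float = 5.0) -> tuple[bool, float]:
    """Issue one request and report (success, latency in ms)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = resp.status < 500    # 2xx/3xx responses count as success
    except urllib.error.HTTPError as e:
        ok = e.code < 500             # 4xx still means the service answered
    except OSError:
        ok = False                    # DNS failure, refused connection, timeout
    return ok, (time.monotonic() - start) * 1000
```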
Service Level Objectives (SLOs)
An SLO is a target value for an SLI. It’s the line you draw that separates “good enough” from “not good enough.”
Examples:
- “99.9% of requests will return successfully” (availability SLO)
- “95th percentile latency will be under 300ms” (latency SLO)
- “99.99% of data processing jobs will complete within 1 hour” (latency SLO for a batch pipeline)
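Expressed as data, an SLO is just a named target that an SLI measurement either meets or misses. A minimal sketch (the names are illustrative, not from any particular SLO library):

```python
from dataclasses import dataclass

@dataclass
class AvailabilitySLO:
    name: str
    target: float  # e.g., 0.999 for "99.9% of requests succeed"

def is_meeting(slo: AvailabilitySLO, good: int, total: int) -> bool:
    """Compare the measured SLI (good/total) against the SLO target."""
    return good / total >= slo.target

checkout = AvailabilitySLO("checkout-availability", target=0.999)
print(is_meeting(checkout, good=99_950, total=100_000))  # True: 99.95% >= 99.9%
```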
Setting the Right SLO
The most common mistake is setting SLOs too aggressively. A 99.99% availability target sounds impressive, but it means you can only afford 4.3 minutes of downtime per month. Every deployment, every maintenance window, every dependency hiccup eats into that budget.
Consider these tradeoffs when setting SLOs:
Higher SLOs mean less room for change. A 99.99% availability target means any deployment that causes even a brief error spike is a significant hit to your budget. This slows down deployment frequency, which the DORA research shows is a key indicator of engineering performance.
SLOs should match user expectations. An internal admin tool used by 10 people doesn’t need the same availability as a customer-facing checkout flow. Set SLOs based on the actual impact of the service being unavailable.
SLOs should be achievable. Setting an SLO you consistently miss is worse than having no SLO at all. It trains the team to ignore the metric. Start conservatively and tighten over time as you build confidence.
Error Budgets
The error budget is the difference between 100% and your SLO target. If your SLO is 99.9% availability, your error budget is 0.1%, which translates to about 43 minutes of allowed downtime per month.
Error budgets are one of the most powerful concepts in SRE. They turn reliability from an abstract goal into a concrete, spendable resource:
- If your error budget is healthy, ship features faster, deploy more aggressively, run experiments
- If your error budget is depleted, slow down deployments, focus on reliability work, investigate what’s consuming the budget
This reframes the relationship between feature development and reliability. Instead of “we can’t deploy because we might break things,” it becomes “we have budget remaining, so the risk of deploying is acceptable” or “we’ve spent our budget, so we need to focus on stability before shipping more.”
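The arithmetic behind the budget is simple enough to sketch (assuming a 30-day window):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, implied by an availability SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

print(error_budget_minutes(0.999))      # 43.2 minutes per 30 days
print(budget_remaining(0.999, 10.0))    # ~0.77: about 77% of the budget left
```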
Service Level Agreements (SLAs)
An SLA is a contractual commitment to customers about service performance, with consequences (usually financial) for not meeting it. SLAs are business documents, not engineering metrics.
Key differences from SLOs:
| Aspect | SLO | SLA |
|---|---|---|
| Audience | Internal engineering team | External customers |
| Consequence of violation | Team focuses on reliability work | Financial credits, contract penalties |
| Typical target | 99.9% or higher | Lower than SLO (e.g., 99.5% if SLO is 99.9%) |
| Who sets it | Engineering/SRE team | Business/legal team |
The most important rule: your SLA should always be less strict than your SLO. If your internal target is 99.9% and your contractual commitment is also 99.9%, any SLO violation immediately becomes an SLA violation with financial consequences. A buffer (e.g., SLO at 99.9%, SLA at 99.5%) gives the engineering team room to address reliability issues before they become contractual problems.
How SLOs, SLAs, and SLIs Work Together
The three concepts form a hierarchy:
- SLIs are the measurements (latency is 150ms at P95)
- SLOs are the targets (P95 latency should be under 300ms)
- SLAs are the promises (we guarantee P95 latency under 500ms, or we issue credits)
In practice:
- Engineers instrument services to capture SLIs
- SRE teams set SLOs based on user expectations and operational capacity
- Business teams negotiate SLAs with customers based on SLOs (with a safety margin)
- Error budgets drive decision-making about deployment velocity vs. reliability investment
- Alerting triggers when SLI measurements approach SLO thresholds (burn-rate alerting)
SLO-Based Alerting
Traditional alerting fires when a metric crosses a static threshold (error rate > 1%). SLO-based alerting fires when the rate of SLO violation threatens to exhaust the error budget within a defined window.
For example: instead of alerting when error rate exceeds 1%, alert when the current error rate, if sustained, would consume the entire monthly error budget within 6 hours. This approach produces fewer alerts (minor, transient spikes don’t trigger) and more actionable ones (every alert represents a real threat to your reliability target).
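A sketch of the underlying burn-rate condition (production implementations, such as the multiwindow approach described in the Google SRE Workbook, evaluate this over short and long windows simultaneously to balance detection speed against precision):

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the budget burns relative to an exactly-on-target pace.
    A burn rate of 1.0 spends the whole budget in exactly one SLO window."""
    return error_rate / (1.0 - slo_target)

def exhausts_within(error_rate: float, slo_target: float,
                    hours: float, window_hours: float = 30 * 24) -> bool:
    """True if the current error rate, sustained, would spend the entire
    window's error budget within `hours` (the alerting condition above)."""
    return burn_rate(error_rate, slo_target) >= window_hours / hours

# 99.9% SLO: a 2% error rate burns at 20x (budget gone in 36 hours),
# so a "budget gone within 6 hours" alert does not fire; 15% does.
print(exhausts_within(0.02, 0.999, hours=6))  # False (threshold is 120x)
print(exhausts_within(0.15, 0.999, hours=6))  # True  (burn rate is 150x)
```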
Common Pitfalls
Too many SLOs. If everything has an SLO, nothing is prioritized. Focus on the 3-5 SLIs that most directly reflect user experience.
SLOs without error budgets. An SLO without an error budget is just a number. The error budget is what makes SLOs actionable by connecting reliability to deployment decisions.
Not reviewing SLOs regularly. User expectations change. Traffic patterns shift. An SLO that was appropriate last year might be too loose (or too tight) today. Review quarterly.
Internal SLOs treated as SLAs. When leadership treats an internal SLO violation as a failure rather than a signal to invest in reliability, it creates the same perverse incentives as hard SLAs: teams game the metrics instead of improving the system.
How AI Supports SLO Management
AI-driven platforms help with SLO management in two areas. First, they reduce MTTR when SLO violations do occur, preserving error budget that would otherwise be consumed by slow incident response. NeuBird AI compresses investigation time, which directly translates to less error budget spent per incident.
Second, AI can proactively identify trends that threaten SLOs before they trigger violations. By analyzing telemetry patterns, an AI agent can surface a gradual latency increase that will breach the SLO within days, giving the team time to address it during business hours rather than during a 3 AM incident.
Key Takeaways
- SLIs are measurements (what you measure), SLOs are targets (what you aim for), and SLAs are contractual promises (what you guarantee to customers).
- Error budgets turn SLOs into actionable decision-making tools: spend budget on velocity when it’s healthy, focus on reliability when it’s depleted.
- SLAs should always be less strict than SLOs to provide a safety buffer.
- SLO-based alerting produces fewer, more actionable alerts than static threshold alerting.
- Focus on 3-5 SLIs that reflect user experience (availability, latency, correctness), not internal metrics like CPU usage.
Related Reading
- What is Observability? – The data collection foundation that SLIs depend on.
- What is Alert Fatigue? – Why SLO-based alerting produces better signal than threshold-based alerting.
- What is MTTR (Mean Time to Resolution)? – Fast incident resolution preserves error budget.
- Google SRE Book: Service Level Objectives – The foundational reference on SLOs, SLIs, and error budgets.
Frequently Asked Questions
What's the difference between SLO, SLA, and SLI?
SLI (Service Level Indicator) is a measurement, like “request latency is 150ms.” SLO (Service Level Objective) is an internal target, like “P95 latency should be under 300ms.” SLA (Service Level Agreement) is a contractual promise to customers with penalties for violations.
Should my SLA be the same as my SLO?
No. Your SLA should always be less strict than your SLO to provide a safety buffer. If both are set to 99.9% and you violate the SLO, you immediately violate the SLA. A buffer (e.g., SLO at 99.9%, SLA at 99.5%) gives engineering time to address issues before they become contractual problems.
What is an error budget?
An error budget is the difference between 100% and your SLO target. If your SLO is 99.9%, your error budget is 0.1%, which equals about 43 minutes of allowed downtime per month. Error budgets turn reliability into a spendable resource that informs deployment decisions.
Is 99.99% availability always better than 99.9%?
Not necessarily. Higher SLOs leave less room for change, which slows down deployment frequency. The right SLO depends on user expectations and business impact. Most user-facing services don’t need the same availability as a payment processor.
How do I choose good SLIs?
Choose measurements that reflect user experience. Availability (proportion of successful requests), latency (how long requests take), throughput (request volume), and correctness (right answers) are common choices. Internal metrics like CPU usage are not good SLIs because users don’t experience them directly.
What is SLO-based alerting?
SLO-based alerting fires when the rate of SLO violation threatens to exhaust the error budget within a defined window. This produces fewer, more meaningful alerts than static threshold alerting. Minor transient spikes don’t trigger; only sustained issues do.
How often should I review SLOs?
Quarterly is a good baseline. User expectations change, traffic patterns shift, and an SLO that was appropriate last year might be too loose or too tight today. Review after major incidents and during regular planning cycles.
What is 99.9% SLA in minutes?
99.9% availability allows about 43 minutes of downtime per month, or roughly 8.76 hours per year. Each additional 9 cuts allowed downtime by a factor of 10: 99.99% allows 4.3 minutes per month, and 99.999% allows just 26 seconds per month. The cost and difficulty of additional 9s grow exponentially.
Is uptime an SLI or SLO?
Uptime can be either, depending on context. As a measurement (the actual percentage of time the service was available), uptime is an SLI. As a target (we aim for 99.9% uptime), uptime is an SLO. As a contractual commitment (we guarantee 99.5% uptime or you get credits), it’s an SLA.
Who is responsible for SLOs?
SLO ownership varies by organization. In SRE-mature companies, the SRE team typically defines SLOs in collaboration with product owners and engineering leadership. The service owning team is responsible for meeting them. Without clear ownership, SLOs tend to drift or be ignored, so explicit accountability matters.