What are SLOs, SLAs, and SLIs

Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) form the reliability measurement framework that underpins modern site reliability engineering. An SLI measures service behavior quantitatively, an SLO sets target values for SLIs, and an SLA represents contractual commitments with financial consequences for non-compliance.

01

Service Level Indicators (SLIs)

An SLI is a quantitative measurement of some aspect of a service's behavior, taken from the user's perspective. Good SLIs measure what users actually experience: Availability (proportion of successful requests), Latency (how long requests take to complete, usually tracked as a percentile such as P95), Throughput (requests handled per unit of time), Error rate (proportion of failed requests), and Correctness (accuracy of returned data). SLIs should reflect user experience, not internal metrics like CPU usage. They can be measured at three points: server-side (load balancer metrics or logs), client-side (browser or mobile app), or through synthetic monitoring (automated external tests).
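As a rough illustration of two of the SLIs above, the sketch below computes availability and P95 latency from a handful of server-side request records. The Request type, its field names, and the choice to count only 5xx responses as failures are assumptions made for this example, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field name)
    latency_ms: float  # server-side request duration in milliseconds


def availability(requests):
    """Proportion of requests that did not fail with a 5xx status."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up sample data for illustration only.
sample = [
    Request(200, 120.0),
    Request(200, 90.0),
    Request(503, 450.0),
    Request(200, 180.0),
]

avail = availability(sample)
p95 = percentile([r.latency_ms for r in sample], 95)
print(f"availability={avail:.2%}, p95={p95}ms")
```

In practice these values would come from a metrics system over a rolling window rather than an in-memory list, and what counts as a "successful" request is a decision each team has to make for its own service.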

02

Service Level Objectives (SLOs) and Error Budgets

An SLO is a target value for an SLI that separates acceptable from unacceptable performance. Examples: "99.9% of requests return successfully" or "P95 latency under 300ms." Stricter SLOs leave less room for risky deployments and slow feature velocity, so an SLO should be set according to actual user impact rather than pushed as high as possible. The error budget is the difference between 100% and the SLO target. A 99.9% SLO leaves a budget of 0.1%, which is about 43 minutes of downtime in a 30-day month (0.1% of 43,200 minutes). Error budgets turn reliability into a spendable resource, letting teams balance deployment speed against stability.
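The error budget arithmetic can be sketched in a few lines. The function names below are hypothetical, and the 30-day window is an assumption; it shows the downtime allowance for a time-based SLO and how much budget remains for a request-based SLO.

```python
def error_budget_fraction(slo_target):
    """Error budget as a fraction, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime allowance for the window, assuming a time-based SLO."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = error_budget_fraction(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(round(allowed_downtime_minutes(0.999), 1))            # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))    # 0.75
```

With a 99.9% SLO and one million requests in the window, 1,000 failures are allowed, so 250 failures leave three quarters of the budget unspent.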

03

Service Level Agreements (SLAs)

An SLA is a contractual commitment to customers about service performance, with consequences (usually financial) for failing to meet it. SLAs are business documents rather than engineering metrics. Key distinction: an SLA should be less strict than the corresponding internal SLO, so that engineers have a buffer to react before a contractual violation occurs. SLO-based alerting fires when the rate of violations threatens to exhaust the error budget, producing fewer but more actionable alerts than static-threshold approaches.

Key Takeaways

What to remember

  1. SLIs are measurements (what you measure), SLOs are targets (what you aim for), and SLAs are contractual promises
  2. Error budgets turn reliability into a spendable resource that guides deployment decisions
  3. SLAs should be less strict than SLOs to leave a safety buffer
  4. SLO-based alerting generates fewer, more meaningful alerts
  5. Focus on 3–5 user-experience SLIs rather than internal infrastructure metrics

FAQ

Frequently asked questions

What's the difference between SLO, SLA, and SLI?

An SLI is a measurement of performance (e.g., a request latency of 150ms); an SLO is an internal target for that measurement (e.g., P95 latency under 300ms); an SLA is a contractual promise to customers, with penalties for violations.

Should my SLA be the same as my SLO?

No. An SLA should be less strict than the SLO. The gap acts as a buffer: an SLO breach does not immediately become an SLA violation, which leaves time to remediate.

What is an error budget?

The difference between 100% and the SLO target, which quantifies how much unreliability is allowed. A 99.9% SLO permits roughly 43 minutes of downtime per month.

Is 99.99% availability always better than 99.9%?

Not necessarily. Higher targets reduce deployment flexibility; the appropriate SLO depends on actual user expectations and business impact.

How do I choose good SLIs?

Select measurements reflecting user experience (availability, latency, throughput, correctness) rather than internal metrics.

What is SLO-based alerting?

Alerting that fires when the rate of violations threatens to exhaust the error budget within a defined window, which produces fewer false alerts from transient blips.
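One common way to express "threatens to exhaust the error budget" is a burn rate: the observed error rate divided by the error budget fraction. The sketch below is a minimal example under stated assumptions. The function names are hypothetical, the two-window check is one widely used pattern rather than the only one, and the default threshold of 14.4 is an assumed value that corresponds to spending 2% of a 30-day budget in one hour (0.02 x 720 hours).

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_rate / (1.0 - slo_target)


def should_page(long_window_error_rate, short_window_error_rate,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) shows the problem is significant; the
    short window (e.g. 5 minutes) shows it is still happening. Window
    lengths and the threshold are assumptions to tune per service.
    """
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


# 2% errors against a 99.9% SLO burns budget about 20x too fast.
print(should_page(0.02, 0.02, 0.999))    # True
# The long window is still elevated but the short window has recovered.
print(should_page(0.02, 0.0005, 0.999))  # False
```

Requiring both windows to exceed the threshold is what suppresses alerts for brief spikes that have already cleared.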

How often should I review SLOs?

Quarterly baseline reviews, plus reviews following major incidents and planning cycles.

What is a 99.9% SLA in minutes?

About 43 minutes of downtime per month, or 8.76 hours per year. Each additional nine cuts the allowance by a factor of ten.
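To see how each additional nine shrinks the allowance, the short sketch below tabulates downtime for several availability targets. It assumes a 30-day month and a 365-day year, and the function name is hypothetical.

```python
def downtime_allowance(availability_pct):
    """Return (minutes per 30-day month, hours per 365-day year) of allowed downtime."""
    budget = 1 - availability_pct / 100
    per_month_min = budget * 30 * 24 * 60
    per_year_hr = budget * 365 * 24
    return per_month_min, per_year_hr


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes, hours = downtime_allowance(pct)
    print(f"{pct}%: {minutes:.2f} min/month, {hours:.3f} h/year")
```

Going from 99.9% to 99.99% drops the monthly allowance from about 43 minutes to about 4.3 minutes.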

Who is responsible for SLOs?

SRE teams typically define SLOs collaboratively with product owners; service teams ensure compliance. Clear ownership prevents drift.
