SLI vs SLO vs SLA: Understanding the Differences

SLI, SLO, and SLA are three terms that get thrown around in monitoring and reliability discussions, often interchangeably. They are not the same thing. Each one serves a different purpose, and understanding the distinction will change how you think about uptime and reliability.

Here is the short version: an SLI is a measurement, an SLO is a target, and an SLA is a contract. They build on each other in that order. For a broader look at monitoring practices, see the uptime monitoring guide.

SLI: Service Level Indicator

An SLI is a metric that measures some aspect of your service's performance. It is a number. Nothing more.

Common SLIs for a website include:

Availability: The percentage of requests that return a successful response (not a 5xx error).
Latency: How long it takes to respond to a request, often measured at a specific percentile (p50, p95, p99).
Error rate: The percentage of requests that result in errors.
Throughput: The number of requests processed per second.

An SLI is not a goal or a promise. It is what you observe. When you check your monitoring dashboard and see that your site responded successfully to 99.95% of requests last month, that number is your availability SLI.
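As a rough illustration of how that number comes about, here is a minimal sketch that computes an availability SLI from a list of HTTP status codes. The function name and the sample data are hypothetical, and it follows the definition above by counting anything that is not a 5xx as a success.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Return the percentage of requests that did not end in a 5xx error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for code in codes if code < 500)
    return 100.0 * good / len(codes)


# Hypothetical sample: 9,995 successful responses and 5 server errors.
sample = [200] * 9_995 + [503] * 5
print(f"availability SLI: {availability_sli(sample):.2f}%")
```

The same loop gives you the error rate for free: it is simply 100 minus the availability figure under this definition.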

Choosing Good SLIs

Not every metric makes a good SLI. The best SLIs are:

Directly tied to user experience. CPU usage on your server is an interesting metric, but it does not directly tell you whether users are having a good experience. Availability and latency do.

Measurable and objective. "The site feels fast" is not an SLI. "95% of requests complete in under 200ms" is.

Actionable. If the SLI degrades, you should be able to investigate and fix the underlying cause. A metric you cannot act on is just noise.

For a typical website or web application, the two most important SLIs are availability (did the request succeed?) and latency (how long did it take?). For more on how availability maps to real-world downtime, see uptime nines explained.
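Latency SLIs are usually expressed either as a percentile or as the share of requests under a threshold. The sketch below shows both, using the nearest-rank percentile method (one of several common definitions, so your monitoring tool may report slightly different values). The function names and sample latencies are made up for illustration.

```python
import math
from typing import Sequence


def percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def share_under(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Hypothetical sample of 100 requests: mostly fast, with a slow tail.
latencies = [120.0] * 90 + [180.0] * 6 + [450.0] * 4
print("p95:", percentile(latencies, 95), "ms")
print("p99:", percentile(latencies, 99), "ms")
print("under 200ms:", share_under(latencies, 200), "%")
```

Note how the p95 looks healthy while the p99 exposes the slow tail, which is why latency is often tracked at more than one percentile.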

Measuring SLIs

You can measure SLIs from two perspectives:

Server-side: Your application logs or monitoring agent records every request and its outcome. This gives you the most detailed data but only measures what your server sees.

Client-side (synthetic monitoring): External checks hit your site from multiple locations at regular intervals. This measures the experience from the user's perspective, including DNS resolution, network routing, and CDN behavior.

Most teams use both. Server-side metrics give you granular data for debugging. Synthetic monitoring gives you the ground truth of whether your site is reachable.
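To make the synthetic side concrete, here is a minimal sketch of a single external check using only the Python standard library. The function names, the 10-second timeout, and the choice to treat any non-5xx response as success are assumptions for illustration. A real monitor would run checks like this on a schedule from several locations.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_status(url: str, timeout_s: float = 10.0) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is still available.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = http_status) -> Tuple[bool, float]:
    """Run one check and return (succeeded, elapsed milliseconds)."""
    started = time.monotonic()
    try:
        succeeded = fetch(url) < 500
    except OSError:
        # DNS failures, refused connections, and timeouts all count as a failed check.
        succeeded = False
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return succeeded, elapsed_ms


# Usage (performs a real network request):
#   ok, elapsed_ms = probe("https://example.com/")
```

The `fetch` parameter is there so the check logic can be exercised without touching the network, for example by passing a function that returns a fixed status code.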

SLO: Service Level Objective

An SLO is a target value for an SLI. It says "we aim for this metric to stay above (or below) this threshold."

Examples:

Availability SLO: 99.9% of requests will return a successful response, measured monthly.
Latency SLO: 95% of requests will complete in under 300ms, measured over a rolling 7-day window.
Error rate SLO: Fewer than 0.1% of requests will return a 5xx error.

An SLO is an internal commitment. It is the target your team sets for itself. It is not a promise to customers (that is the SLA). You can have multiple SLOs for different SLIs, and you can set them at different levels for different services based on their importance.
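One way to picture the relationship is as a small data structure: an SLO is just a named threshold plus a direction, and checking it means comparing it against the measured SLI. This is a sketch with hypothetical names, not a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float           # threshold the SLI is compared against, in percent
    higher_is_better: bool  # True for availability, False for error rate

    def is_met(self, sli_value: float) -> bool:
        """Compare a measured SLI against this objective."""
        if self.higher_is_better:
            return sli_value >= self.target
        # "Fewer than X%" objectives require the SLI to be strictly below the target.
        return sli_value < self.target


# The example objectives from the text.
availability = SLO("monthly availability", target=99.9, higher_is_better=True)
fast_requests = SLO("requests under 300ms (7-day)", target=95.0, higher_is_better=True)
error_rate = SLO("5xx error rate", target=0.1, higher_is_better=False)

print(availability.is_met(99.95))  # measured 99.95% availability
print(error_rate.is_met(0.25))     # measured 0.25% errors
```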

Setting Good SLOs

The biggest mistake teams make with SLOs is setting them too high. A 100% availability SLO sounds great until you realize it means zero tolerance for any failure, ever. No deployment, no maintenance window, no infrastructure hiccup can cause a single failed request. That is not realistic, and chasing it wastes engineering time that could be spent on features.

Here is a practical framework for setting SLOs:

Start with user expectations. How much downtime would your users actually notice or care about? For an e-commerce site, an hour of downtime during business hours is a serious problem. For an internal reporting tool checked once a week, an hour of downtime might not matter at all.

Look at your current performance. If your SLI shows 99.8% availability over the past six months, setting a 99.99% SLO is aspirational but probably not achievable without significant infrastructure investment. Start with something slightly above your current baseline, like 99.9%.

Factor in your dependencies. Your service cannot be more reliable than its least reliable critical dependency. If your database provider offers a 99.95% SLA, setting a 99.99% availability SLO for your application is unrealistic unless you have redundancy across providers. To see how uptime percentages translate to real downtime, see how to calculate uptime.
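The dependency point can be made with a little arithmetic. As a simplified sketch, assume every dependency is required for a request to succeed and that failures are independent; the combined availability is then the product of the individual availabilities. Real systems rarely fail independently, so treat the result as a rough estimate. The function name is hypothetical.

```python
import math
from typing import Iterable


def serial_availability(availabilities_percent: Iterable[float]) -> float:
    """Combined availability of components that must all be up, assuming independent failures."""
    return 100.0 * math.prod(a / 100.0 for a in availabilities_percent)


# Your application at 99.99% on top of a database with a 99.95% SLA.
combined = serial_availability([99.99, 99.95])
print(f"combined availability: {combined:.3f}%")
```

Even with a near-perfect application, the combined figure lands below the database's 99.95%, which is why a 99.99% SLO on top of that dependency is out of reach without redundancy.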

Differentiate by importance. Your checkout page probably needs a tighter SLO than your blog. Your API endpoints serving paying customers need a tighter SLO than your free tier. Not everything needs to be 99.99%.

| SLO Target | Monthly Downtime Budget | Good For |
|------------|------------------------|----------|
| 99% | 7 hours 18 minutes | Internal tools, staging environments |
| 99.5% | 3 hours 39 minutes | Low-traffic services, blogs |
| 99.9% | 43 minutes 50 seconds | Production web applications |
| 99.95% | 21 minutes 55 seconds | E-commerce, SaaS products |
| 99.99% | 4 minutes 23 seconds | Critical infrastructure, payment systems |
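The downtime figures in the table follow from one formula: budget = (100% - SLO) x length of the period. The sketch below assumes an average month of 365.25 / 12 days (about 30.44 days), which reproduces the table to within rounding. The function names are hypothetical.

```python
AVERAGE_MONTH_MINUTES = 365.25 / 12 * 24 * 60  # 43,830 minutes


def downtime_budget_minutes(slo_percent: float,
                            period_minutes: float = AVERAGE_MONTH_MINUTES) -> float:
    """Minutes of downtime allowed in the period while still meeting the SLO."""
    return (100.0 - slo_percent) / 100.0 * period_minutes


def format_duration(minutes: float) -> str:
    """Render a duration in minutes as hours, minutes, and seconds."""
    total_seconds = round(minutes * 60)
    hours, remainder = divmod(total_seconds, 3600)
    mins, secs = divmod(remainder, 60)
    return f"{hours}h {mins}m {secs}s"


for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {format_duration(downtime_budget_minutes(slo))}")
```

For a specific calendar month, pass the real length instead, for example `period_minutes=30 * 24 * 60` for a 30-day month, which gives 43.2 minutes at 99.9%.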

SLA: Service Level Agreement

An SLA is a formal contract between a service provider and a customer. It defines what level of service the provider commits to delivering, and what happens if they fail to deliver it.

The key difference between an SLO and an SLA: an SLO is an internal target with no contractual consequences. An SLA is a binding agreement with financial penalties.

What an SLA Typically Includes

Uptime commitment: "We guarantee 99.9% monthly uptime."
How uptime is measured: The specific methodology and measurement window.
Remedies for violations: Usually service credits. "If uptime falls below 99.9% in a calendar month, the customer receives a 10% credit on their monthly invoice."
Exclusions: What does not count. Most SLAs exclude scheduled maintenance, customer-caused issues, and events outside the provider's control.
Reporting and claims process: How the customer requests credits and what evidence they need to provide.

SLA Examples from Real Providers

AWS EC2: 99.99% monthly uptime. 10% credit if uptime falls between 99.0% and 99.99%. 30% credit if below 99.0%.

Google Cloud Compute Engine: 99.99% monthly uptime for multi-zone deployments. 10-50% credits depending on the severity of the miss.

Azure Virtual Machines: 99.99% for multi-instance deployments. 10-100% credits depending on actual uptime.

Notice the pattern: the guarantees are high (99.99%), but the credits for a typical miss are modest (often 10-30% of your bill for that month, with larger credits reserved for severe outages). A major outage that costs your business thousands of dollars in lost revenue might earn you $50 in hosting credits. SLAs protect the provider more than they protect you. For a deeper look at SLA terms and what they actually mean, see the uptime SLA guide.
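To see how small the credit can be, here is a sketch of a tiered credit calculation. The tiers are the two quoted above for EC2 and are used purely as illustrative inputs; real SLAs may define additional tiers and change over time, so check the provider's current terms. The function name and the $500 bill are hypothetical.

```python
from typing import Sequence, Tuple

# (uptime floor in percent, credit in percent): uptime below the floor earns the credit.
# Illustrative tiers taken from the EC2 example in the text, not a current price list.
EXAMPLE_TIERS: Sequence[Tuple[float, float]] = [(99.0, 30.0), (99.99, 10.0)]


def service_credit(uptime_percent: float, monthly_bill: float,
                   tiers: Sequence[Tuple[float, float]] = EXAMPLE_TIERS) -> float:
    """Return the credit owed for the month under a tiered SLA."""
    # Check the lowest floor first so the most severe applicable tier wins.
    for floor, credit_percent in sorted(tiers):
        if uptime_percent < floor:
            return monthly_bill * credit_percent / 100.0
    return 0.0


# A month at 99.9% uptime (about 43 minutes down) on a hypothetical $500 bill.
print(service_credit(99.9, 500.0))
```

Under these assumptions, roughly 43 minutes of downtime is worth a $50 credit, regardless of what the outage cost your business.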

SLA vs SLO: Why You Need Both

Your hosting provider's SLA is their promise to you. Your SLO is your promise to yourself (and your customers, if you publish one).

These should not be the same number. Your SLO should be looser than your provider's SLA, because your own service needs headroom for issues that are not your hosting provider's fault -- application bugs, deployment mistakes, DNS misconfiguration, third-party service failures.

If your hosting provider guarantees 99.95% uptime, and you set your own SLO at 99.95%, you have zero margin for anything going wrong on your side. Set your internal SLO at 99.9% and work toward 99.95%, keeping the buffer for the things you can control.

How SLIs, SLOs, and SLAs Relate

The three concepts form a chain:

SLIs measure your service. They are the raw data.
SLOs set targets for your SLIs. They define "good enough."
SLAs formalize SLOs into contracts with consequences.

Here is a concrete example for a website:

| Layer | Example |
|-------|---------|
| SLI | 99.93% of requests returned a 2xx or 3xx status code this month |
| SLO | We target 99.9% availability, measured monthly |
| SLA | We guarantee 99.5% availability; credits issued if we miss it |

The SLI tells you where you are. The SLO tells you where you need to be. The SLA tells you what happens if you fall too far.
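The three layers can be expressed as a simple comparison. This sketch uses the numbers from the example above; the function name and the status strings are hypothetical.

```python
def reliability_status(sli_percent: float, slo_percent: float, sla_percent: float) -> str:
    """Place a measured SLI relative to the internal target and the contractual floor."""
    if sla_percent > slo_percent:
        raise ValueError("the SLA should not be stricter than the internal SLO")
    if sli_percent >= slo_percent:
        return "within SLO"
    if sli_percent >= sla_percent:
        return "SLO missed, SLA still met"
    return "SLA breached"


print(reliability_status(99.93, slo_percent=99.9, sla_percent=99.5))
print(reliability_status(99.70, slo_percent=99.9, sla_percent=99.5))
print(reliability_status(99.20, slo_percent=99.9, sla_percent=99.5))
```

The middle state is the useful one: the gap between SLO and SLA gives the team a warning zone before any contractual consequences apply.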

Error Budgets

An error budget is the gap between 100% and your SLO. If your SLO is 99.9% availability, your error budget is 0.1% -- roughly 43 minutes of downtime per month.

The error budget concept comes from Google's Site Reliability Engineering (SRE) practices. The idea is powerful: instead of treating every outage as a crisis, you accept that some amount of failure is normal and expected. Your error budget is the amount of failure you can tolerate while still meeting your SLO.
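An error budget can also be counted in requests rather than minutes. The sketch below assumes a request-based availability SLO and hypothetical function names. The SLO is passed as a string so the arithmetic is exact instead of being subject to floating-point rounding.

```python
import math
from fractions import Fraction


def error_budget_requests(slo_percent: str, total_requests: int) -> int:
    """Number of failed requests allowed in the window, given an SLO such as "99.9"."""
    allowed_share = 1 - Fraction(slo_percent) / 100
    return math.floor(allowed_share * total_requests)


def budget_consumed_percent(slo_percent: str, total_requests: int,
                            failed_requests: int) -> float:
    """Share of the error budget already spent, as a percentage."""
    budget = error_budget_requests(slo_percent, total_requests)
    if budget == 0:
        raise ValueError("no error budget: SLO is 100% or the window has too few requests")
    return 100.0 * failed_requests / budget


# Hypothetical month: one million requests, 400 of them failed.
print(error_budget_requests("99.9", 1_000_000))
print(budget_consumed_percent("99.9", 1_000_000, 400))
```

With a 99.9% SLO and one million requests, the budget is 1,000 failed requests, so 400 failures means 40% of the budget is gone.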

How Teams Use Error Budgets

Deployment decisions. If you have plenty of error budget remaining, you can deploy aggressively -- ship features faster, run experiments, take controlled risks. If your error budget is nearly exhausted, you slow down and focus on stability.

Prioritization. When the error budget is healthy, product work takes priority. When the error budget is burning, reliability work takes priority. This gives engineering teams a data-driven way to balance feature development against reliability.

Incident response. Not every outage requires the same response. A 2-minute blip when you have 40 minutes of budget remaining is worth investigating but not worth a war room. A 30-minute outage when you have 10 minutes of budget remaining is a serious problem that needs immediate attention.
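The incident response rule can be written down as a comparison between the size of an outage and the budget that remains. The thresholds below (anything that exhausts the remaining budget is urgent, anything over a quarter of it gets prioritized) are assumptions chosen for illustration, as are the function name and labels. Each team picks its own.

```python
def incident_response(outage_minutes: float, budget_remaining_minutes: float) -> str:
    """Suggest a response level from outage size relative to the remaining error budget."""
    if budget_remaining_minutes <= 0 or outage_minutes >= budget_remaining_minutes:
        # The outage alone uses up everything that is left.
        return "immediate attention"
    if outage_minutes >= 0.25 * budget_remaining_minutes:  # hypothetical threshold
        return "prioritize"
    return "investigate"


print(incident_response(2, 40))   # the 2-minute blip from the text
print(incident_response(30, 10))  # the 30-minute outage from the text
```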

Error Budget Example

Your SLO is 99.9% monthly availability. That gives you about 43 minutes of downtime budget for the month.

On April 5, a deployment causes 8 minutes of errors. On April 12, a database failover takes 15 minutes. On April 18, a third-party API outage causes 12 minutes of degraded service. Budget remaining: about 8 minutes, with 12 days of the month still to go.

The team should freeze risky deployments, investigate the recurring issues, and focus on stability. Without an error budget, these decisions are subjective. With one, they are based on data. For more on the metrics that matter during incidents, see incident response metrics.
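Tallying the April incidents takes only a few lines. This sketch uses the exact length of April (30 days) and, like the example, counts the degraded-service incident in full against the budget. The variable names are hypothetical.

```python
SLO_PERCENT = 99.9
DAYS_IN_MONTH = 30  # April

# (day of month, minutes of impact, cause), taken from the example above.
incidents = [
    (5, 8, "deployment"),
    (12, 15, "database failover"),
    (18, 12, "third-party API outage"),
]

budget_minutes = (100 - SLO_PERCENT) / 100 * DAYS_IN_MONTH * 24 * 60
consumed_minutes = sum(minutes for _, minutes, _ in incidents)
remaining_minutes = budget_minutes - consumed_minutes
last_incident_day = max(day for day, _, _ in incidents)
month_elapsed = last_incident_day / DAYS_IN_MONTH

print(f"budget:    {budget_minutes:.1f} min")
print(f"consumed:  {consumed_minutes} min ({consumed_minutes / budget_minutes:.0%} of budget)")
print(f"remaining: {remaining_minutes:.1f} min")
print(f"month elapsed: {month_elapsed:.0%}")
```

The comparison that matters is the last two lines: about 81% of the budget is spent with only 60% of the month behind you, so the budget is burning faster than the calendar.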

Practical Steps to Get Started

If you do not currently track SLIs, SLOs, or error budgets, here is a simple path to start:

Pick one or two SLIs. Availability and latency are the best starting points.
Measure for a month. Before setting an SLO, you need a baseline. Run monitoring for at least 30 days.
Set a realistic SLO. Set a target slightly better than your current performance. If you measured 99.8%, aim for 99.9%.
Calculate your error budget. Subtract your SLO from 100%. Track how much budget you consume each month.
Review monthly. Check whether you met your SLO and how much error budget you used. Adjust over time.

You do not need expensive tooling. A simple uptime monitor checking your site every minute gives you the availability SLI. A monthly spreadsheet tracking SLO compliance gives you the error budget. Start simple.
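The spreadsheet step can be sketched in code as well. The example below assumes one check per minute, so each failed check approximates one minute of downtime, and turns a month of pass/fail results into the numbers you would review. The function name, dictionary keys, and sample data are hypothetical.

```python
from typing import Dict, Sequence, Union


def monthly_report(check_results: Sequence[bool],
                   slo_percent: float) -> Dict[str, Union[float, int, bool]]:
    """Summarize a month of pass/fail uptime checks against an availability SLO."""
    if not check_results:
        raise ValueError("need at least one check result")
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    total = len(check_results)
    failed = sum(1 for ok in check_results if not ok)
    availability = 100.0 * (total - failed) / total
    budget_checks = (100.0 - slo_percent) / 100.0 * total
    return {
        "availability_percent": availability,
        "failed_checks": failed,
        "budget_checks": budget_checks,
        "budget_used_percent": 100.0 * failed / budget_checks,
        "slo_met": availability >= slo_percent,
    }


# Hypothetical 30-day month of one-minute checks (43,200 in total), 35 of which failed.
checks = [True] * 43_165 + [False] * 35
report = monthly_report(checks, slo_percent=99.9)
for key, value in report.items():
    print(f"{key}: {value}")
```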

References

Beyer, B., Jones, C., Petoff, J., Murphy, N.R., "Site Reliability Engineering," O'Reilly Media, https://sre.google/sre-book/table-of-contents/