SLA Monitoring: How to Track Uptime Promises
How to monitor SLA compliance: what to track, how to verify provider claims, tools and approaches for SLA monitoring, and how to request SLA credits with evidence.
Your hosting provider promises 99.9% uptime. Your CDN claims 99.99%. Your DNS provider says "100% SLA." But are they actually delivering? Unless you are measuring independently, you are taking their word for it. And when it comes time to claim credits for an SLA breach, "it felt like the site was down a lot" will not get you very far.
SLA monitoring means tracking your providers' actual performance against the uptime and performance guarantees they have committed to in their service level agreements. It gives you the data to verify claims, the evidence to request credits, and the visibility to make informed decisions about your infrastructure. For background on how SLAs work, see the uptime SLA guide.
Why You Need Independent SLA Monitoring
Provider Dashboards Are Not Enough
Every major hosting and cloud provider has a status page. AWS has status.aws.amazon.com. Google Cloud has status.cloud.google.com. These pages are useful, but they have a fundamental conflict of interest: the provider is reporting on its own performance. Status pages are often slow to acknowledge issues, quick to declare them resolved, and sometimes do not reflect problems that affected only a subset of customers.
During AWS's major us-east-1 outage in December 2021, the status dashboard itself was impaired because it depended on the same infrastructure. Customers knew their services were broken well before the official status page reflected the problem.
You Need Your Own Numbers
When you monitor independently, you get:
- Objective data that is not filtered through the provider's reporting.
- Faster detection of issues affecting your specific services.
- Historical records you control, useful for SLA credit claims and vendor evaluations.
- Coverage of your full stack, not just individual provider components.
Your website's uptime depends on a chain of services: DNS, CDN, hosting, databases, third-party APIs. A provider's status page only covers their piece. Your monitoring covers the whole experience. For a broader look at monitoring practices, see the uptime monitoring guide.
What to Track for SLA Compliance
Uptime Percentage
This is the most common SLA metric. Uptime is typically defined as the percentage of time your service responds successfully within a measurement window (usually a calendar month).
Uptime % = ((Total Minutes - Downtime Minutes) / Total Minutes) x 100
A 30-day month has 43,200 minutes. At 99.9% uptime, the allowed downtime is about 43 minutes per month (43.2, to be exact). At 99.99%, it is about 4 minutes (4.3). The difference between those two numbers sounds small in percentage terms but is enormous in practice. See uptime nines explained for the full breakdown.
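The arithmetic above can be sketched in a few lines of Python. The function names and the 30-day default are illustrative assumptions; your SLA defines its own measurement window.

```python
def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime % = ((total - downtime) / total) x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime budget, in minutes, for a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    # Prints 43.2 for 99.9% and 4.3 for 99.99%.
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min per 30 days")
```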
When tracking uptime for SLA purposes, be precise about:
- Measurement interval. How often are you checking? Every minute? Every 5 minutes? A 5-minute check interval means you could miss a 4-minute outage entirely, or record a 1-minute outage as 5 minutes.
- What counts as "down." Is a timeout "down"? Is a 500 error "down"? Is a 200 response that takes 30 seconds "down"? Define this before you start measuring.
- Measurement locations. A site that is reachable from Virginia but unreachable from London is partially down. Multi-location monitoring gives you a more accurate picture.
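One way to pin down "what counts as down" is to write the rule as code before you start measuring. The sketch below is an assumption, not a standard: the `DownPolicy` name, the 30-second threshold, and the choice to treat 5xx as down are all placeholders to be aligned with your provider's SLA wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DownPolicy:
    # Hypothetical thresholds; adjust to match the SLA you are tracking.
    max_response_seconds: float = 30.0
    count_5xx_as_down: bool = True


def is_down(status: Optional[int], elapsed_seconds: float,
            policy: DownPolicy = DownPolicy()) -> bool:
    """Classify one check result.

    `status` is None when no HTTP response arrived at all
    (timeout, DNS failure, connection error).
    """
    if status is None:
        return True
    if policy.count_5xx_as_down and 500 <= status <= 599:
        return True
    # A response that arrives too slowly is treated as down.
    return elapsed_seconds > policy.max_response_seconds
```

Keeping the rule in one function means every check, from every location, is judged the same way, which makes the resulting numbers easier to defend in a claim.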
Response Time
Some SLAs include latency commitments, though these are less common than uptime guarantees. Track p50 (median), p95, and p99 response times. The median tells you about the typical experience. The p95 and p99 describe the slowest 5% and 1% of requests, which is where your users have their worst experiences.
Typical SLA latency clause:
"95% of requests will complete within 500ms"
Response time tracking also helps you detect degradation before it becomes a full outage. A server that is responding but taking 10 seconds per request is technically "up" but functionally broken.
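As a rough illustration, the percentiles and the sample latency clause can be computed from a list of response times. This sketch uses the nearest-rank percentile method, which is only one of several definitions; the function names and the millisecond unit are assumptions.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_clause(samples_ms: Sequence[float],
                         threshold_ms: float = 500.0,
                         required_fraction: float = 0.95) -> bool:
    """Check a clause such as '95% of requests will complete within 500ms'."""
    if not samples_ms:
        raise ValueError("no samples")
    within = sum(1 for s in samples_ms if s <= threshold_ms)
    return within / len(samples_ms) >= required_fraction
```

If your provider specifies how percentiles are calculated, use their method rather than this one so the two sets of numbers are comparable.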
Error Rate
Track the percentage of requests that return 4xx and 5xx status codes. Focus on 5xx errors for SLA purposes -- these indicate server-side failures. A spike in 500 errors is often the first sign of an infrastructure problem, even if the site is still technically responding.
Specific Service Metrics
Depending on what SLAs you hold, you may need to track:
- DNS resolution time and availability -- for DNS provider SLAs.
- SSL certificate validity -- if your CDN or hosting SLA covers certificate management.
- CDN cache hit rates and edge response times -- for CDN provider SLAs.
- Database connection availability and query latency -- for managed database SLAs.
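For the certificate item, a small helper can turn a certificate's expiry string into days remaining. This is a sketch under the assumption that you already have the `notAfter` value in the format Python's `ssl` module returns from `getpeercert()`; the function name is hypothetical.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". `now` is a Unix timestamp
    and defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

A negative result means the certificate has already expired. Alerting when the value drops below a threshold you choose gives you time to raise the issue with the provider before users see errors.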
How to Monitor SLAs
Synthetic Monitoring (External Checks)
Synthetic monitoring sends automated requests to your site from external locations at regular intervals. This is the most common and most straightforward approach to SLA monitoring.
A typical synthetic monitoring setup:
- Checks every 1 minute from 3 or more geographic locations.
- Records response status code, response time, and SSL certificate status.
- Alerts immediately when a check fails from multiple locations (to avoid false positives from single-location network issues).
- Stores historical data for reporting and SLA credit claims.
For SLA monitoring specifically, the check interval matters. If your SLA is measured in minutes of downtime, you need minute-level checks to have accurate data. A 5-minute check interval gives you 5-minute granularity at best, which is too coarse for a 99.99% SLA where your total monthly budget is only about 4 minutes.
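A single check in such a setup might look like the sketch below, which uses only the Python standard library. The `CheckResult` fields and the injectable `opener` parameter are assumptions made for illustration; scheduling the probe every minute and running it from several locations is left to whatever tooling you use.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    checked_at: float          # Unix timestamp when the check started
    status: Optional[int]      # HTTP status, or None if no response arrived
    elapsed_seconds: float
    error: Optional[str] = None


def probe(url: str, timeout: float = 10.0,
          opener: Callable = urllib.request.urlopen) -> CheckResult:
    """Run one HTTP check. `opener` is injectable so the function can be tested offline."""
    started = time.time()
    t0 = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code      # 4xx/5xx responses still carry a status code
    except OSError as exc:     # URLError, timeouts, connection errors
        error = str(exc)
    return CheckResult(url, started, status, time.monotonic() - t0, error)
```

Storing every `CheckResult`, not just the failures, is what later lets you calculate a monthly uptime percentage and back up a claim.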
Multi-location checks reduce false positives
A single monitoring location might report downtime due to a local network issue that has nothing to do with your provider. Require failures from at least 2 out of 3 locations before counting an incident as real downtime. This gives you more accurate SLA compliance data and fewer false alarms.
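The 2-out-of-3 rule is simple to express in code. In this sketch the location names are placeholders and the quorum of 2 is the value suggested above; both are assumptions you can change.

```python
from typing import Mapping


def confirmed_down(failed_by_location: Mapping[str, bool], quorum: int = 2) -> bool:
    """True when at least `quorum` locations saw the check fail.

    `failed_by_location` maps a location name to True if that
    location's check failed during the same check round.
    """
    failures = sum(1 for failed in failed_by_location.values() if failed)
    return failures >= quorum
```

For example, `confirmed_down({"virginia": True, "london": False, "tokyo": False})` is False, so a failure seen only from one location would not be counted toward downtime.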
Real User Monitoring (RUM)
RUM collects performance data from actual user sessions. It captures real-world latency, error rates, and availability as experienced by your visitors. RUM is valuable for understanding the user experience but is less useful for SLA credit claims because providers typically define uptime based on server-side availability, not client-side performance.
RUM complements synthetic monitoring rather than replacing it. Synthetic monitoring tells you "is the site up?" RUM tells you "how are users experiencing the site?"
Log-Based Monitoring
If you have access to server logs or application logs, you can calculate uptime and error rates from the logs themselves. This is useful as a secondary data source but has limitations:
- Logs only capture requests that reach your server. If DNS or the network is down, there are no log entries.
- Log processing introduces delays. You typically get insights minutes or hours after the fact, not in real time.
- Log storage and analysis requires its own infrastructure.
For SLA monitoring, use log-based analysis as a supplement to synthetic monitoring, not as a replacement.
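As a supplementary data source, a server-side error rate can be pulled from access logs with a short script. This sketch assumes logs in the common or combined log format, where the status code follows the quoted request line; the regular expression and function name are illustrative.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the 3-digit status code that follows the quoted request line,
# e.g. ... "GET / HTTP/1.1" 200 2326
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def server_error_rate(lines: Iterable[str]) -> float:
    """Fraction of logged requests that returned a 5xx status."""
    classes: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            classes[match.group(1)[0]] += 1   # bucket by first digit: "2", "4", "5"
    total = sum(classes.values())
    return classes["5"] / total if total else 0.0
```

Note that an empty log returns 0.0 here, which illustrates the first limitation above: silence in the logs can mean a healthy quiet period or an outage upstream of the server.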
SLA Credit Claims: The Process
When your provider misses their SLA, you are entitled to credits. But credits are not automatic -- you almost always need to file a claim.
Step 1: Document the Incident
As soon as you detect an outage or degradation, start documenting:
- Start time of the incident (in UTC, as most SLAs use UTC).
- End time when service was fully restored.
- Duration in minutes.
- Evidence: screenshots of your monitoring dashboard, alert notifications, response time graphs, error logs.
- Impact: which services were affected and how.
Your monitoring tool's historical data is your primary evidence. This is why storing monitoring history matters -- without it, your claim is "we think we were down for an hour" versus the provider's "our logs show 2 minutes of impact."
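If your monitoring history is a time-ordered series of check results, the start time, end time, and duration of each incident can be derived from it. The sketch below assumes each check is a (timestamp, ok) pair with UTC timestamps; the `Incident` type is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass
class Incident:
    start: datetime
    end: datetime

    @property
    def duration_minutes(self) -> float:
        return (self.end - self.start).total_seconds() / 60


def find_incidents(checks: Iterable[Tuple[datetime, bool]]) -> List[Incident]:
    """Group consecutive failed checks into incidents.

    An incident starts at the first failed check and ends at the
    next successful one.
    """
    incidents: List[Incident] = []
    start: Optional[datetime] = None
    for ts, ok in checks:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            incidents.append(Incident(start, ts))
            start = None
    # An incident still open when the data ends is not reported here.
    return incidents
```

Because the true start and end fall somewhere between checks, the recorded duration is only accurate to within one check interval, which is another reason to check every minute.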
Step 2: Review the SLA Terms
Before filing, re-read the SLA carefully. Pay attention to:
- Measurement methodology. How does the provider define and measure uptime? It may differ from how you measure it.
- Exclusions. Scheduled maintenance, customer-caused issues, and force majeure events are typically excluded.
- Claim window. SLAs set a deadline for filing, often around 30 days after the incident or a fixed number of billing cycles. Check the exact terms for each provider. Miss the window and you forfeit the credit.
- Minimum threshold. Some SLAs only pay credits if downtime exceeds a certain amount.
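Exclusions can change the number you are entitled to claim. As a sketch, the function below subtracts any part of an incident that falls inside excluded windows such as scheduled maintenance. It assumes all timestamps are in UTC and that the excluded windows do not overlap each other; the names are illustrative.

```python
from datetime import datetime
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]  # (start, end)


def overlap_minutes(a: Window, b: Window) -> float:
    """Minutes during which windows `a` and `b` overlap (0 if they do not)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0.0, (end - start).total_seconds() / 60)


def claimable_minutes(incident: Window, exclusions: Iterable[Window]) -> float:
    """Incident duration minus any time inside excluded windows."""
    total = (incident[1] - incident[0]).total_seconds() / 60
    excluded = sum(overlap_minutes(incident, w) for w in exclusions)
    return max(0.0, total - excluded)
```

Running your own numbers through the provider's exclusions before filing helps avoid a denial based on a maintenance window you overlooked.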
Step 3: File the Claim
Most providers have a support ticket or form for SLA credit requests. Include:
- Your account ID and affected services.
- Incident start and end times with timezone.
- Duration of the outage.
- The specific SLA commitment that was missed.
- Your monitoring evidence (screenshots, CSV exports, links to monitoring reports).
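If your monitoring tool does not offer a CSV export, incident data can be rendered into one with the standard library. This is a minimal sketch; the column names are an assumption, and the input datetimes are expected to be timezone-aware.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Iterable, Tuple


def incidents_to_csv(incidents: Iterable[Tuple[datetime, datetime]]) -> str:
    """Render (start, end) incident pairs as CSV with UTC ISO 8601 timestamps."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start_utc", "end_utc", "duration_minutes"])
    for start, end in incidents:
        # Convert to UTC so the times match the convention most SLAs use.
        start_utc = start.astimezone(timezone.utc)
        end_utc = end.astimezone(timezone.utc)
        minutes = (end_utc - start_utc).total_seconds() / 60
        writer.writerow([start_utc.isoformat(), end_utc.isoformat(), f"{minutes:.1f}"])
    return buffer.getvalue()
```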
Step 4: Follow Up
Providers do not always respond quickly to credit claims. If you do not hear back within a week, follow up. If the initial response is a denial, push back with specific data. Providers sometimes deny claims based on their own monitoring data, which may not reflect the experience from your perspective.
The Reality of SLA Credits
SLA credits are better than nothing, but they are rarely proportional to the actual business impact. A major outage that costs you thousands in lost revenue and damaged reputation might earn you a 10-30% credit on that month's hosting bill -- maybe $50-100 for a small business.
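To see how small the numbers tend to be, here is a sketch of a tiered credit calculation. The tier thresholds below are hypothetical, chosen only to match the 10-30% range mentioned above; real schedules come from each provider's SLA document.

```python
from typing import Sequence, Tuple

# Hypothetical schedule: (uptime below this %, credit as % of the monthly bill).
EXAMPLE_TIERS: Sequence[Tuple[float, float]] = ((99.0, 30.0), (99.9, 10.0))


def credit_amount(measured_uptime: float, monthly_bill: float,
                  tiers: Sequence[Tuple[float, float]] = EXAMPLE_TIERS) -> float:
    """Credit owed under a tiered schedule; the lowest matching threshold wins."""
    for threshold, credit_percent in sorted(tiers):
        if measured_uptime < threshold:
            return monthly_bill * credit_percent / 100
    return 0.0
```

With these example tiers, a month at 99.5% uptime on a $300 bill yields a $30 credit, regardless of what the outage cost the business.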
Think of SLA credits as a signal, not compensation. If you are frequently claiming credits, the provider is not meeting your needs and it is time to evaluate alternatives. For a framework on evaluating provider reliability, see this vendor reliability scorecard. For a look at the real business impact of downtime, see cost of website downtime.
Building an SLA Monitoring Practice
For Small Businesses
You do not need a complex setup. A synthetic monitoring tool checking your site every minute from multiple locations gives you the core data. Export monthly uptime reports and compare them against your providers' SLA commitments. File credit claims when the numbers do not match.
Key things to set up:
- 1-minute check intervals for your main website and any critical endpoints.
- Alerts via email and SMS so you know about outages in real time.
- Monthly uptime reporting to track trends over time.
- A calendar reminder to review SLA compliance monthly and file claims within the window.
For Mid-Size and Enterprise
Larger organizations typically need:
- Monitoring across more endpoints (APIs, staging, internal services).
- Integration with incident management tools.
- Automated SLA compliance reporting for multiple providers.
- Historical data retention for annual vendor reviews.
- Multi-region monitoring to detect geographic-specific issues.
The core principle is the same: measure independently, compare against commitments, and use the data for vendor management decisions.
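That comparison step can be automated across providers. In this sketch, the provider names and the dictionary layout are assumptions; the measured figures would come from your own monitoring data.

```python
from typing import Dict, List


def sla_breaches(committed: Dict[str, float], measured: Dict[str, float]) -> List[str]:
    """Names of providers whose measured uptime % fell below the committed %.

    Providers with no measurement are skipped rather than flagged.
    """
    return sorted(
        name for name, target in committed.items()
        if name in measured and measured[name] < target
    )
```

For example, with commitments of `{"hosting": 99.9, "cdn": 99.99, "dns": 100.0}` and measurements of `{"hosting": 99.95, "cdn": 99.97, "dns": 100.0}`, only `"cdn"` is returned, which tells you where to look at filing a claim.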
Choosing a Monitoring Approach
When selecting a tool for SLA monitoring, prioritize:
- Check frequency. 1-minute intervals or better for accurate SLA measurement.
- Multiple check locations. At least 3 geographic regions.
- Historical data. You need months of history for SLA claims and vendor reviews.
- Uptime reporting. The ability to generate monthly uptime percentages, ideally with exportable data.
- Alerting. Real-time notifications so you can document incidents as they happen.
The specific tool matters less than using it consistently. Any monitoring tool that checks every minute from multiple locations and stores historical data will give you what you need for SLA tracking. What matters is that you set it up, leave it running, and actually review the data.