Website Downtime: Causes, Costs, and Prevention
A complete guide to website downtime. Covers types of downtime, root causes, financial and operational costs, prevention strategies, incident response metrics, and building a resilient infrastructure.
Every website goes down eventually. The question is not whether it will happen but how often, how long, and how much it will cost you when it does.
Downtime is not a single problem. It is a category of problems with different causes, different impacts, and different solutions. A planned maintenance window at 3 AM on a Sunday is downtime. So is a catastrophic server failure during Black Friday. Treating them the same way leads to bad decisions about where to invest your time and money.
This guide breaks downtime into its component parts. You will learn what types of downtime exist, what causes each type, what it actually costs your business, how to prevent it, and how to measure your response when prevention fails. Whether you run a personal blog or a platform serving millions of users, the principles are the same. The scale changes, but the fundamentals do not.
Types of downtime
Not all downtime looks the same. Understanding the distinctions helps you prioritize your response and communicate clearly with stakeholders.
Planned vs. unplanned downtime
Planned downtime is scheduled in advance. Server maintenance, database migrations, software upgrades, and infrastructure changes all fall into this category. You control the timing, you can notify users ahead of time, and you have a rollback plan if something goes wrong.
Unplanned downtime is everything else. A server crashes. A certificate expires. A DNS record gets misconfigured. A deployment introduces a bug that takes down the application. Unplanned downtime is more expensive per minute because it arrives without warning and without a plan.
The distinction matters for SLA calculations. Most service level agreements exclude planned maintenance windows from availability calculations, as long as the provider gives adequate notice. Unplanned downtime counts against your uptime percentage.
Total vs. partial downtime
Total downtime means your site is completely unreachable. Nobody can load any page. The server is not responding, the DNS is broken, or the network path is severed. This is the most visible and most urgent type of failure.
Partial downtime is more subtle and often more insidious. Your homepage loads but the checkout page throws a 500 error. Your API responds to read requests but write operations time out. Your site works for users in North America but is unreachable from Europe.
Partial downtime is harder to detect with basic monitoring. A simple ping check will report your site as "up" even if half your application is broken. This is why content-based monitoring and multi-endpoint checks matter.
Regional vs. global downtime
Downtime can also be scoped by geography. A CDN node failure in Frankfurt takes your site offline for European users while everyone else is unaffected. A routing issue at a specific ISP makes your site unreachable for that ISP's customers. A DNS propagation problem causes resolution failures in certain regions.
Regional downtime is common and frequently goes undetected because the team managing the site is often in a region where everything works fine. Multi-location monitoring catches regional issues that single-location monitoring misses entirely.
What causes downtime
Every outage has a root cause, and understanding the most common ones helps you prioritize your prevention efforts. Here are the major categories, roughly ordered by how frequently they cause real-world outages.
Server and infrastructure failures
Hardware fails. SSDs wear out, RAM develops errors, power supplies die, and network interfaces go bad. In cloud environments, the underlying physical hardware can still fail, taking your virtual machines with it.
Beyond hardware, server-level issues include:
- Resource exhaustion. CPU at 100%, memory full, disk space consumed by logs, file descriptors exhausted. Any of these will degrade or crash your application.
- Process crashes. Your web server, application server, or database process terminates unexpectedly. Without process supervision and auto-restart, a single crash becomes sustained downtime.
- Operating system issues. Kernel panics, failed OS updates, filesystem corruption, and clock drift can all take a server offline.
- Configuration errors. A typo in an Nginx config file, a misconfigured firewall rule, or an incorrect environment variable can take down a perfectly healthy server.
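Resource exhaustion in particular can be caught before it crashes anything. A minimal sketch of a disk-space check using Python's standard library (the 80% and 95% thresholds are illustrative assumptions, not recommendations):

```python
import shutil

def classify(pct, warn_at=80.0, crit_at=95.0):
    """Classify a usage percentage as 'ok', 'warning', or 'critical'.
    Thresholds are illustrative; tune them to your environment."""
    if pct >= crit_at:
        return "critical"
    if pct >= warn_at:
        return "warning"
    return "ok"

def check_disk(path="/"):
    """Measure disk usage at the given path and classify it."""
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100
    return pct, classify(pct)
```

The same classify-against-thresholds pattern applies to memory, file descriptors, and connection counts; the point is to alert at "warning" long before "critical" becomes an outage.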
DNS failures
DNS is the foundation of how users reach your website. When DNS resolution fails, your site is effectively nonexistent, regardless of whether the server itself is running perfectly.
Common DNS failure scenarios include:
- Nameserver outages. If your DNS provider goes down and you have not configured secondary nameservers, your domain stops resolving. This has happened to major providers, causing widespread outages for thousands of sites simultaneously.
- Misconfigured records. Pointing an A record to the wrong IP, deleting a CNAME accidentally, or setting incorrect TTL values can all break resolution.
- Expired domains. When a domain expires, DNS resolution stops. This is one of the most preventable causes of downtime.
- Propagation delays. DNS changes can take hours to propagate globally. During that window, some users reach the old destination while others reach the new one, and some reach neither.
For diagnosing DNS issues, see how to check DNS records. Dedicated DNS monitoring catches these problems before they affect users.
SSL and TLS failures
An expired SSL certificate does not just show a warning. It blocks users entirely. Modern browsers refuse to load pages with invalid certificates, displaying a full-screen error that most visitors will not click through.
SSL-related downtime causes include expired certificates, incomplete certificate chains, misconfigured TLS settings, and certificate/key mismatches. Auto-renewal helps but fails silently more often than most teams realize.
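Because renewal failures are silent, it pays to check expiry independently of the renewal tooling. A sketch using Python's ssl module to compute days remaining from a certificate's notAfter timestamp (the date strings are illustrative; in practice the value comes from getpeercert() on a live connection):

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter timestamp, in the string
    format the ssl module reports, e.g. 'Jun 26 21:41:46 2025 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400

def needs_renewal(not_after, threshold_days=14, now=None):
    """Alert well before expiry so silent auto-renewal failures
    are caught while there is still time to react."""
    return days_until_expiry(not_after, now) <= threshold_days
```

Escalate as the threshold approaches: a 30-day warning can go to a dashboard, a 7-day warning should page someone.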
DDoS attacks
Distributed denial-of-service attacks flood your server with traffic designed to overwhelm its capacity. Volumetric attacks saturate your bandwidth. Application-layer attacks exploit expensive operations like database queries or form submissions. Protocol attacks target weaknesses in network protocols.
DDoS attacks have become cheaper to launch and more common. Even modest attacks can overwhelm a server that is not behind a DDoS mitigation service. The attack does not need to be sophisticated if your infrastructure has no protection.
Deployment failures
Code deployments are one of the most frequent causes of unplanned downtime. Common culprits include a bug that was not caught in testing, a database migration that locks tables for too long, a dependency update that introduces an incompatibility, and a configuration change that does not match the production environment.
The risk compounds when teams deploy infrequently, because each deployment contains more changes and more potential failure points. Smaller, more frequent deployments are generally safer because each one contains less risk and is easier to roll back.
Hosting provider outages
Your infrastructure depends on your hosting provider's infrastructure. When AWS has a regional outage, every service running in that region goes down. When a shared hosting provider has a problem, every site on the affected server goes down.
Cloud provider outages are rare but impactful. The major providers (AWS, Google Cloud, Azure) each experience multiple significant incidents per year. If your architecture is single-region, a regional outage means total downtime.
Domain expiry
A domain that expires silently is one of the most frustrating causes of downtime because it is entirely preventable. Payment methods expire, renewal emails go to former employees, and auto-renew fails because the registrar account has a billing problem.
The result is that your perfectly healthy server becomes unreachable because the domain name no longer points to it. See the vendor outage response playbook for handling situations where your outage depends on a third party.
Database failures
Databases are the backbone of dynamic websites. When the database goes down or becomes unresponsive, every page that depends on it fails, which usually means every page.
Common database failure scenarios include:
- Connection pool exhaustion. The application opens more database connections than the server can handle. New requests queue up and eventually time out.
- Slow queries. A single expensive query can lock tables and block all other operations. One bad query in a rarely-used admin panel can bring down the entire site.
- Replication lag. In read-replica architectures, excessive replication lag means read replicas serve stale data or fall so far behind that they disconnect from the primary.
- Disk full. Databases write transaction logs, temporary files, and data pages to disk. When disk space runs out, the database crashes.
- Corruption. Power failures, hardware errors, or software bugs can corrupt database files, requiring recovery from backups.
Third-party dependency failures
Modern websites depend on external services: payment processors, authentication providers, CDNs, analytics platforms, CMS APIs, and more. When any of these fail, your site may partially or completely break, even though your own infrastructure is healthy.
The more external dependencies you have, the more potential points of failure exist. Each dependency has its own uptime track record, and your effective uptime is the product of all of them. A site that depends on five third-party services, each with 99.9% uptime, has a theoretical maximum availability of about 99.5% from those dependencies alone.
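That compounding effect is simple arithmetic: serial dependencies multiply. A quick sketch of the math from the paragraph above:

```python
def combined_availability(*availabilities):
    """Effective availability when every dependency must be up:
    the product of the individual availabilities (each in 0..1)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Five dependencies at 99.9% each compound to roughly 99.5%:
deps = [0.999] * 5
effective = combined_availability(*deps)
```

This is the theoretical ceiling before your own infrastructure's failures are counted, which is why reducing hard dependencies matters as much as improving any single one.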
Mitigate third-party risk by implementing timeouts and circuit breakers for external calls, caching responses where possible, and designing graceful degradation paths so that a failed dependency degrades the experience rather than taking down the entire page.
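The circuit-breaker idea fits in a few lines. This is a minimal illustration of the pattern, not a production implementation; the thresholds, the clock source, and the omission of a half-open probing state are all simplifying assumptions:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after repeated errors,
    then retry after a cooldown. Serving a fallback (cached data,
    a degraded page) keeps one dead dependency from taking down
    the whole request."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While the circuit is open, serve the fallback until the
        # cooldown elapses, then close it and try the dependency again.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            return fallback()
```

Mature libraries add a half-open state that lets a single probe request through before fully closing; the sketch above trades that for brevity.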
The cost of downtime
Downtime costs money. The exact amount depends on your business, your traffic, and when the outage happens, but the costs are almost always higher than people estimate.
Revenue loss
The most direct cost. If your site generates revenue through sales, subscriptions, or advertising, every minute of downtime is lost revenue that you will not recover.
For e-commerce sites, the calculation is straightforward: take your average revenue per minute and multiply by the minutes of downtime. Gartner estimated the average cost of IT downtime at $5,600 per minute across industries. [1] For large e-commerce platforms during peak periods, the number is far higher. Amazon's estimated cost of downtime in 2013 was $66,240 per minute. [2] Adjusted for growth, that figure is substantially higher today.
The revenue impact extends beyond the outage itself. Users who tried to make a purchase during the outage may not come back. Ad campaigns driving traffic to a broken site waste their entire budget. Affiliate partners lose confidence.
For a detailed breakdown, see cost of website downtime.
Reputation damage
Trust is fragile. A single outage during a critical moment, like a product launch, a breaking news event, or a sales promotion, can define how users perceive your reliability for months or years afterward.
Social media amplifies the damage. When a popular service goes down, it trends on X (formerly Twitter) within minutes. Screenshots of error pages get shared widely. The narrative shifts from "they had a brief outage" to "they are unreliable."
B2B customers are particularly sensitive to reliability concerns because their own services may depend on yours. An outage that affects your API can cascade into outages for your customers, damaging relationships at multiple levels.
SEO impact
Search engines crawl your site continuously. When Googlebot arrives during an outage and receives a 5xx error, it notes the failure. A single brief outage is unlikely to affect rankings. But repeated outages, or a prolonged one, signal to search engines that your site is unreliable.
Extended downtime can lead to pages being temporarily deindexed. If your site is down for hours, Google may remove affected pages from search results until it can verify the site is back. Recovering those rankings takes time and is not guaranteed. [3]
Support costs
When your site goes down, your support team bears the immediate burden. Users file tickets, send emails, call phone lines, and post on social media. Support volume spikes dramatically during outages, and every interaction costs money in staff time.
Even after the site comes back, the support backlog from the outage takes hours or days to clear. And each of those support interactions is a negative customer experience that has its own downstream effects on retention and satisfaction.
Contractual penalties
If you sell a service with an SLA, downtime that exceeds your promised availability triggers contractual consequences. Service credits, reduced fees, or contract termination clauses can all kick in.
An SLA promising 99.9% uptime allows 8 hours and 45 minutes of downtime per year. A single bad outage can consume that entire budget in one day. See what is high availability for how high-availability architectures help you meet aggressive SLA targets.
The cost of downtime is not just the revenue you lose during the outage. It includes the customers who leave, the search rankings that slip, the SLA credits you owe, and the staff time spent on incident response and cleanup. The true cost is almost always three to five times the direct revenue loss.
Incident response metrics
You cannot improve what you do not measure. The following metrics give you a framework for quantifying your incident response performance and tracking improvement over time.
MTTD (Mean Time to Detect)
MTTD measures the average time between the start of an incident and when your team becomes aware of it. If your site goes down at 2:14 PM and your monitoring alert fires at 2:15 PM, your MTTD for that incident is 1 minute.
MTTD is primarily a function of your monitoring configuration. Check frequency, monitoring locations, and alert routing all affect how quickly you learn about a problem. A 1-minute check interval gives you MTTD of roughly 1 to 2 minutes. A 5-minute check interval means up to 5 minutes of undetected downtime per incident.
MTTA (Mean Time to Acknowledge)
MTTA measures the time between an alert being fired and a team member acknowledging it. This metric captures how quickly your on-call process puts a human on the problem.
High MTTA indicates issues with alert routing, on-call schedules, or notification channels. If alerts go to an email inbox that nobody checks at night, your MTTA during off-hours will be terrible.
MTTR (Mean Time to Repair)
MTTR measures the total time from incident start to resolution. It is the single most important incident response metric because it directly corresponds to the duration of user impact.
MTTR = MTTD + MTTA + time to diagnose + time to fix.
Improving any component reduces total MTTR. Faster detection, faster acknowledgment, better runbooks for diagnosis, and automated remediation all contribute.
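Computed from incident timestamps, the relationship looks like this (the timestamps and field names are illustrative):

```python
from datetime import datetime

def incident_metrics(started, detected, acknowledged, resolved):
    """Per-incident durations in minutes. Averaged across incidents,
    these fields become MTTD, MTTA, and MTTR respectively."""
    minutes = lambda a, b: (b - a).total_seconds() / 60
    return {
        "time_to_detect": minutes(started, detected),
        "time_to_acknowledge": minutes(detected, acknowledged),
        "time_to_repair": minutes(started, resolved),
    }

# The 2:14 PM example from the MTTD section, carried through to resolution:
m = incident_metrics(
    started=datetime(2024, 3, 1, 14, 14),
    detected=datetime(2024, 3, 1, 14, 15),
    acknowledged=datetime(2024, 3, 1, 14, 18),
    resolved=datetime(2024, 3, 1, 14, 46),
)
```

Note that time_to_repair is measured from incident start, not from acknowledgment, so it already contains detection and acknowledgment time.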
MTBF (Mean Time Between Failures)
MTBF measures the average time between incidents. A higher MTBF means your systems are more reliable. A declining MTBF means something is getting worse, whether it is aging hardware, growing technical debt, or increasing load.
MTBF is a lagging indicator. It tells you about past reliability. But tracking it over time reveals trends that help you anticipate future problems and prioritize infrastructure investment.
MTTF (Mean Time to Failure)
MTTF is similar to MTBF but applies specifically to non-repairable systems or components. In practice, it is used for hardware components and is less relevant for software systems where the "failure" is a crash that can be recovered from.
Using metrics effectively
Individual incident metrics are useful for post-mortems. Aggregate metrics over time are useful for trend analysis and goal setting. Track MTTR, MTBF, and MTTD monthly and quarterly. Set targets based on your current baseline, then work systematically to improve.
For more on how these metrics fit together, see incident response metrics.
Preventing downtime
Prevention is cheaper than remediation. Every dollar spent on redundancy, monitoring, and testing saves multiple dollars in lost revenue, support costs, and reputation repair.
Redundancy
Single points of failure are the enemy. Anywhere your architecture depends on exactly one instance of something, you have a downtime risk. The solution is redundancy at every layer:
- Multiple servers. Run at least two instances of your application behind a load balancer. If one goes down, the other handles traffic.
- Multiple availability zones. In cloud environments, spread instances across availability zones. An AZ outage should not take down your entire service.
- Multiple regions. For the highest availability, deploy across multiple geographic regions with failover routing. This protects against regional outages.
- Multiple DNS providers. Use secondary DNS to protect against DNS provider outages. If your primary DNS goes down, the secondary continues to resolve your domain.
- Database replication. Run read replicas and configure automatic failover for your primary database. A database failure without replication means total downtime.
See high availability hosting for architecture patterns that minimize single points of failure.
Monitoring
You cannot fix what you do not know about. Monitoring is the foundation of incident response, and the quality of your monitoring directly determines your MTTD.
Effective monitoring covers multiple layers:
- Uptime monitoring. Check your site from multiple locations every minute. Validate both status codes and content. Follow uptime alerts best practices to avoid alert fatigue while maintaining coverage.
- SSL monitoring. Track certificate expiry dates with escalating alerts. Monitor chain validity and configuration. An expired certificate is a preventable outage.
- DNS monitoring. Verify DNS resolution continuously. Detect record changes, propagation failures, and provider outages.
- Application performance monitoring. Track response times, error rates, and throughput. Degradation is often a precursor to a full outage.
- Infrastructure monitoring. Watch CPU, memory, disk, and network utilization. Resource exhaustion is a top cause of crashes.
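A content-aware check validates both the status code and an expected string in the body, which is what catches the partial outages a bare ping misses. A minimal sketch of the decision logic (the expected text is an assumption; a real monitor fetches over the network from multiple locations):

```python
def evaluate_check(status_code, body, expected_text):
    """Decide up/down from an HTTP response. A status-only check would
    miss a page that returns 200 OK with an error message in the body."""
    if status_code != 200:
        return False, f"unexpected status {status_code}"
    if expected_text not in body:
        return False, "expected content missing"
    return True, "ok"

# A 200 response with a broken page is still a failure:
up, reason = evaluate_check(200, "<h1>Database error</h1>", "Add to cart")
```

Run the same logic from several regions and alert only when multiple locations agree, which filters out transient single-location network noise.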
Testing
Testing your resilience before an incident is better than discovering weaknesses during one.
- Load testing. Simulate peak traffic to identify bottlenecks and capacity limits. Know your breaking point before real traffic finds it.
- Chaos engineering. Deliberately inject failures (kill processes, simulate network partitions, fill disks) to verify that your redundancy and failover mechanisms actually work.
- Deployment testing. Use staging environments that mirror production. Test every deployment in staging before it reaches production. Automate rollback so a bad deployment can be reversed in seconds.
- Disaster recovery drills. Practice your incident response process regularly. Simulate outages and walk through the response. The drill will reveal gaps in your runbooks, communication plans, and escalation procedures.
Incident response planning
When downtime occurs, the speed of your response depends on how well you have prepared. An incident response plan defines:
- Who responds. On-call rotation, escalation paths, and contact information.
- How they are notified. Alert channels, notification priority, and acknowledgment requirements.
- What they do first. Triage procedures, diagnostic checklists, and runbooks for common failure modes.
- How they communicate. Status page updates, internal communication channels, and customer notification templates.
- How they close the loop. Post-incident review process, action item tracking, and metric recording.
For specific recommendations on structuring your incident workflow, see how to reduce website downtime.
Incident response workflow
When your monitoring alerts fire and your site is down, here is the practical sequence of steps to follow. This workflow applies regardless of the root cause.
Step 1: Detect and alert
Your monitoring system detects the failure and sends an alert. The alert should include:
- What failed (which monitor, which URL, which check type)
- When it failed (timestamp of first failed check)
- Where it failed (which monitoring locations report failures)
- The error details (status code, timeout, connection refused, SSL error)
If your monitoring is well-configured, this step happens automatically within 1 to 2 minutes. See uptime alerts best practices for setting up effective alerting.
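As a structured payload, an alert carrying those four pieces of information might look like this (the field names and values are illustrative, not a specific tool's schema):

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class DowntimeAlert:
    monitor: str                 # what failed: which monitor
    url: str                     # what failed: which URL
    check_type: str              # what failed: which check type
    failed_at: str               # when: ISO 8601 time of first failed check
    failing_locations: list = field(default_factory=list)  # where it failed
    error: str = ""              # error details: status, timeout, SSL, etc.

alert = DowntimeAlert(
    monitor="production-website",
    url="https://example.com/",
    check_type="https",
    failed_at="2024-03-01T14:14:05Z",
    failing_locations=["frankfurt", "virginia", "singapore"],
    error="connection timeout after 30s",
)
payload = json.dumps(asdict(alert))  # ready to route to email, Slack, SMS
```

An alert with all of these fields lets the responder skip straight to assessment; an alert that says only "site down" forces them to rediscover the context under pressure.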
Step 2: Acknowledge and assess
The on-call responder acknowledges the alert and performs an initial assessment:
- Is this a total or partial outage?
- Which services are affected?
- Is this a known issue type with an existing runbook?
- How many users are likely affected?
This assessment determines the urgency and the escalation level. A total outage of your production website at noon on a Tuesday demands all hands. A partial outage of an internal tool at midnight can wait for normal hours if it is not customer-facing.
Step 3: Diagnose
Work through the diagnostic checklist:
- Can you reach the server at all? (ping, SSH)
- Is the web server process running? (systemctl status, process list)
- Are there recent error log entries? (web server logs, application logs)
- Did anything change recently? (deployments, configuration changes, DNS updates)
- Is it a resource issue? (CPU, memory, disk, connections)
- Is it DNS? (resolve the domain from multiple locations)
- Is it SSL? (check certificate validity and chain)
- Is it a third-party dependency? (check status pages of dependencies)
If your website is down and you are not sure where to start, work through this list from top to bottom.
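Several of these checks can be scripted so they run in seconds instead of minutes. A sketch of the DNS-resolution and TCP-reachability steps using Python's standard library (the host and port in the comments are illustrative):

```python
import socket

def resolve(hostname):
    """DNS check: return the resolved IP addresses, or an empty
    list if the name does not resolve at all."""
    try:
        infos = socket.getaddrinfo(hostname, None)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

def port_open(host, port, timeout=3.0):
    """Connectivity check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Interpretation, e.g. for example.com port 443:
# - resolve() returns addresses but port_open() fails: server or
#   firewall problem, not DNS.
# - resolve() returns nothing: start with nameservers, not the server.
```

Running both checks first tells you which branch of the checklist to follow, which is often the biggest time saver in a diagnosis.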
Step 4: Mitigate
The priority is restoring service, not finding root cause. If rolling back the last deployment fixes the problem, roll back first and investigate later. Common mitigation actions:
- Restart the web server or application process
- Roll back a recent deployment
- Failover to a backup server or region
- Clear a full disk by rotating or deleting logs
- Renew an expired SSL certificate
- Correct a DNS misconfiguration
Step 5: Communicate
Update your status page. Notify affected customers if appropriate. Post in internal channels so the rest of the team knows the situation. Good communication during an outage reduces support volume and preserves trust.
Step 6: Resolve and review
Once service is restored, verify from multiple locations and check types. Confirm that the fix is stable, not just temporarily masking the problem.
Then conduct a post-incident review. Document:
- Timeline of the incident (when it started, when it was detected, when it was resolved)
- Root cause
- Impact (users affected, duration, revenue lost)
- What worked well in the response
- What could be improved
- Action items to prevent recurrence
Track those action items to completion. A post-mortem without follow-through is just documentation.
Downtime for IT teams
For organizations with dedicated IT infrastructure, downtime takes on additional dimensions. Internal applications, enterprise systems, and employee-facing tools all have their own availability requirements and cost structures.
IT downtime affects employee productivity directly. If your internal CRM goes down, your sales team cannot work. If your email server fails, communication stops. If your VPN goes down, remote employees are locked out. The IT downtime guide covers the specific challenges and metrics relevant to internal infrastructure.
The principles are the same as external-facing systems: redundancy, monitoring, and incident response. But the stakeholders and communication channels differ. Instead of updating a public status page, you update an internal IT status channel. Instead of measuring lost revenue, you measure lost employee hours.
Track downtime costs separately for customer-facing and internal systems. The cost calculations differ (lost revenue vs. lost productivity), and the investment priorities often differ as well. An internal tool that goes down for an hour might cost less than a customer-facing API that goes down for five minutes.
Building a downtime budget
Perfect uptime is not achievable, and pursuing it beyond a certain point produces diminishing returns. Instead, define an acceptable level of downtime and invest accordingly.
An SLA of 99.9% allows approximately 8 hours and 45 minutes of downtime per year. For most businesses, this is a reasonable target. Achieving 99.99% (52 minutes per year) requires significantly more investment in redundancy, failover automation, and monitoring. Achieving 99.999% (5 minutes per year) requires architecture and operational practices that most organizations do not need.
Your downtime budget is the total amount of downtime your SLA permits. Track your actual downtime against this budget throughout the year. If you burn through half your budget in Q1, you know you need to invest in reliability improvements immediately.
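Budget tracking is simple arithmetic. A sketch of the calculation (the SLA figure and downtime total are illustrative):

```python
def downtime_budget_minutes(sla_percent, period_days=365):
    """Total downtime the SLA permits over the period, in minutes."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)

def budget_remaining(sla_percent, downtime_so_far_minutes, period_days=365):
    """Minutes of budget left; negative means the SLA is already blown."""
    return downtime_budget_minutes(sla_percent, period_days) - downtime_so_far_minutes

budget = downtime_budget_minutes(99.9)   # about 525.6 minutes per year
left = budget_remaining(99.9, 260)       # roughly half the budget burned
```

Reviewing the remaining budget at the end of each quarter turns an abstract SLA into a concrete spending signal for reliability work.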
The key metrics to track against your budget:
- Total downtime minutes per month and per quarter
- Number of incidents per month
- MTTR trend (is your recovery getting faster or slower?)
- MTBF trend (are incidents becoming more or less frequent?)
These numbers should drive quarterly conversations about infrastructure investment. If MTBF is declining, something in your environment is degrading and needs attention. If MTTR is increasing, your incident response process needs improvement.
Measuring availability
Availability is typically expressed as a percentage over a given period. The standard formula is:
Availability = (Total time - Downtime) / Total time × 100
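Applied to monitoring data, the formula is one line of code:

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability = (total time - downtime) / total time * 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100

# A year (525,600 minutes) with 8 hours 45.6 minutes (525.6 minutes)
# of downtime lands exactly on the three-nines threshold:
year = 525_600
three_nines = availability_percent(year, 525.6)
```

Be consistent about the measurement period: a monthly availability figure and an annual one at the same percentage permit very different single-outage durations.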
The industry uses "nines" as shorthand for availability levels:
- 99% (two nines) = 3 days, 15 hours, and 36 minutes of downtime per year
- 99.9% (three nines) = 8 hours and 45 minutes of downtime per year
- 99.95% = 4 hours and 22 minutes of downtime per year
- 99.99% (four nines) = 52 minutes and 34 seconds of downtime per year
- 99.999% (five nines) = 5 minutes and 15 seconds of downtime per year
As a rule of thumb, each additional nine requires roughly an order-of-magnitude increase in infrastructure investment, operational maturity, and automation. Most businesses should target three nines (99.9%) as a starting point and invest further only if the business case justifies the cost.
When calculating availability, be precise about what you are measuring. Availability of the web server is different from availability of the full user experience. A server can return 200 OK while the application behind it is broken. Define availability in terms that match what your users actually experience: can they complete the core actions your site exists to support?
Common downtime scenarios and quick fixes
Here are the most frequent downtime scenarios and the fastest path to resolution for each.
Server returns 502 Bad Gateway. Your reverse proxy (Nginx, HAProxy) cannot reach the backend application. Check if the application process is running. Restart it if needed. Check application logs for crash reasons.
Server returns 503 Service Unavailable. The server is overloaded or in maintenance mode. Check resource utilization. If traffic is the cause, scale up or enable rate limiting. If a maintenance flag is set, disable it.
DNS not resolving. Verify nameserver configuration at your registrar. Check if your DNS provider is experiencing an outage. Verify the zone file for correct records.
SSL certificate expired. Renew immediately. For Let's Encrypt, run certbot renew --force-renewal. For other CAs, log into your account and reissue. Install and restart.
Site loads but checkout is broken. This is a partial outage. Check the specific service or database that handles transactions. Review application logs for errors in the checkout flow.
Site unreachable from specific regions. Check CDN health. Verify DNS resolution from affected regions. Check for ISP-level routing issues. Review CDN configuration for regional settings.
References
- [1] Gartner, "The Cost of Downtime." https://www.gartner.com/en/documents/3956882
- [2] Statista, "Amazon.com revenue loss during downtime events." https://www.statista.com/statistics/266741/net-revenue-of-amazoncom/
- [3] Google Search Central, "Server connectivity issues and Googlebot crawling." https://developers.google.com/search/docs/crawling-indexing/http-network-errors
- [4] Uptime Institute, "Annual Outage Analysis." https://uptimeinstitute.com/resources/research-and-reports
- [5] NIST, "Contingency Planning Guide for Federal Information Systems," SP 800-34 Rev. 1. https://csrc.nist.gov/publications/detail/sp/800-34/rev-1/final
- [6] PagerDuty, "Incident Response Documentation." https://response.pagerduty.com/
- [7] AWS, "Summary of the Amazon S3 Service Disruption," Post-Incident Report. https://aws.amazon.com/message/41926/
Detect downtime before your users do
Monitor your website from multiple locations with checks every minute. Instant alerts through email, Slack, SMS, and more.
Try Uptime Monitor