November 19, 2025
In conversations about software reliability, availability targets are often expressed with reassuring simplicity: “We’re aiming for five nines.” Yet behind that short phrase lies one of the most complex, expensive, and nuanced challenges in software engineering. Achieving high availability is not merely a technical exercise; it is a multi-dimensional problem involving architecture, operations, process, and organisational maturity. And with each additional “9” of availability, the effort and cost required increase not linearly, but exponentially.
No system can be “always available.” Hardware fails, networks partition, dependencies become unreliable, and human error is inevitable. The appropriate question is not whether downtime will occur, but how much downtime is acceptable given the system’s purpose and the business context.
Availability is typically expressed as a percentage of uptime over a year. Even small improvements in this number represent significant differences in reliability expectations:
| Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day |
|---|---|---|---|---|---|
| 99% ("two nines") | 3.65 days | 21.9 hours | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.9% ("three nines") | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.99% ("four nines") | 52.60 minutes | 13.15 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% ("five nines") | 5.26 minutes | 1.31 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% ("six nines") | 31.56 seconds | 7.89 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% ("seven nines") | 3.16 seconds | 0.79 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% ("eight nines") | 315.58 milliseconds | 78.89 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% ("nine nines") | 31.56 milliseconds | 7.89 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
The difference between 99.9% and 99.99%, for example, is not merely 0.09 percentage points: it is the difference between tolerating nearly nine hours of downtime annually and tolerating less than one hour. That leap requires fundamentally different design decisions and operational capabilities.
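The figures above follow directly from the uptime percentage. As a minimal sketch (the function and period definitions are illustrative, not from the original article), the downtime budget for any target can be computed like this:

```python
# Allowed downtime per period for a given availability target (illustrative sketch).
PERIODS_HOURS = {
    "year": 365.25 * 24,
    "quarter": 365.25 * 24 / 4,
    "month": 365.25 * 24 / 12,
    "week": 7 * 24,
    "day": 24,
}

def downtime_budget(availability_pct: float) -> dict[str, float]:
    """Return the allowed downtime in seconds per period for an availability %."""
    unavailable_fraction = 1 - availability_pct / 100
    return {period: hours * 3600 * unavailable_fraction
            for period, hours in PERIODS_HOURS.items()}

# Example: 99.99% ("four nines") allows roughly 52.6 minutes of downtime per year.
budget = downtime_budget(99.99)
print(f'{budget["year"] / 60:.2f} minutes per year')  # ~52.60
print(f'{budget["day"]:.2f} seconds per day')         # ~8.64
```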
Moving from “two nines” (99%) to “three nines” (99.9%) is relatively straightforward. Standard best practices such as redundant servers, load balancing, health checks, and rolling deployments are typically sufficient.
However, pursuing “four nines” (99.99%) introduces a new set of challenges. Achieving this level of reliability often requires:
Automated failover mechanisms and self-healing infrastructure (a simplified failover sketch follows this list)
Multi-region deployments and data replication strategies
Robust CI/CD pipelines with comprehensive testing and rollback capabilities
Stringent change management processes to minimise operational risk
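To make the failover idea concrete, here is a minimal sketch of client-side failover between a primary and a secondary endpoint. The endpoint URLs, timeout, and function names are assumptions chosen for illustration, not a prescription for any particular stack.

```python
import urllib.request
import urllib.error

# Illustrative endpoints; in practice these would sit in separate failure domains
# (different instances, zones, or regions).
ENDPOINTS = [
    "https://primary.example.com/api/status",
    "https://secondary.example.com/api/status",
]

def fetch_with_failover(endpoints: list[str], timeout: float = 2.0) -> bytes:
    """Try each endpoint in order, failing over on error or timeout."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # record the failure and try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error

# Usage: the caller sees a single logical service; failover is transparent.
# payload = fetch_with_failover(ENDPOINTS)
```

Real self-healing infrastructure pushes this logic into load balancers, orchestrators, and DNS rather than application code, but the principle of detecting failure and routing around it is the same.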
Pushing towards “five nines” and beyond requires yet another order of sophistication, including:
Active-active architectures across geographic regions
Advanced observability, anomaly detection, and real-time alerting
Chaos engineering practices to proactively identify unknown failure modes (see the fault-injection sketch after this list)
Highly disciplined on-call operations and well-rehearsed incident response procedures
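As one illustration of the chaos-engineering mindset, the sketch below wraps a dependency call with probabilistic fault injection so that retries, timeouts, and alerts can be exercised before a real outage does it for you. The failure rate, latency figures, and names are assumptions for the example.

```python
import random
import time

def inject_faults(func, failure_rate=0.05, max_extra_latency=0.5):
    """Wrap a callable so it occasionally fails or slows down (illustrative only)."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected fault: simulated dependency outage")
        time.sleep(random.uniform(0, max_extra_latency))  # simulated extra latency
        return func(*args, **kwargs)
    return wrapper

# Usage: apply to a non-critical code path in a controlled environment and
# verify that the surrounding system degrades gracefully.
# lookup_user = inject_faults(lookup_user, failure_rate=0.1)
```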
At each stage, the problem is not simply about “doing the same things better.” Each additional nine introduces fundamentally new categories of risk that must be addressed.
A widely cited principle in site reliability engineering is that each additional nine costs roughly an order of magnitude more than the previous one. While the exact multiplier varies by context, the underlying principle holds: the cost curve for high availability is steep.
The reasons for this are structural:
Redundancy multiplies infrastructure spend. What once required two servers may now require four or eight, often across multiple regions.
Deployment and testing processes become more rigorous and time-consuming. The cost of an error grows with user expectations, necessitating more automation and validation.
Operational complexity increases. Achieving higher reliability demands specialised expertise, around-the-clock monitoring, and investment in tooling.
Dependencies propagate risk. Third-party services, APIs, and networks all become potential points of failure that must be mitigated, often through contractual SLAs, architectural isolation, or internal replacements (a worked example of how dependency availability compounds follows this list).
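A short worked example shows why dependencies matter so much: when requests pass serially through several components, their availabilities multiply. The component count and figures below are assumptions chosen purely for illustration.

```python
# Serial dependencies: overall availability is the product of the parts.
dependencies = [0.999, 0.999, 0.999, 0.999, 0.999]  # five "three nines" services

overall = 1.0
for availability in dependencies:
    overall *= availability

print(f"Overall availability: {overall:.4%}")  # ~99.5010%
# Five components at 99.9% each already miss a 99.9% end-to-end target,
# which is why higher targets push teams toward redundancy and isolation.
```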
As a result, organisations must carefully assess whether the incremental reliability gained by another nine justifies the significant increase in cost and complexity.
It is important to recognise that ultra-high availability is not always necessary, nor is it always desirable. The right availability target depends on the system’s purpose and the consequences of downtime.
For internal tools or non-critical consumer applications, 99.9% may be more than adequate.
For financial systems, healthcare platforms, or safety-critical infrastructure, anything less than 99.99% may be unacceptable.
The crucial point is that availability targets are business decisions as much as technical ones. They should be determined through a careful analysis of user expectations, regulatory requirements, operational risk, and the economic trade-offs involved.
High availability is not something that can be added late in a project or achieved solely through infrastructure choices. It is the outcome of deliberate architectural decisions, disciplined operational practices, and continuous investment. As each additional nine demands disproportionately more effort, the pursuit of availability becomes less about engineering prowess and more about strategic trade-offs.
Achieving five nines is possible, but it is a challenge that only a handful of organisations truly need, and even fewer can justify. For everyone else, success lies not in chasing an arbitrary number, but in designing systems that are reliably available enough for their purpose.