Downtime Is a Resource

In engineering, we're constantly balancing competing priorities: time, cost and complexity. But there's one resource we rarely think of in these tradeoffs—downtime. Instead of treating it as a failure state, we should treat downtime as what it really is: a resource. And like all resources, if we don't spend it wisely, we waste something more precious—developer time, company money, or momentum.

What It Means to Spend Downtime

When I say downtime is a resource, I mean it's something we can choose to spend—not just something we're forced to endure. We spend time and money intentionally to achieve outcomes. Why not do the same with downtime?

Spending downtime can mean:

  • Speeding up a migration by allowing a window of interruption rather than painstakingly maintaining full availability.
  • Reducing architectural complexity by avoiding overengineered redundancy or failover mechanisms.
  • Accepting some operational risk to ship faster or simplify systems.

Example: As part of a data warehouse migration, we deliberately chose to take a short downtime window. It made the change faster, safer, and ultimately cheaper. That was a smart trade.

The Cost of Zero-Downtime Thinking

Too often, teams fall into the trap of treating any downtime as unacceptable. We plan as if every second of interruption will ruin our business. But unless you're operating at Netflix scale, that mindset is probably hurting you more than it's helping.

Here's what happens when "no downtime ever" becomes the norm:

  • We over-engineer for edge cases that never occur.
  • We slow down delivery, making migrations painful and high-stakes.
  • We create fear, where developers feel they can't take necessary risks.
  • We spend too much on cloud redundancy, automation, and complexity.

Engineering should be pragmatic. If you're consistently beating your SLAs by a wide margin, you're not winning; you're wasting budget.

SLAs and the Downtime Budget You're Ignoring

SLAs exist for a reason. They define acceptable levels of availability. For example, 99.9% uptime gives you roughly 43 minutes of downtime per 30-day month. If you've had zero downtime for six months, that's more than four hours of allowed downtime you never used. What are you doing with it?
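To make the arithmetic concrete, here is a minimal sketch of how an availability target translates into allowed downtime. The function name and the 30-day window are assumptions for illustration; real SLAs define their own measurement window, and whether unused minutes carry over from one window to the next depends on the contract.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 30) -> float:
    """Minutes of downtime allowed over `days` at the given availability target.

    Hypothetical helper: assumes the SLA is measured over a fixed window
    of `days` days and counts every minute equally.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets over a 30-day window.
for target in (99.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per 30 days")

# Six 30-day windows at 99.9%, expressed in hours.
six_month_hours = 6 * downtime_budget_minutes(99.9) / 60
print(f"Six months at 99.9%: {six_month_hours:.2f} hours")
```

At 99.9% this works out to about 43.2 minutes per window, or a little over four hours across six windows, which matches the figures above.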

SLAs help us bound risk, so we can be intentional about it:

  • Need to run a risky migration? Maybe 10 minutes of planned downtime is acceptable.
  • Want to remove an overly complex failover system? If the new setup still meets your SLA, it might be worth it.
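A decision like the ones above can be checked with a few lines of arithmetic. This is only a sketch: the function name, the 30-day window, and the example figures (12 minutes already used, 10 minutes planned) are hypothetical and would come from your own SLA and incident history.

```python
def remaining_after_planned(
    availability_pct: float,
    window_days: float,
    used_minutes: float,
    planned_minutes: float,
) -> float:
    """Budget left in the window if the planned downtime goes ahead.

    A negative result means the plan would breach the availability target.
    """
    budget = window_days * 24 * 60 * (1 - availability_pct / 100)
    return budget - used_minutes - planned_minutes


# Hypothetical scenario: 99.9% target, 12 minutes of unplanned downtime
# already this window, and a proposed 10-minute migration window.
left = remaining_after_planned(99.9, 30, used_minutes=12, planned_minutes=10)

if left >= 0:
    print(f"Within budget, {left:.1f} minutes to spare")
else:
    print(f"Over budget by {-left:.1f} minutes")
```

The useful part is less the number itself than having the conversation with the number in front of you: a planned window that still leaves headroom is a very different proposal from one that eats the whole budget.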

If you're always staying way under your SLA, you're under-spending your downtime budget—and probably over-spending somewhere else.

Engineering Tradeoffs, Made Explicit

Tradeoffs are at the core of engineering. Downtime is part of that conversation.

  • Downtime vs. Speed: Migrations are faster when you can afford a window of unavailability. Planning for zero downtime often means building twice.
  • Downtime vs. Money: Redundancy is expensive. So is always-on failover infrastructure.
  • Downtime vs. Complexity: Simpler systems are easier to reason about and operate. Sometimes resilience mechanisms add more fragility than they remove.

When you treat downtime as an option—not a taboo—you unlock a whole category of better, faster, and cheaper decisions.

Changing the Culture Around Downtime

Engineers fear downtime. It's natural. We want to build reliable systems, but some of that fear is cultural:

  • Incidents can be seen as failures, not part of healthy engineering.
  • Developers can feel judged or blamed for risk—even when it's appropriate.
  • We optimise for blame avoidance, not business value.

That mindset holds us back. Engineering leaders need to model a different approach. We should:

  • Normalise discussion of downtime budgets and risk tolerance.
  • Celebrate well-justified tradeoffs—even if they include some risk.
  • Make sure developers feel safe spending downtime when it makes sense.

Blameless culture isn't just for postmortems. It's for tradeoff decisions too.

A Better Way Forward

Here's the question we should be asking:

"Can we spend a little downtime to save a lot of time or money?"

And if the answer is yes, we should be brave enough to do it.

That doesn't mean being reckless. It means knowing your constraints—your SLAs, your architecture, your user expectations—and optimising within them. It means trusting your engineers to make tradeoffs, not just avoid mistakes.

Because the real mistake isn't occasional downtime. It's wasting all your other resources trying to avoid it.

Final Thought

Downtime is a resource. If you never spend it, you're spending something else—and probably spending it poorly.