Why Your "Fail-Fast" Strategy is Killing Your Distributed System (and How to Fix It)
Translated: 2026/3/21 7:00:49
It's 2 AM. PagerDuty fires. Redis master is down. Your application, trained to fail fast, dutifully fails — every single request, all at once. By the time Sentinel promotes a new master 12 seconds later, you've already generated 40,000 errors and three escalation calls. The system recovered on its own. Your application didn't let it.
This is the story of how "good engineering" can make a 12-second infrastructure event into a 12-minute outage — and how to design boundaries that prevent it.
tl;dr — During infrastructure failovers (Redis, Kafka, etcd), blind fail-fast amplifies instability. Bounded retry — centralized, time-boxed, invisible to business logic — absorbs the 10–15 second recovery window without leaking infrastructure noise to users. Resilience is not a library. It is a contract between layers.
When your session storage — Redis, Memcached, or any stateful dependency — goes temporarily unavailable, you face a fundamental architectural choice:
Should you fail fast? Or should you retry?
We all learned fail-fast as gospel. And it is — until it isn't. During transient infrastructure events like leader elections, blind fail-fast propagates instability instead of containing it. The response you choose determines whether the incident resolves itself in 12 seconds or snowballs into a 12-minute outage with three bridge calls.
To understand why fail-fast can backfire, look at the mechanics of a Redis Sentinel failover:
| Phase | Duration | What Happens |
| :--- | :--- | :--- |
| Detection | ~10–12s | Sentinel quorum detects master is down |
| Election | ~1–2s | Sentinels agree on a new master |
| Promotion | ~1s | Replica promoted, clients notified |
| Reconnection | ~1–3s | Clients re-establish connections |
Note: these phases overlap. Total failover typically completes in 12–15 seconds, not the sum of individual phases. Reconnection time also depends heavily on your client library — a Sentinel-aware client with topology refresh (e.g., Lettuce, go-redis with Sentinel support) reconnects in under a second, while a naive connection pool can take 30s+.
During this window, your application sees TCP dial timeouts and connection resets. Nothing is broken. No data is lost. The system is doing exactly what it was designed to do — electing a new leader. Your application just needs to not panic for 12 seconds.
If your application fails immediately on the first connection timeout during this window, four things happen in rapid succession:
1. A 3-second infrastructure blip becomes a user-visible outage. Every request during the failover window returns an error, even though the system would have recovered on its own.
2. Your business layer now exposes raw infrastructure details — "Redis connection refused" — to clients that have no idea what Redis is or why it matters.
3. Clients receiving errors start retrying independently. If you have 1,000 concurrent users and each retries 3 times, you just turned 1,000 QPS into 3,000 QPS — hitting an infrastructure layer that's already struggling to stabilize.
4. Unbounded retries create cascading load amplification. CPU spikes prevent recovery. The system enters an instability feedback loop where the act of trying to recover keeps the system down.
This is the catastrophic outcome. I've seen retry storms take down entire regions.
"Your timeout config was technically correct. Your system was functionally down. That's not a timeout problem — that's a design problem."
Here's the distinction that actually matters in production: the failure TYPE must determine your recovery strategy.
| | Infrastructure-Level | Business-Level |
| :--- | :--- | :--- |
| Examples | Network jitter, leader election, connection reset, READONLY replica response | Validation error, permission denial, domain rule violation |
| Nature | Transient — will resolve on its own | Permanent — retrying won't help |
| Strategy | ABSORB — retry within bounds | FAIL FAST — return error immediately |
Treating a leader election timeout the same as a schema validation error is an architectural mistake. One will resolve in seconds; the other will never succeed no matter how many times you retry.
This is the architectural pattern that makes everything work:
The retry boundary sits in the infrastructure client wrapper — the thin layer between your business code and the dependency client. Not in HTTP middleware, not in individual service handlers, not in a sidecar. In the client wrapper itself.
Why does this matter? Because if retry logic exists at multiple layers, you get retry amplification. I've seen teams with retry in the HTTP handler, the service layer, AND the Redis client — producing 3 × 3 × 3 = 27 attempts per original request. That's not resilience. That's a DDoS against your own infrastructure.
Key principles:
- Retry belongs at the infrastructure boundary — one place, one policy.
- Business logic must remain fail-fast — semantic errors should never be retried.
- By the time an error reaches the client, it has been vetted and classified. We are designing for predictability.
If we're going to retry, we must do it with discipline. Four pillars:
1. Retry logic lives in one place — the infrastructure client wrapper. Not in individual handlers, not in middleware, not in the business layer. One retry boundary per dependency, one policy, one set of metrics.
2. We define a retry budget — for example, 15 seconds. Why 15? Because it encapsulates the 10–12 second Sentinel detection window plus a margin for stabilization and reconnection. Time-based budgets are superior to pure attempt counts because they normalize across different failure modes — a retry that takes 5s per attempt behaves very differently from one that takes 100ms.
3. Maximum 2–3 retry attempts within the budget window, with exponential backoff and jitter. Without jitter, synchronized retries from multiple application instances create a thundering herd — everyone hits the new master at exactly the same moment.
4. If the retry succeeds within the budget, the business layer never knew there was a problem. If it fails, the business layer receives a clean, classified error — not a raw TCP stack trace that means nothing to anyone above the infrastructure layer.
Here's what this looks like in practice:
```go
// Bounded retry wrapper — lives in the infrastructure client layer
func withBoundedRetry(ctx context.Context, budget time.Duration, maxAttempts int, op func() error) error {
	deadline := time.Now().Add(budget)
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if time.Now().After(deadline) {
			break
		}
		lastErr = op()
		if lastErr == nil {
			return nil // success — business layer never knew
		}
		if !isRetryable(lastErr) {
			return normalizeError(lastErr) // permanent failure — fail fast
		}
		// Exponential backoff with jitter
		backoff := time.Duration(1<<attempt) * 500 * time.Millisecond
		jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
		select {
		case <-time.After(backoff + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return normalizeError(lastErr) // budget exhausted — fail deterministically
}
```
```
┌─────────────────────────────────────────────┐
│ Retry Budget: 15 seconds                    │
│                                             │
│ Attempt 1 → timeout (5s) → backoff          │
│ Attempt 2 → timeout (5s) → backoff          │
│ Attempt 3 → success                         │
│                                             │
│ Total elapsed: ~11s                         │
│ Application impact: ZERO                    │
│                                             │
│ ─── OR ───                                  │
│                                             │
│ Budget exhausted → FAIL DETERMINISTICALLY   │
│ Clean, classified error to business layer   │
└─────────────────────────────────────────────┘
```
"Retry is not infinite. Retry is time-boxed. Once the budget is exhausted, we fail deterministically."
This is where most teams get it wrong. They retry everything — or nothing. The retry decision must be driven by error classification:
| Raw Error | Normalized To | Retryable? | Why |
| :--- | :--- | :--- | :--- |
| TCP dial timeout | UNAVAILABLE | Yes | Connection not established, may recover |
| Connection reset | UNAVAILABLE | Yes | Transient network disruption |
| READONLY (replica) | UNAVAILABLE | Yes | Sentinel failover in progress — replica not yet promoted |
| Leader election in progress | UNAVAILABLE | Yes | Raft/consensus transition |
| OOM command not allowed | RESOURCE_EXHAUSTED | No | Backpressure — retrying makes it worse |
| WRONGTYPE | INVALID_ARGUMENT | No | Schema error — will never succeed |
| NOPERM / Permission denied | PERMISSION_DENIED | No | Auth failure — will never succeed |
| NOT_FOUND | NOT_FOUND | No | Semantic absence — retry won't create the resource |
The READONLY case deserves special attention. During Sentinel failover, a replica that hasn't been promoted yet responds with READONLY to write commands. If your retry layer treats this as a permanent error, your circuit breaker trips, clients get errors, and a 12-second failover becomes a 5-minute outage while someone manually resets the breaker. Classify READONLY as UNAVAILABLE — it will resolve when the new master is promoted.
The rule is simple: you cannot leak internal implementation details up the stack. Your retry layer must inspect and reclassify errors — not just map them 1:1. Error semantics must align across every layer.
Bounded retry is the inner loop — it handles transient failures within a known recovery window. But what if the dependency is truly down, not just transitioning?
That's where circuit breakers serve as the outer loop:
- Bounded retry absorbs transient events (leader election, network jitter) — seconds.
- Circuit breaker protects against sustained outages (dependency truly dead) — minutes.
Without a circuit breaker, sustained failures chew through retry budgets on every request, wasting resources. Without bounded retry, every transient blip trips the circuit breaker unnecessarily. They are complementary, not redundant.
A production retry boundary must emit metrics. Without them, you're flying blind:
- `retry_attempt_total` — how often retries fire (by dependency, by error type)
- `retry_budget_exhausted_total` — how often the full budget is consumed without success
- `retry_success_on_attempt` — which attempt number succeeds (histogram)
- `error_classification` — distribution of retryable vs non-retryable errors
The key alert: if retry budget exhaustion rate exceeds ~5%, either your budget is too tight or your dependency is degraded beyond transient. This is the signal that distinguishes a leader election from a real outage — and it's the signal that should trigger your circuit breaker.
If this looks Redis-specific, zoom out. The bounded retry pattern applies to any stateful dependency with leader election:
- Redis Sentinel — master failover with quorum detection, 10–15s window
- NATS JetStream — stream leader election in the Raft group, typically 2–5s with default election timeout
- etcd / Consul — Raft leader election, ~1–2s with default settings, but watch streams may buffer longer
- Kafka — partition leader election via controller, typically 5–15s depending on replica.lag.time.max.ms and ISR size
- CockroachDB / TiKV — range leader election, similar Raft mechanics
The mechanics are the same everywhere: a detection window, a brief period of unavailability, and then recovery. Design your retry budget to absorb that window. Calibrate the budget to the specific system — 15s for Redis Sentinel, 5s for NATS, 20s for Kafka.
Resilience is not a library you import. It is a contract between layers:
| Layer | Responsibility |
| :--- | :--- |
| Infrastructure | Absorbs transient instability via bounded retry |
| Business | Remains fail-fast for semantic integrity |
| Client | Retries only when signaled retryable |
When failure is bounded and classified, the system becomes predictable. And predictability is the foundation of operational confidence.
- [ ] Retry Budget: Is my retry window matched to the dependency's failover time (e.g., 15s for Redis)?
- [ ] Jitter: Do my retries have randomized sleep to avoid the "Thundering Herd"?
- [ ] Error Classification: Does my code distinguish between READONLY (retryable) and PERMISSION_DENIED (not retryable)?
- [ ] Centralization: Is my retry logic in the client wrapper, not leaked across handlers?
- [ ] Observability: Do I have an alert if "Retry Budget Exhausted" exceeds 5%?
- Fail fast — but not during transient infrastructure events. A leader election is not a business error. Don't treat it like one.
- Retry must be bounded. Time-boxed, attempt-limited, with jitter. No open-ended retry loops.
- Retry must be centralized. One retry boundary per dependency, at the infrastructure layer. Retry in multiple layers = retry amplification.
- Failure semantics must be normalized. Retryable vs non-retryable must be explicit. Watch for READONLY — the most common Sentinel failover gotcha.
- Resilience requires cross-layer alignment. Bounded retry (inner loop) + circuit breaker (outer loop) + observability = production-grade resilience.
Frequently Asked Questions
Should distributed systems always fail fast?
No. Fail fast for business-level errors (validation, permission, domain rules), but use bounded retry for transient infrastructure failures like leader election and temporary network instability.
How long should the retry budget be?
In many production setups, 12-15 seconds is a practical starting point because it usually covers Sentinel detection, promotion, and client reconnection. Calibrate with your own failover timings and SLOs.
Should clients retry on top of the server's retries?
Only when explicitly signaled retryable. Blind retries at both layers often create retry amplification and can trigger a retry storm.
What is the difference between bounded retry and a circuit breaker?
Bounded retry handles short transient windows (inner loop). Circuit breaker handles sustained dependency failure and stops repeated expensive attempts (outer loop).
Can't a service mesh handle retries instead of the application?
While Mesh can retry, the application layer has better "semantic awareness." Only the app knows if a specific error is safe to retry based on idempotency.
When should you never retry?
For non-idempotent operations unless you have a robust request-ID tracking system. For business errors (400s), always fail fast.
Distributed systems are not about avoiding failure. They are about designing boundaries.
If retry is everywhere, the system becomes unpredictable.
The goal is not infinite retry.
The goal is bounded retry.
That boundary is what keeps systems stable.
Resilience is not a library. It is a contract between layers.
Based on a talk I gave on failure boundary design in distributed systems.
Originally published at harrisonsec.com. Listen to the deep dive audio for a detailed walkthrough.