dev.to, March 7, 2026


I built an LLM Request Cascade proxy that auto-switches models before you ever timeout


You're mid-task in Claude Code. You hit enter. Then... nothing. 12 seconds later, either the response arrives or you're refreshing. That lag isn't a bug. It's Opus under peak load. It happens constantly during high-traffic hours. And for a developer in an agentic workflow, it feels identical to a crash. I got tired of it, so I built glide: a transparent proxy that sits between your AI agent and the API and automatically switches to a faster model when yours is slow, before you ever experience the timeout.

```shell
pip install glide
glide start
export ANTHROPIC_BASE_URL=http://127.0.0.1:8743
claude  # Claude Code now routes through glide
```

That's the entire setup.

Standard retry logic re-attempts the same slow endpoint, making things worse. Load balancers distribute across identical instances, but LLM models are not identical. LiteLLM does static routing and doesn't adapt to live latency. None of them address the actual failure mode: a model that's slow right now but will recover in 10 minutes.

Time-to-First-Token (TTFT) is measurable during the stream, before the full response arrives. You don't have to wait 15 seconds to know a model is slow; you know at second 4. So glide races each request against a per-model TTFT budget. Exceed it? The connection is cancelled and the next model in the cascade starts immediately.

```
claude-opus-4-6     TTFT budget: 4s   <- best quality, tried first
claude-sonnet-4-6   TTFT budget: 5s   <- fast fallback
claude-haiku-4-5    TTFT budget: 3s   <- fastest Anthropic model
qwen2.5:14b         no limit          <- local Ollama, always works
```

If opus takes 8s to time out and sonnet takes 5s, a naive cascade makes you wait 13s before reaching haiku. That's worse than just waiting for opus. So glide maintains a rolling window of observed TTFT values per model (SQLite-backed, persisting across restarts) and computes the p95 continuously. If a model's p95 already exceeds its budget, glide skips it without waiting.
```
Normal day -> opus p95=2s  -> serves in ~2s
Peak load  -> opus p95=11s -> skipped, sonnet serves in ~1.5s
Recovery   -> opus p95=3s  -> resumes automatically
```

No restarts. No config changes. No intervention.

TTFT covers slow starts but misses a different failure: runaway extended thinking. Claude Opus with extended reasoning emits thinking tokens before any text. A request can get a fast TTFT (thinking starts immediately) but then spend 60 seconds in the reasoning phase. The user sees nothing the whole time. I added TTT (Time-to-Think): the elapsed time from request start until the first text token after thinking completes. Budget exceeded mid-think? Abort and cascade.

```python
# Inline SSE parser, runs during the active stream
if event_type == "content_block_start":
    if block_type == "thinking":
        ttt_start = time.monotonic()  # start TTT clock
    elif block_type == "text":
        ttt = time.monotonic() - ttt_start
        if ttt > budget:
            raise TTTTimeoutError()   # cascade to next model
        text_started = True           # stream from here
```

The tricky part: SSE events can span HTTP chunk boundaries, so you can't just parse per-chunk. I built a buffer that accumulates bytes, splits on `\n\n`, and parses complete events while yielding chunks to the client and monitoring inline.

Proactive routing handles sustained load. But when a model is trending slow, not yet over budget but elevated, you're still exposed on individual tail requests. This is the same problem Google solved in "The Tail at Scale" (2013): send the same request to two replicas and use whichever responds first. I applied that idea across heterogeneous model tiers. But you don't want to double your API cost on every request.
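The chunk-boundary-safe buffering mentioned above can be sketched as follows. This is a minimal illustration of the idea, not glide's actual parser; the event fields are made up for the demo.

```python
class SSEBuffer:
    """Accumulate raw HTTP chunks and emit only complete SSE events."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk: bytes):
        # Add one HTTP chunk; yield every SSE event it completes.
        # Partial trailing data stays buffered for the next chunk.
        self._buf += chunk
        while b"\n\n" in self._buf:
            raw, self._buf = self._buf.split(b"\n\n", 1)
            event = {}
            for line in raw.decode().splitlines():
                if ":" in line:
                    field, _, value = line.partition(":")
                    event[field.strip()] = value.lstrip()
            yield event

buf = SSEBuffer()
# One SSE event split across two HTTP chunks, mid-line:
events = list(buf.feed(b'event: content_block_start\ndata: {"ty'))
events += list(buf.feed(b'pe": "thinking"}\n\n'))
print(events)
# -> [{'event': 'content_block_start', 'data': '{"type": "thinking"}'}]
```

A per-chunk parser would have mangled the split `data:` line; the buffer only parses once the `\n\n` event delimiter arrives.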
So glide computes a routing decision before each request using the observed p95:

| Decision | Condition | Action |
| --- | --- | --- |
| SOLO | primary p95 < 80% of budget | Fire only the primary; it's healthy |
| HEDGE | primary risky, backup healthy or cold | Fire both, race on an asyncio queue, stream the winner, cancel the loser |
| SKIP | both risky | Skip the hedge entirely, go to the sequential cascade |

```python
def _hedge_decision(hedge_models, budget_1, budget_2):
    p95_1 = registry.get(hedge_models[0].model).p95()
    p95_2 = registry.get(hedge_models[1].model).p95()
    if p95_1 is None:
        return "hedge"  # cold start, hedge conservatively
    if p95_1 < budget_1 * 0.8:
        return "solo"   # healthy, no cost wasted
    if p95_2 is not None and p95_2 >= budget_2 * 0.8:
        return "skip"   # both slow, sequential is better
    return "hedge"      # first risky, second healthy, race them
```

The 80% threshold catches the trend before models actually start failing individual requests. When a hedge fires, the losing task gets `task.cancel()`, which propagates through httpx's `async with client.stream()` context manager, closing the upstream HTTP connection immediately. No resource leaks.

All cascade providers yield Anthropic SSE internally; glide converts at the edge for each provider:

- OpenAI: `anthropic_to_openai()` for the request body, `stream_openai_as_anthropic()` for the response
- Gemini: `anthropic_to_gemini()` and `stream_gemini_as_anthropic()`
- Ollama: already streaming JSON, wrapped to Anthropic SSE

Mix providers freely:

```shell
export CASCADE_JSON='[
  {"provider": "anthropic", "model": "claude-opus-4-6", "ttft_budget": 4.0},
  {"provider": "openai", "model": "gpt-4o", "ttft_budget": 5.0},
  {"provider": "google", "model": "gemini-2.0-flash", "ttft_budget": 3.0},
  {"provider": "ollama", "model": "qwen2.5:14b", "ttft_budget": null}
]'
glide start
```

glide accepts both `POST /v1/messages` (Anthropic) and `POST /v1/chat/completions` (OpenAI) and returns the matching format automatically.
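The HEDGE branch's race-and-cancel mechanics can be sketched with plain asyncio tasks. The coroutines below are stand-ins for real streaming requests, not glide's code:

```python
import asyncio

async def hedge(primary, backup):
    # Fire both, keep whichever finishes first, cancel the other.
    tasks = [asyncio.create_task(primary()), asyncio.create_task(backup())]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # in glide this propagates into httpx's stream context
    await asyncio.gather(*pending, return_exceptions=True)  # reap cancellations
    return done.pop().result()

# Stand-in coroutines for real model calls
async def slow_primary():
    await asyncio.sleep(0.2)   # primary trending slow
    return "primary reply"

async def fast_backup():
    await asyncio.sleep(0.01)  # healthy backup
    return "backup reply"

print(asyncio.run(hedge(slow_primary, fast_backup)))  # -> backup reply
```

Cancelling the pending task is what keeps the hedge from silently doubling cost: the loser's upstream work stops as soon as a winner exists.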
```shell
curl http://127.0.0.1:8743/metrics
```

```
glide_requests_total 42.0
glide_hedge_decision_total{decision="solo"} 30.0
glide_hedge_decision_total{decision="hedge"} 10.0
glide_hedge_decision_total{decision="skip"} 2.0
glide_hedge_winner_total{model="claude-sonnet-4-6"} 8.0
glide_ttft_p95_seconds{model="claude-opus-4-6"} 3.82
glide_ttft_p95_seconds{model="claude-sonnet-4-6"} 0.41
glide_ttft_samples_total{model="claude-opus-4-6"} 20.0
```

Standard Prometheus text format, no extra dependencies, formatted manually. Plug it into Grafana or scrape it directly.

I'm calling this the LLM Request Cascade Pattern, a reliability primitive with three components:

1. Budget-based streaming abort: TTFT and TTT as actionable in-stream health signals
2. Proactive p95 routing: skip models whose recent observed p95 exceeds their budget
3. Adaptive hedging: race models when borderline slow, not on every request

It sits alongside two existing patterns:

- Circuit breaker (binary up/down), handled by llm-circuit
- Load balancing (identical replicas), not applicable to heterogeneous model tiers

The cascade is specifically for the heterogeneous LLM ecosystem: different models with different quality/speed/cost tradeoffs, where you want to route to the best option that can actually respond in time.

```shell
pip install glide
glide start
export ANTHROPIC_BASE_URL=http://127.0.0.1:8743
```

Works with Claude Code, Cursor, code_puppy, or anything using the Anthropic or OpenAI API.

GitHub: https://github.com/phanisaimunipalli/glide
Pattern docs: https://github.com/phanisaimunipalli/glide/blob/main/docs/the-cascade-pattern.md
HN thread: https://news.ycombinator.com/item?id=47285435

22 tests, MIT license. Would love feedback, especially on the mid-stream SSE abort implementation and the hedge trigger thresholds.