Anthropic April 23 Postmortem: 3 Confounding Changes Behind Claude Code's Month-Long Quality Drop
Reading like an aviation accident report — engine fault, pilot error, weather. Each alone wouldn't have crashed the plane. Three together did.
For the past month, developers using Claude Code have reported degraded quality: shorter responses, lost context mid-session, and faster usage limit consumption. On April 23, 2026, Anthropic published a detailed engineering postmortem confirming what many suspected. Three separate, concurrently-deployed changes compounded into broad, hard-to-diagnose symptoms.
This is one of the most transparent AI lab postmortems I've seen. The engineering lessons on incident governance for AI systems are worth unpacking — especially for solo developers and small teams working with LLMs in production.
Per Anthropic's writeup, only three products were affected:
Claude Code
Claude Agent SDK
Claude Cowork
The Anthropic API itself was not affected. So if you wired your app directly to the API, you likely felt nothing. If you used Claude Code CLI, you spent April fighting your tools.
The GitHub trail backs this up. anthropics/claude-code issues spiked in April. Issue #49244 reported "Opus 4.6 obvious quality drop from April 15." Issue #49585 traced cache hit rate falling from 99.8% to near-zero on the smoosh pipeline. AMD AI senior director Stella Laurenzo's analysis on issue #42796 showed 6,852 sessions with 67% drop in thinking depth and a monthly bill spike from $345 to $42,121.
Users weren't imagining it. The instrumentation said so.
Change 1: Default Thinking Effort Dropped from High to Medium

The simplest of the three. Anthropic dropped the default thinking effort for Claude Code from high to medium on Sonnet 4.6 and Opus 4.6.
The motivation was UX-friendly: in high mode, Claude could think for a full minute before responding. Some users perceived the silent UI as "frozen" and abandoned the session. Lower effort = faster perceived response = better UX, in theory.
Before (Mar 4): default_thinking_effort = "high"
After (Mar 4): default_thinking_effort = "medium"
User reaction was the opposite of intended. The community pushed back hard: "Intelligence dropped." The underlying preference revealed was clear:
Users would rather wait 60 seconds for an accurate answer than get a fast wrong one.
April 7: Anthropic reverted. The default went back to high for the 4.6 family; Opus 4.7 now defaults to xhigh. Lower effort is opt-in for trivial tasks.
Change 2: The clear_thinking Cache Bug

This is the most technically interesting failure. A caching optimization was deployed to clear stale thinking sections after 1+ hour idle sessions, using header clear_thinking_20251015 with keep:1 (clear once at threshold).
The implementation diverged from intent. Instead of clearing once when the threshold was crossed, the code cleared on every turn after crossing. Claude was perpetually amnesic about its own reasoning trace.
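The divergence is easy to reproduce in miniature. A sketch (all names and the session model are hypothetical) contrasting the intended keep:1 semantics, clear once when the idle threshold is first crossed, with the shipped behavior of clearing on every turn thereafter:

```python
# Hypothetical simulation of the keep:1 bug.
# intended: clear stale thinking exactly once when the idle threshold is crossed
# shipped:  clear on every turn after the threshold has been crossed

IDLE_THRESHOLD_S = 3600  # 1+ hour idle, per the postmortem

def run_session(idle_gaps: list[int], buggy: bool) -> int:
    """Return how many times the thinking trace gets cleared."""
    clears = 0
    crossed = False
    for gap in idle_gaps:
        if gap >= IDLE_THRESHOLD_S:
            crossed = True
        if crossed:
            if buggy:
                clears += 1          # shipped: clears every subsequent turn
            elif clears == 0:
                clears += 1          # intended keep:1: clear exactly once
    return clears

# One long idle gap, then four normal turns:
gaps = [10, 4000, 5, 5, 5, 5]
assert run_session(gaps, buggy=False) == 1   # intended: one clear
assert run_session(gaps, buggy=True) == 5    # shipped: amnesia on every turn
```

Note how the two behaviors are indistinguishable on the very first post-threshold turn, which is exactly the kind of case a quick test would cover; the bug only shows up in multi-turn sessions.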
Symptoms aligned exactly with what users reported:
Forgetting decisions made earlier in the session
Repeating already-tried approaches
Selecting wrong tools without context
Cache miss avalanche → faster usage limit consumption
The economic impact was double: every request rebuilt the full context (more tokens), and the cache layer that should've absorbed cost was broken. Pro/Max 5x/Max 20x users hit limits on routine workloads.
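Back-of-envelope arithmetic shows why a broken cache layer compounds. The numbers below are assumptions for the sketch (arbitrary cost units, and an assumed ~10x gap between fresh and cached input pricing), not Anthropic's actual rates:

```python
# Illustrative cost model; the 10x fresh-vs-cached price gap and the
# token counts are assumptions, not Anthropic's actual pricing.

FRESH_INPUT_PRICE = 10   # cost units per 1K uncached input tokens (assumed)
CACHE_READ_PRICE = 1     # cost units per 1K cached input tokens (assumed)

def request_cost(context_k_tokens: int, cache_hit: bool) -> int:
    price = CACHE_READ_PRICE if cache_hit else FRESH_INPUT_PRICE
    return context_k_tokens * price

# A 100K-token session context over 50 turns:
healthy = sum(request_cost(100, cache_hit=True) for _ in range(50))
broken = sum(request_cost(100, cache_hit=False) for _ in range(50))
assert healthy == 5_000
assert broken == 50_000  # every request pays full freight
```

Under these assumptions a session that should ride the cache costs 10x more once every turn misses, before counting the extra tokens spent re-deriving lost reasoning.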
This is where it gets uncomfortable. The bug passed:
Multiple human reviews
Automated code review
Unit tests
E2E tests
Internal dogfooding
Yet it shipped to prod and survived for ~2 weeks. Anthropic's post-incident debugging found two unrelated experiments that masked the bug in CLI sessions:
Server-side message queueing experiment
Thinking display change experiment
Anthropic staff dogfooding Claude Code happened to be in those experiment cohorts, where the symptoms manifested differently. Outside-cohort users hit the bug clean.
After the fact, Anthropic engineers ran the same code review with Opus 4.7. The result:
Opus 4.7 found the bug.
Opus 4.6 didn't.
A next-gen model identified a blind spot in the prior gen's reasoning. There's a generalizable habit hidden here for any team using LLMs to review code: periodically re-audit current-gen outputs with the next-gen model. It costs little and surfaces issues your existing review chain misses.
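One way to make that habit concrete is to keep reviews behind a pluggable interface so a newer model can replay the exact same review set. A sketch with stubbed reviewers (the stub functions and their findings are placeholders standing in for real model calls, not actual API usage):

```python
# Hypothetical re-audit loop: replay past code reviews with a newer model
# and diff the findings. The review functions are stubs, not real models.

from typing import Callable

Review = Callable[[str], set[str]]

def reaudit(diffs: list[str], old_review: Review, new_review: Review) -> dict[str, set[str]]:
    """Return findings the newer model raises that the older one missed."""
    return {d: new_review(d) - old_review(d) for d in diffs}

# Stubs standing in for Opus 4.6 / 4.7 (illustrative behavior only):
def opus_46_stub(diff: str) -> set[str]:
    return {"style-nit"}

def opus_47_stub(diff: str) -> set[str]:
    findings = {"style-nit"}
    if "clear_thinking" in diff:
        findings.add("clears-every-turn-not-once")
    return findings

gaps = reaudit(["clear_thinking patch"], opus_46_stub, opus_47_stub)
assert gaps["clear_thinking patch"] == {"clears-every-turn-not-once"}
```

Because the reviewers are plain functions, swapping in a real model call later only changes the stubs, not the audit loop.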
The bug was fixed in v2.1.101 on April 10.
Change 3: Two Lines in the System Prompt

The last change is the most counterintuitive. Anthropic added two lines to the Claude Code system prompt:
Length limits:
- Keep text between tool calls to ≤25 words.
- Keep final responses to ≤100 words unless the task requires more detail.
Why? Opus 4.7 launched verbose: smart, but with a high output-token cost. The two-line nudge was a lightweight way to control token spend.
The change went through multi-week internal testing. The standard eval suite showed no regression. So it shipped.
Ablation done after the incident — removing each line of the system prompt and measuring impact — found a 3% drop on one specific eval that the standard suite never tested. The verbosity constraints were quietly degrading model performance on that dimension.
April 20: reverted.
1. One Variable at a Time

Anthropic's biggest mistake was deploying three changes in a tight window. Each had its own justification. The aggregate effect was diffuse, hard-to-pinpoint user experience degradation.
For solo devs: same week price change + UX redesign + new feature = no causal attribution when something breaks. One variable at a time is the foundation of debuggability. Sequence your rollouts even when it slows you down.
2. Eval-Passing Is Necessary, Not Sufficient

If Anthropic's full validation chain (multi-human review, auto code review, unit tests, E2E tests, dogfooding) can ship a bug, your eval setup has gaps too. The defense isn't "more eval" — it's periodic eval validation:
Run ablation tests on changes (one variable removed at a time)
Back-test current behavior with a different model generation
Compare eval predictions to real user behavior data
Treat eval-passing as a necessary signal, not sufficient
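The first item, ablation, can be mechanized against prompts. A sketch with a stubbed eval (the scoring function is a placeholder for a real eval suite; the prompt lines and scores are illustrative):

```python
# Hypothetical prompt ablation: drop one line at a time, re-score, and
# report lines whose removal changes the eval. eval_fn is a stub.

def ablate_prompt(lines: list[str], eval_fn) -> dict[str, float]:
    """Map each prompt line to the score delta caused by removing it."""
    baseline = eval_fn(lines)
    deltas = {}
    for i, line in enumerate(lines):
        ablated = lines[:i] + lines[i + 1:]
        deltas[line] = eval_fn(ablated) - baseline
    return deltas

# Stub eval that penalizes the verbosity cap (illustrative, not a real eval):
def stub_eval(lines: list[str]) -> float:
    return 0.90 if any("100 words" in line for line in lines) else 0.93

prompt = ["Be helpful.", "Keep final responses to <=100 words."]
deltas = ablate_prompt(prompt, stub_eval)
# Removing the length cap raises the score; removing the other line doesn't.
assert deltas["Keep final responses to <=100 words."] > 0
assert deltas["Be helpful."] == 0
```

This is exactly the shape of the post-incident analysis described above: the per-line delta surfaces a regression the aggregate suite score hid.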
3. System Prompts Are Production Code
Two lines in a system prompt caused 3% eval regression. System prompts deserve:
Version control (commit + diff each change)
Ablation testing (remove each line, measure impact)
Soak periods (gradual rollout, monitor for N days)
Regression suites that grow with each incident
For solo devs: at minimum, version your prompts in git. A/B test changes before full deploy. Don't change a working prompt without instrumentation in place.
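Minimum viable prompt instrumentation: tag every request log with a fingerprint of the system prompt that actually shipped, so behavior changes are attributable to prompt diffs after the fact. A sketch (the hashing scheme and log shape are assumptions, not a standard):

```python
# Hypothetical prompt fingerprinting: log a short hash of the exact
# system prompt alongside each request for later attribution.

import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Short, stable identifier for an exact prompt revision."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

def log_request(prompt: str, user_msg: str) -> dict:
    return {"prompt_sha": prompt_fingerprint(prompt), "user_msg": user_msg}

v1 = "You are a coding assistant."
v2 = v1 + "\nKeep final responses to <=100 words."

a = log_request(v1, "refactor this")
b = log_request(v2, "refactor this")
assert a["prompt_sha"] != b["prompt_sha"]  # the two-line edit is visible
assert len(a["prompt_sha"]) == 12
```

With fingerprints in the logs, "quality dropped around April 15" becomes a query over prompt revisions instead of guesswork.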
4. When Users Push Back on Defaults, Change the Defaults

Anthropic reverted the default effort drop within 5 weeks because users explicitly said the new default was wrong. The opposite of sycophancy: when users push back on defaults, change the defaults — not the user expectations.
For solo devs: if you find yourself explaining away user complaints ("they don't understand the new feature"), you're probably wrong about the default. The user is your eval set for things your tests don't measure.
5. Re-Audit With the Next-Gen Model

Opus 4.7 catching the 4.6 blind spot is genuinely useful. Anthropic released a stronger model that retrospectively audited the prior gen's reasoning. You can use this loop too: when a new model drops, re-run your existing code reviews, docs, and prompts through it. The cost-benefit is excellent for finding latent issues.
The postmortem ends with concrete commitments:
Internal staff use the exact same public Claude Code build (not internal test builds)
Stricter gates on system prompt changes: per-model ablation, audit tools, ongoing review
CLAUDE.md gains per-model change guidance (target specific models explicitly)
Soak periods + broad eval sets + gradual rollouts for any intelligence trade-off changes
New @ClaudeDevs X account explaining product decision context
Centralized GitHub threads mirroring the same updates
All subscriber usage limits reset on April 23 (real-cost compensation)
This is the bar for AI lab incident response: transparent root cause, named individuals, technical depth, concrete commitments, and material remediation.
The deeper lesson isn't "Anthropic made mistakes." It's that AI system incidents are harder to debug than traditional software incidents because:
Eval coverage is inherently incomplete
Single prompt lines can have measurable model-wide impact
Models are non-deterministic, making reproduction hard
Caching, routing, and prompt layers interact in complex ways
So incident governance has to evolve. Ablation, back-testing, gradual rollouts, user feedback channels, and eval system validation itself all need to be standard procedure. This is true at Anthropic's scale and at solo developer scale — arguably more important at the smaller scale because resources are tighter.
Next time Claude or another AI tool feels off, don't immediately blame your prompt. Check the official channel, GitHub issues, community feedback first. Systemic changes have systemic effects. You're probably not the only one.
Source: The April 23 postmortem - Anthropic Engineering
Related:
GitHub anthropics/claude-code #49244 — User report of Opus 4.6 quality drop from Apr 15
GitHub anthropics/claude-code #49585 — Smoosh pipeline cache hit rate analysis
GitHub anthropics/claude-code #42796 — Stella Laurenzo (AMD) cost spike investigation