You Can't Fix What You Can't See: The AI Agent Observability Crisis
Most agent deployments track uptime. That's not enough. Here's what production-grade agent observability actually looks like — and the tools that get you there.
Something happened to a production agent pipeline last month that I keep thinking about. The system had been running for three weeks. Error rate: near zero. Latency: nominal. Uptime dashboard: green. Then a user noticed the agent had been recommending the wrong API version in every response since day two. Three weeks of confidently wrong answers, undetected, because every answer was syntactically correct, well-formatted, and returned in under two seconds.
This is the AI agent observability problem in its purest form: your agent can be failing catastrophically while every traditional monitoring metric looks fine.
We've spent this week examining the structural problems in AI agent deployments — memory architectures that silently degrade, multi-agent systems that perform worse at scale. The thread running through both: you can't diagnose these problems without seeing them. And right now, most teams are flying blind.
AI agents fail differently than traditional software — they produce outputs that are structurally valid but semantically wrong, and conventional monitoring has no way to detect this. A crashed service returns a 500 error. An agent that gives subtly incorrect advice returns a 200 with a JSON payload that passes schema validation.
Traditional observability tracks three signals: logs (what happened), metrics (how often and how fast), and traces (how the execution flowed). These are necessary but not sufficient for agents. An agent can hit zero tool errors, complete all steps, and still be useless because it misunderstood the user's intent in step one and confidently propagated that error through nine subsequent steps.
The deeper problem is non-determinism. Unit tests work because the same input always produces the same output. Agents are stochastic — the same prompt can yield meaningfully different reasoning paths. You can't test your way to confidence; you have to observe your way there. This is a fundamentally different discipline, and most engineering teams haven't built the muscle for it yet.
There's also the multi-step failure cascade. A traditional API call either succeeds or fails. An agent workflow might make 12 tool calls, synthesize 4 retrieved documents, and produce 3 intermediate outputs before reaching a conclusion. The final answer might be wrong because step three retrieved the wrong document — but by the time you see the wrong answer, the trace is buried under nine subsequent operations. Pinpointing root cause requires the kind of span-level visibility that most observability tools weren't built to provide.
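To make the cascade concrete, here is a minimal, dependency-free sketch of span-level tracing — the `Span`/`Trace` names are illustrative, not any particular library's API. Each step records a span with a parent, so a wrong final answer can be walked back to the step that introduced the bad data rather than the step that formatted it:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step in an agent workflow: a tool call, retrieval, or synthesis."""
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    inputs: dict = field(default_factory=dict)
    output: object = None
    started_at: float = field(default_factory=time.time)


class Trace:
    """Collects spans in execution order so a bad final answer can be
    traced to the earliest step that went wrong."""

    def __init__(self):
        self.spans = []

    def record(self, name, parent_id=None, **inputs):
        span = Span(name=name, parent_id=parent_id, inputs=inputs)
        self.spans.append(span)
        return span

    def root_cause(self, looks_wrong):
        """Earliest span whose output fails a caller-supplied check."""
        for span in self.spans:
            if looks_wrong(span):
                return span
        return None


# A workflow where the retrieval step fetches a stale document; every
# later span is "correct" given its (bad) input.
trace = Trace()
plan = trace.record("plan", query="how do I authenticate against v2?")
search = trace.record("search_docs", plan.span_id, q="auth")
search.output = {"doc": "auth-v1.md"}            # wrong document, HTTP 200
synth = trace.record("synthesize", search.span_id, doc="auth-v1.md")
synth.output = "Use the v1 token endpoint."      # wrong, but well-formed
final = trace.record("format_answer", synth.span_id)
final.output = "Use the v1 token endpoint."

# Walking the trace surfaces the retrieval, not the final formatting step.
bad = trace.root_cause(lambda s: isinstance(s.output, dict)
                       and "v1" in s.output.get("doc", ""))
print(bad.name)  # search_docs
```

Production tracing systems add timing, token counts, and nesting depth on top of this, but the core idea is the same: without per-step records, only the last span is visible when the answer comes out wrong.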
Agent failures cluster into four distinct categories, each requiring a different detection strategy. Understanding these is the prerequisite to building an observability stack that catches them.
1. Semantic drift — The agent's outputs are technically correct but gradually shift away from the intended behavior over time. This happens most often when the agent has persistent memory and the memory state diverges from reality. A customer support agent trained in January might start reflecting January's product pricing in March.
2. Tool reliability failures — The agent calls external tools correctly but the tools return stale, incorrect, or incomplete data. The agent has no way to know the tool lied to it, so it confidently propagates the bad data downstream. Tool call accuracy — measuring whether tool calls return expected data quality, not just HTTP 200 — is one of the most underinstrumented metrics in agent deployments.
3. Context window saturation — As agent sessions grow longer, the context window fills and earlier content gets dropped or deprioritized. The agent effectively "forgets" critical constraints stated early in the conversation. This manifests as answers that contradict the user's original requirements — which the agent literally no longer has access to.
4. Silent task incompletion — The agent returns a response without completing all required steps. It may have hit a tool error, decided to skip a step, or terminated early — but it formats its partial output as a complete answer. Without step-level tracing, you'll never know which tasks finished and which didn't.
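Category 2's "tool call accuracy" can be made concrete: a tool result can return HTTP 200 and still fail a data-quality check. A minimal sketch of such a check — the `price`/`updated_at` field names and the one-day freshness threshold are hypothetical, not from any specific tool:

```python
from datetime import datetime, timedelta, timezone


def check_tool_result(result, max_age=timedelta(days=1),
                      required_fields=("price", "updated_at")):
    """Return a list of quality problems; an empty list means usable.

    Checks completeness (required fields present) and freshness (an
    `updated_at` timestamp within `max_age`) — not just transport success.
    """
    problems = []
    for name in required_fields:
        if name not in result:
            problems.append(f"missing field: {name}")
    if "updated_at" in result:
        age = datetime.now(timezone.utc) - result["updated_at"]
        if age > max_age:
            problems.append(f"stale data: {age.days} day(s) old")
    return problems


fresh = {"price": 19.99, "updated_at": datetime.now(timezone.utc)}
stale = {"price": 12.50,
         "updated_at": datetime.now(timezone.utc) - timedelta(days=45)}

print(check_tool_result(fresh))  # []
print(check_tool_result(stale))  # ['stale data: 45 day(s) old']
```

Gating every tool response through a check like this is what turns "the call succeeded" into "the call returned data the agent can safely reason over."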
Of these four, semantic drift and silent task incompletion are the most dangerous precisely because they're invisible to traditional monitoring. Latency spikes are obvious. Confident partial answers look like full answers.
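Silent incompletion, at least, is mechanically detectable once steps are logged: compare what the task required against what the trace actually shows. A sketch with illustrative step names (the `(step, status)` event shape is an assumption, not a specific runtime's format):

```python
def completion_report(required_steps, trace_events):
    """Compare the steps a task requires against the logged trace.

    `trace_events` is a list of (step_name, status) tuples as recorded
    by the agent runtime; only steps with status "ok" count as done.
    """
    finished = {name for name, status in trace_events if status == "ok"}
    missing = [step for step in required_steps if step not in finished]
    return {"complete": not missing, "missing_steps": missing}


required = ["fetch_ticket", "lookup_account", "draft_reply", "cite_policy"]
events = [("fetch_ticket", "ok"), ("lookup_account", "ok"),
          ("draft_reply", "ok")]  # cite_policy never ran; answer shipped anyway

report = completion_report(required, events)
print(report)  # {'complete': False, 'missing_steps': ['cite_policy']}
```

The hard part in practice is not this comparison — it is defining `required_steps` per task type at all, which most teams skip.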
The agent observability tooling landscape in 2026 has matured significantly, but no single platform covers all four failure categories equally well. Here's how the major platforms compare across the dimensions that matter most in production:
| Platform | Multi-step Tracing | Semantic Evaluation | Tool Call Monitoring | Open Source | Best For |
| --- | --- | --- | --- | --- | --- |
| LangSmith | Excellent | Good | Good | No | LangChain-based stacks |
| Arize Phoenix | Excellent | Good | Excellent | Yes | Framework-agnostic, OTel-native |
| Galileo | Good | Excellent | Good | No | Semantic quality at scale |
| Langfuse | Excellent | Good | Good | Yes (self-host) | Cost-conscious teams |
| Helicone | Basic | Basic | Good | Partial | Quick setup, cost tracking |
| Braintrust | Good | Excellent | Good | No | Evaluation-first teams |
A few observations from working with these in practice:
LangSmith remains the default for LangChain users because the integration is automatic — it understands LangChain's internals and requires almost no setup overhead. The tradeoff is lock-in: if you're not using LangChain, the integration story gets complicated. Pricing starts at $0 for the developer tier and $39/seat for the Plus plan.
Arize Phoenix is the standout open-source option. It uses OpenTelemetry-based tracing via the OpenInference standard, which means it works across virtually any framework. If you're running a multi-framework stack or want to avoid vendor lock-in, Phoenix is the right default. The span-level tracing for tool calls is excellent.
Galileo takes a different approach: instead of logging and letting you analyze manually, it evaluates agent outputs using lightweight models that run on live traffic. The key claim is low latency and low cost for real-time quality evaluation. The tradeoff is opacity — you're trusting Galileo's evaluation models, which adds another AI system to debug.
Helicone is a gateway, not a full observability platform. You route API calls through it (a simple base URL change), and it logs everything immediately. For pure cost tracking and basic request monitoring, nothing is faster to set up. For agent-specific concerns — semantic quality, step-level traces — you'll need to layer something on top.
The honest answer is that most production teams end up combining two tools: a tracing platform (Phoenix, LangSmith, or Langfuse for the execution graph) and an evaluation layer (Galileo or Braintrust for semantic quality). No single tool does both equally well yet.
You can't instrument everything on day one. If you're starting from zero visibility, here's the instrumentation priority order:
1. Span-level traces for every tool call — This is the minimum. Log every external call your agent makes, what it sent, what it received, and how long it took. This alone catches tool reliability failures and gives you the data to debug everything else.
2. Task completion rate — Define what "done" looks like for your agent's tasks and track whether it actually reaches that state. If your rate is below 95%, you have a silent failure problem worth investigating before anything else.
3. Token budget per session — Track cumulative token usage across multi-turn sessions. Set an alert threshold at ~70% of your context window. When sessions habitually approach the limit, you're at risk of context saturation failures on the most complex (and often most important) queries.
4. Output evaluation on a sample — You don't need to evaluate 100% of outputs, but you need to evaluate some. Start with 5–10% of production traffic run through an evaluation model. This catches semantic drift before it compounds.
5. Memory freshness for persistent agents — If your agent has memory that references external data (product info, user state, world knowledge), build a staleness metric. How old is the oldest piece of information your agent might recall? Anything over 7 days in fast-moving domains is a liability.
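Steps 3 and 4 above reduce to a few lines of instrumentation. A sketch assuming a 128k-token context window, the ~70% alert threshold, and a 5% evaluation sample — all constants and names here are illustrative, to be replaced with your model's actual limits:

```python
import random

CONTEXT_WINDOW = 128_000   # assumed model context window, in tokens
ALERT_THRESHOLD = 0.70     # alert at ~70% utilization (step 3)
EVAL_SAMPLE_RATE = 0.05    # route 5% of traffic to evaluation (step 4)


def check_token_budget(session_tokens):
    """True when a session has crossed the saturation alert threshold."""
    return session_tokens >= CONTEXT_WINDOW * ALERT_THRESHOLD


def should_evaluate(rng=random):
    """Decide per-response whether to send it to the evaluation model."""
    return rng.random() < EVAL_SAMPLE_RATE


print(check_token_budget(80_000))   # False: 62.5% of the window
print(check_token_budget(95_000))   # True: 74.2% — alert before saturation

# Over many responses, roughly 5% get sampled for semantic evaluation.
rng = random.Random(0)
sampled = sum(should_evaluate(rng) for _ in range(10_000))
print(sampled)
```

Token counting itself would come from your model provider's usage metadata accumulated across turns; the point of the sketch is that the alert and the sampling decision are trivial once that number is tracked per session.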
The sequence matters. Tracing first — you need the data before you can evaluate it. Evaluation second — once you can see what's happening, you can measure whether it's correct.
Agent failures are structurally invisible to traditional monitoring. Uptime, latency, and error rate metrics can all be green while your agent produces consistently wrong outputs. You need a different observability stack.
There are four distinct agent failure modes — semantic drift, tool reliability failures, context window saturation, and silent task incompletion — each requiring different detection strategies.
No single observability platform covers all failure modes equally. Most production teams combine a tracing tool (Phoenix, LangSmith, Langfuse) with a semantic evaluation layer (Galileo, Braintrust).
Task completion rate is the single most underinstrumented metric in agent deployments. Start there before optimizing for anything else.
5% production sampling for semantic evaluation is enough to catch drift without the cost overhead of evaluating everything.
The AI agent field has moved faster on deployment than on operations. We've gotten good at building agents and shipping them. We haven't gotten good at knowing whether they're actually working once they're out in the world.
The most dangerous period for any agent deployment isn't launch — it's week three. The initial excitement has passed, active monitoring attention has moved elsewhere, and the slow failures have had time to compound. By the time a user notices something is wrong, the damage is often weeks old.
The tooling exists. Phoenix, LangSmith, Galileo, Langfuse — none of these are hard to set up. The gap isn't technical. It's cultural: teams treat agent observability as something to add after the agent is "working," when it's actually a prerequisite for knowing if it's working at all.
Build the observability layer before you need it. You'll need it sooner than you think.
AI Agent Digest covers AI agent systems — frameworks, architectures, production patterns, and honest analysis. No hype, no favorites, just what works.