when your agent fails, does it just... stop?
I created this post as an entry to the Gemini Live Agent Challenge. But this particular problem — what happens when an AI action fails — is something every agent builder needs to solve.
Most desktop automation tools have a dirty secret: they're fragile. Click the wrong pixel, target an element that moved, or encounter an unexpected dialog — and the whole sequence collapses. The user sees "Error" and reaches for the keyboard.
VibeCat's self-healing engine was built because we got tired of watching our cat give up.
After running hundreds of test sequences across three apps (Antigravity IDE, Terminal, Chrome), we cataloged the failure modes:
* AX target not found — The Accessibility API says the element doesn't exist. Usually because the app hasn't finished rendering, or because the element is inside a canvas/WebGL surface. Frequency: ~15% of first attempts on Chrome.
* AX target found but wrong — The element exists but it's the wrong one. A "Play" button that's actually in a different panel, or a text field that looks right but belongs to a different component. Frequency: ~5%.
* Click landed but nothing happened — The coordinates were correct, the click fired, but the UI didn't respond. Common with YouTube's debounced event handlers. Frequency: ~10% on YouTube Music.
* Action succeeded but verification failed — VibeCat typed the text and it appeared, but the post-action screenshot shows an error dialog or unexpected state. Frequency: ~3%.
The self-healing engine is deliberately simple. No complex state machines, no machine learning. Just two rules:
* Max 2 retries per step. If it fails three times, stop and tell the user.
* Each retry uses a different grounding source. Don't repeat what already failed.
Attempt 1: AX targeting
→ Failed: element not in AX tree
Attempt 2: CDP targeting (chromedp)
→ Failed: Chrome DevTools can't find matching DOM node
Attempt 3: Vision coordinates (Gemini screenshot analysis)
→ Success: clicked at (847, 423), verification passed
The grounding source priority chain is AX → CDP → Vision. But the engine is smart enough to skip sources that don't apply — if you're in Terminal (no browser), CDP is skipped entirely.
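That skip might look something like the sketch below; applicableSources and hasBrowserTarget are hypothetical names for illustration, not code from the post:

func applicableSources(step *Step) []GroundingSource {
    // AX targeting is attempted first in every context.
    sources := []GroundingSource{AX}
    // CDP only applies when the step targets a browser surface.
    if hasBrowserTarget(step) {
        sources = append(sources, CDP)
    }
    // Vision works on anything visible on screen, so it is always the last resort.
    return append(sources, Vision)
}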
Here's the core logic in handler.go:
func (h *Handler) executeWithHealing(ctx context.Context, step *Step) error {
    // Grounding priority chain: AX → CDP → Vision.
    // (Filtering out inapplicable sources, e.g. CDP outside a browser, is elided here.)
    sources := []GroundingSource{AX, CDP, Vision}
    for attempt := 0; attempt <= maxRetries; attempt++ {
        // Each attempt advances to the next source, clamping to the last one.
        source := sources[min(attempt, len(sources)-1)]
        err := h.executeStep(ctx, step, source)
        if err == nil {
            // Execution alone isn't success: the post-action screenshot
            // must semantically confirm the step.
            verified, verifyErr := h.verifyStep(ctx, step)
            if verifyErr == nil && verified {
                return nil
            }
        }
        // Only announce a retry when one is actually coming.
        if attempt < maxRetries {
            h.emitProcessingState("retrying_step", step, attempt+1)
            slog.Info("self-healing retry",
                "step", step.ID,
                "attempt", attempt+1,
                "failed_source", source,
                "next_source", sources[min(attempt+1, len(sources)-1)])
        }
    }
    return fmt.Errorf("step %s failed after %d attempts", step.ID, maxRetries+1)
}
Every action — whether it's typing text, clicking a button, or opening a URL — ends with a verification step. VibeCat captures a fresh screenshot and sends it to the ADK Orchestrator with a specific question: "Did the action succeed?"
This isn't just "did the click register?" It's semantic verification:
* After typing "go vet ./..." in Terminal → verify the command output shows "no issues"
* After clicking Play on YouTube Music → verify the video element is no longer paused
* After opening a URL → verify the expected page content is visible
The ADK Orchestrator uses Gemini's vision model for this analysis. It returns a confidence score and a natural-language explanation. If confidence is below the threshold, the step is marked as failed and healing kicks in.
verification: {
    "success": false,
    "confidence": 0.3,
    "explanation": "The play button appears unchanged. The video progress bar has not moved."
}
→ trigger retry with CDP grounding
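On the gateway side, consuming that reply can be a few lines. In this minimal sketch the struct fields mirror the payload above, but the type name and the 0.7 cutoff are assumptions; the post doesn't state the actual threshold:

type Verification struct {
    Success     bool    `json:"success"`
    Confidence  float64 `json:"confidence"`
    Explanation string  `json:"explanation"`
}

// Assumed cutoff; the post only says "below the threshold".
const confidenceThreshold = 0.7

func (v Verification) passed() bool {
    // A step counts as done only if the model reports success
    // and is confident enough in its own judgment.
    return v.Success && v.Confidence >= confidenceThreshold
}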
One subtle failure mode we discovered: Gemini sometimes issues multiple function calls in rapid succession. "Focus Terminal, then type go vet ./..., then press Enter." If these execute in parallel, go vet might get typed into the wrong window because focus_app hasn't completed yet.
The pendingFC mechanism solves this with strict sequential execution:
1. Gemini sends FC calls → queued in pendingFC
2. Gateway sends step 1 to client
3. Client executes, captures verification screenshot
4. Gateway confirms step 1 → sends step 2
5. Repeat until queue is empty
No step starts until the previous step's verification passes. This adds latency (~200ms per step for verification) but eliminates an entire class of race condition bugs.
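Here's a minimal sketch of that sequential drain; only the pendingFC name comes from the post, while Gateway, FunctionCall, and executeAndVerify are assumed for illustration:

type Gateway struct {
    pendingFC []FunctionCall // FIFO queue of Gemini function calls
}

func (g *Gateway) drainPending(ctx context.Context) error {
    for len(g.pendingFC) > 0 {
        fc := g.pendingFC[0]
        // Send the step to the client, wait for its verification
        // screenshot, and confirm it before dispatching anything else.
        if err := g.executeAndVerify(ctx, fc); err != nil {
            return err // self-healing retries happen inside executeAndVerify
        }
        g.pendingFC = g.pendingFC[1:]
    }
    return nil
}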
The most impactful design decision wasn't technical — it was UX. VibeCat narrates every step through the overlay panel:
🔍 Reading screen...
📋 Planning 3 steps
▶️ Step 1/3: Focusing Terminal [AX]
⚠️ Retrying Step 1 — switching to CDP
✅ Step 1/3: Terminal focused
▶️ Step 2/3: Typing command...
Users who watched VibeCat fail silently reported it as "broken." Users who watched the same failure with narration reported it as "working through a problem." Same outcome, completely different perception.
The seven processing stages (analyzing_command, planning_steps, executing_step, verifying_result, retrying_step, completing, observing_screen) each have localized labels in English, Korean, and Japanese. The overlay shows a grounding source badge (AX / Vision / Hotkey / System) so you always know how VibeCat is interacting with your screen.
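The stage-to-label mapping can stay a plain lookup table. In this sketch the stage keys come from the post, while the map layout and the label strings are illustrative:

var stageLabels = map[string]map[string]string{
    "retrying_step": {
        "en": "Retrying step...",
        "ko": "단계 재시도 중...",
        "ja": "ステップを再試行中...",
    },
    // analyzing_command, planning_steps, executing_step, verifying_result,
    // completing, and observing_screen follow the same pattern.
}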
After implementing self-healing, our end-to-end success rates across 50 test runs:
Scenario                    Without healing    With healing
YouTube Music play          62%                94%
Code comment enhancement    88%                100%
Terminal go vet             91%                100%
The remaining 6% failure on YouTube Music is almost entirely due to network latency — the page hasn't finished loading when VibeCat tries to click. A simple "wait for page ready" check would probably push it to 98%+.
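Such a check could be a few lines with chromedp, which the CDP grounding layer already uses. This sketch polls document.readyState; the 10-second timeout and 100ms poll interval are arbitrary choices:

func waitForPageReady(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    for {
        var state string
        // readyState is "complete" once the page has finished loading.
        err := chromedp.Run(ctx, chromedp.Evaluate(`document.readyState`, &state))
        if err != nil {
            return err
        }
        if state == "complete" {
            return nil
        }
        time.Sleep(100 * time.Millisecond)
    }
}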
Self-healing isn't about being clever. It's about being systematic. Catalog your failures, build a fallback chain, verify every step, and tell the user what's happening. The hard part isn't the retry logic — it's the verification. Without reliable post-action verification, you're just clicking blindly and hoping.
And narrate everything. Always narrate everything. Silent AI feels broken. Transparent AI feels collaborative.