dev_to 2026年3月21日

LangGraph を使って本番環境で AI エージェントを構築する：おもちゃのサンプルをこえる

Building Production AI Agents with LangGraph: Beyond the Toy Examples

Translated: 2026/3/21 8:01:16

Japanese Translation

本番環境での AI エージェントの構築：おもちゃのサンプルをこえるあらゆる AI のチュートリアルは、質問に答えるチャットボットを示唆しますが、それはエージェントではありません。エージェントは意思決定を行い、アクションを取り、結果を観察し、適応します。本番環境では、この全てを信頼性高く実行し、監査証跡、エラー復旧、人間の監視を行います。LangGraph は LangChain から提供されるグラフベースのオーケストレーションレイヤーであり、現在、実際のエージェントを配信するチームの選択肢となっています。Uber はサポートワークフローでこれを、LinkedIn は内部知識エージェントで、Klarna は大規模に顧客向けエージェントを走らせています。この記事は、プロトタイプから本番環境へ移行する際に欲しいガイドです。アップタイムが重要となる際に考慮すべきすべてのパターンを網羅し、エンドエンドの「研究アシスタント」エージェントを構築します。エージェントのコードを 1 行も書く前に、以下の問いを自問してください：このタスクは動的な意思決定を必要とするか？エージェントを使うべきなのは以下の場合です： - 設計時点でステップ数が不明である場合 - 文脈に基づいて複数のツールから選択する必要がある場合 - 中間結果が実行経路を変更する場合 - 自律的なエラー復旧が必要な場合以下の場合はエージェントを使わない方が良いです： - 固定されたパイプライン（プロンプト → LLM →出力）で問題が解決する場合 - 事前にすべてのパスを網羅できる場合（単純なチェーンを使用すべき） - レイテンシ予算が 2 秒未満の場合（エージェントはループするが、ループは遅い） - 不正な自律的アクションのコストが高く、人間のチェックポイントを追加できない場合エージェントは複雑さを加える。よく設計されたチェーンが構造化された出力を持つ場合、悪く設計されたエージェントよりも常に優れる。まずは動く最もシンプルなアプローチから始め、壁にぶつかったらエージェントに昇格させるべきです。 LangGraph はエージェントのロジックを有向グラフとしてモデル化し、以下の構造を持ちます： - ステート（State）：グラフを流れ続ける型付けされた辞書 - ノード（Nodes）：ステートを読み書きする関数 - エッジ（Edges）：ノードを接続（静的または条件付き）条件付きエッジはステートを検出し、異なるノードへのルーティングを行います。以下の最小限のメンタルモデルを理解してください： from langgraph.graph import StateGraph, START, END from typing import TypedDict, Annotated from operator import add class AgentState(TypedDict): messages: Annotated[list, add] step_count: int def process(state: AgentState) -> dict: return {"messages": ["processed"], "step_count": state["step_count"] + 1} def should_continue(state: AgentState) -> str: return "end" if state["step_count"] >= 3 else "process" graph = StateGraph(AgentState) graph.add_node("process", process) graph.add_conditional_edges(START, should_continue, {"process": "process", "end": END}) graph.add_conditional_edges("process", should_continue, {"process": "process", "end": END}) app = graph.compile() result = app.invoke({"messages": [], "step_count": 0}) Annotated[list, add] は極めて重要です——これは LangGraph にリストの返却をマージさせるように指示します。これなしには、各ノードが以前のメッセージを上書きすることになります。実際に構築すべきものです：研究質問を受け取り、ウェブを検索し、関連するページを読み取って要約し、構造化されたレポートを生成するエージェント。これは企業が実際に導入するエージェントです。 from typing import TypedDict, Annotated, Literal from operator import add from pydantic import BaseModel class Source(BaseModel): url: str title: str summary: str relevance_score: float class ResearchState(TypedDict): question: str search_queries: list[str] sources: Annotated[list[Source], add] draft_report: str critique: str final_report: str iteration: int status: str Source モデルは Pydantic を使用——これは状態データベースへ永続化を行う際に重要で、検証とシリアル化を無料で提供します。 from langchain_openai import ChatOpenAI from langchain_core.messages import SystemMessage, HumanMessage from langchain_community.tools.tavily_search import TavilySearchResults llm = ChatOpenAI(model="gpt-4o", temperature=0) search_tool = TavilySearchResults(max_results=5) async def generate_queries(state: ResearchState) -> dict: """研究質問をターゲッティング検索クエリに変換します。""" response = await llm.ainvoke([Sys

Original Content

Building Production AI Agents with LangGraph: Beyond the Toy Examples Every AI tutorial shows you a chatbot that answers questions. That's not an agent. An agent decides what to do, takes action, observes the result, and adapts. In production, it does all of that reliably, with audit trails, error recovery, and human oversight. LangGraph — the graph-based orchestration layer from LangChain — has quietly become the framework of choice for teams shipping real agents. Uber routes support workflows through it. LinkedIn uses it for internal knowledge agents. Klarna runs customer-facing agents on it at scale. This article is the guide I wish I had when I moved from prototype to production. We'll build a Research Assistant agent end-to-end, covering every pattern that matters when uptime counts. Before writing a single line of agent code, ask yourself: does this task require dynamic decision-making? Use agents when: The number of steps is unknown at design time The task requires selecting from multiple tools based on context Intermediate results change the execution path You need autonomous error recovery Don't use agents when: A fixed pipeline (prompt → LLM → output) solves the problem You can enumerate all paths in advance (use a simple chain) Latency budget is under 2 seconds (agents loop; loops are slow) The cost of a wrong autonomous action is high and you can't add human checkpoints Agents add complexity. A well-designed chain with structured outputs will outperform a poorly-designed agent every time. Start with the simplest approach that works, then graduate to agents when you hit the wall. LangGraph models agent logic as a directed graph where: State is a typed dictionary that flows through the graph Nodes are functions that read and write state Edges connect nodes (static or conditional) Conditional edges inspect state and route to different nodes Here's the minimal mental model: from langgraph.graph import StateGraph, START, END from typing import TypedDict, Annotated from operator import add class AgentState(TypedDict): messages: Annotated[list, add] # append-only message list step_count: int def process(state: AgentState) -> dict: return {"messages": ["processed"], "step_count": state["step_count"] + 1} def should_continue(state: AgentState) -> str: return "end" if state["step_count"] >= 3 else "process" graph = StateGraph(AgentState) graph.add_node("process", process) graph.add_conditional_edges(START, should_continue, {"process": "process", "end": END}) graph.add_conditional_edges("process", should_continue, {"process": "process", "end": END}) app = graph.compile() result = app.invoke({"messages": [], "step_count": 0}) The Annotated[list, add] is critical — it tells LangGraph to merge list returns instead of overwriting. Without it, each node would clobber the previous messages. Let's build something real: an agent that takes a research question, searches the web, reads and summarizes relevant pages, and produces a structured report. This is the kind of agent companies actually deploy. from typing import TypedDict, Annotated, Literal from operator import add from pydantic import BaseModel class Source(BaseModel): url: str title: str summary: str relevance_score: float class ResearchState(TypedDict): question: str search_queries: list[str] sources: Annotated[list[Source], add] draft_report: str critique: str final_report: str iteration: int status: str I'm using Pydantic models for Source — this gives you validation and serialization for free, which matters when you're persisting state to a database. from langchain_openai import ChatOpenAI from langchain_core.messages import SystemMessage, HumanMessage from langchain_community.tools.tavily_search import TavilySearchResults llm = ChatOpenAI(model="gpt-4o", temperature=0) search_tool = TavilySearchResults(max_results=5) async def generate_queries(state: ResearchState) -> dict: """Turn the research question into targeted search queries.""" response = await llm.ainvoke([ SystemMessage(content="Generate 3 specific search queries to research this topic. Return only the queries, one per line."), HumanMessage(content=state["question"]) ]) queries = [q.strip() for q in response.content.strip().split("\n") if q.strip()] return {"search_queries": queries, "status": "searching"} async def search_web(state: ResearchState) -> dict: """Execute searches and collect sources.""" all_sources = [] for query in state["search_queries"]: results = await search_tool.ainvoke({"query": query}) for r in results: source = Source( url=r["url"], title=r.get("title", ""), summary=r["content"][:500], relevance_score=0.0 # scored in next step ) all_sources.append(source) return {"sources": all_sources, "status": "analyzing"} async def write_report(state: ResearchState) -> dict: """Synthesize sources into a structured report.""" source_text = "\n\n".join( f"[{s.title}]({s.url})\n{s.summary}" for s in state["sources"] ) response = await llm.ainvoke([ SystemMessage(content="""Write a detailed research report based on these sources. Structure: Executive Summary, Key Findings (numbered), Analysis, Conclusion. Cite sources inline as [1], [2], etc."""), HumanMessage(content=f"Question: {state['question']}\n\nSources:\n{source_text}") ]) return {"draft_report": response.content, "status": "reviewing"} async def critique_report(state: ResearchState) -> dict: """Self-critique the draft for gaps and improvements.""" response = await llm.ainvoke([ SystemMessage(content="""Review this research report critically. Identify: 1. Factual gaps or unsupported claims 2. Missing perspectives 3. Areas needing more depth Be specific and actionable. If the report is solid, say "APPROVED"."""), HumanMessage(content=state["draft_report"]) ]) return { "critique": response.content, "iteration": state["iteration"] + 1, "status": "critiqued" } async def revise_report(state: ResearchState) -> dict: """Revise the report based on critique.""" response = await llm.ainvoke([ SystemMessage(content="Revise this report to address the critique. Maintain the same structure."), HumanMessage(content=f"Report:\n{state['draft_report']}\n\nCritique:\n{state['critique']}") ]) return {"draft_report": response.content, "status": "revised"} async def finalize(state: ResearchState) -> dict: return {"final_report": state["draft_report"], "status": "complete"} from langgraph.graph import StateGraph, START, END def route_after_critique(state: ResearchState) -> Literal["revise", "finalize"]: if "APPROVED" in state["critique"] or state["iteration"] >= 3: return "finalize" return "revise" builder = StateGraph(ResearchState) # Add nodes builder.add_node("generate_queries", generate_queries) builder.add_node("search_web", search_web) builder.add_node("write_report", write_report) builder.add_node("critique_report", critique_report) builder.add_node("revise_report", revise_report) builder.add_node("finalize", finalize) # Add edges builder.add_edge(START, "generate_queries") builder.add_edge("generate_queries", "search_web") builder.add_edge("search_web", "write_report") builder.add_edge("write_report", "critique_report") builder.add_conditional_edges("critique_report", route_after_critique) builder.add_edge("revise_report", "critique_report") # loop back builder.add_edge("finalize", END) research_agent = builder.compile() Run it: result = await research_agent.ainvoke({ "question": "What are the most effective strategies for reducing LLM hallucinations in production systems?", "search_queries": [], "sources": [], "draft_report": "", "critique": "", "final_report": "", "iteration": 0, "status": "starting" }) print(result["final_report"]) In production, agents crash. Servers restart. Users close browsers. You need checkpointing. LangGraph has built-in support for persisting state at every step via checkpointers: from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver DB_URI = "postgresql://user:pass@localhost:5432/agents" async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer: await checkpointer.setup() # creates tables on first run research_agent = builder.compile(checkpointer=checkpointer) # Every invocation now saves state after each node config = {"configurable": {"thread_id": "research-001"}} result = await research_agent.ainvoke(initial_state, config) If the process dies mid-execution, restart with the same thread_id and it picks up exactly where it left off: # Resume from last checkpoint result = await research_agent.ainvoke(None, config) Production tip: Use thread_id as your correlation ID across logging, tracing, and customer support. When a user reports a problem, you can replay the exact state transitions. For high-throughput systems, the Postgres checkpointer supports connection pooling. For simpler setups, SqliteSaver works fine. For serverless, use the MemorySaver during development but always switch to a durable store before deploying. Fully autonomous agents are a liability in production. The most reliable pattern is human-on-the-loop: the agent runs autonomously but pauses at critical decision points. LangGraph supports this natively with interrupt: from langgraph.types import interrupt, Command async def write_report(state: ResearchState) -> dict: # ... generate draft ... # Pause and wait for human approval approval = interrupt({ "question": "Review this draft report. Reply 'approved' or provide feedback.", "draft": draft_content }) if approval.lower() != "approved": # Human provided feedback — use it as critique return {"draft_report": draft_content, "critique": approval, "status": "human_feedback"} return {"draft_report": draft_content, "status": "approved"} On the calling side, you handle the interrupt: config = {"configurable": {"thread_id": "research-001"}} # First invocation runs until interrupt result = await research_agent.ainvoke(initial_state, config) # Agent is now paused. Show draft to user via your UI. # When user responds: result = await research_agent.ainvoke( Command(resume="approved"), # or resume="Add more detail about X" config ) This pattern maps cleanly to web UIs (show a review screen), Slack bots (send a message and wait for reply), or email workflows. Advanced pattern — tiered autonomy: def route_by_confidence(state: ResearchState) -> str: confidence = state.get("confidence_score", 0) if confidence > 0.9: return "auto_approve" # agent proceeds elif confidence > 0.7: return "notify_human" # agent proceeds but flags for review else: return "require_approval" # agent pauses This lets low-risk actions flow through while escalating uncertain ones — the sweet spot for production throughput. Tools are how agents interact with the real world. Get this wrong and you get agents that burn API credits, leak data, or take destructive actions. from langchain_core.tools import tool from pydantic import Field @tool def search_knowledge_base( query: str = Field(description="Natural language search query"), filters: dict | None = Field(default=None, description="Optional metadata filters: {department: str, date_range: str}"), max_results: int = Field(default=10, ge=1, le=50, description="Number of results to return") ) -> list[dict]: """Search the internal knowledge base for documents matching the query. Use this for company-specific information. For general web information, use web_search instead.""" # implementation ... Key practices: Rich descriptions matter more than you think. The LLM reads the docstring and field descriptions to decide when and how to call the tool. Vague descriptions lead to wrong tool selection. Constrain inputs. Use ge, le, enums, and Pydantic validators. An agent that can pass max_results=10000 will eventually do it. Separate read and write tools. Never have a single database_tool that can both query and delete. Give the agent db_query and db_delete separately, and only bind db_delete when you've added human approval. Tool result formatting. Return structured data, not free text. The LLM processes structured results more reliably: @tool def get_order_status(order_id: str) -> dict: """Look up the status of a customer order.""" order = db.get_order(order_id) return { "order_id": order.id, "status": order.status, "items_count": len(order.items), "estimated_delivery": order.eta.isoformat(), "action_available": ["cancel"] if order.status == "processing" else [] } Bind tools selectively per node. Not every node needs every tool: research_llm = llm.bind_tools([search_tool, scrape_tool]) writing_llm = llm.bind_tools([]) # no tools during writing Production agents face three categories of failures: Use LangGraph's built-in retry policy: from langgraph.pregel import RetryPolicy builder.add_node( "search_web", search_web, retry=RetryPolicy( max_attempts=3, initial_interval=1.0, # seconds backoff_factor=2.0, retry_on=(TimeoutError, RateLimitError) ) ) Wrap tool execution with validation: async def safe_tool_executor(state: AgentState) -> dict: last_message = state["messages"][-1] for tool_call in last_message.tool_calls: try: # Validate tool exists tool = tool_map.get(tool_call["name"]) if not tool: return {"messages": [ToolMessage( content=f"Tool '{tool_call['name']}' does not exist. Available: {list(tool_map.keys())}", tool_call_id=tool_call["id"] )]} # Execute with timeout result = await asyncio.wait_for( tool.ainvoke(tool_call["args"]), timeout=30.0 ) return {"messages": [ToolMessage(content=str(result), tool_call_id=tool_call["id"])]} except ValidationError as e: return {"messages": [ToolMessage( content=f"Invalid arguments: {e}. Please fix and retry.", tool_call_id=tool_call["id"] )]} The agent sees the error message and self-corrects on the next iteration. This works surprisingly well — LLMs are good at fixing their own mistakes when given clear error messages. Guard against these at the graph level: def route_after_critique(state: ResearchState) -> str: # Hard cap on iterations if state["iteration"] >= 3: return "finalize" # Detect stuck state: same critique twice if state.get("prev_critique") == state["critique"]: return "finalize" return "revise" Also set a global timeout on the entire graph execution: result = await asyncio.wait_for( research_agent.ainvoke(initial_state, config), timeout=300.0 # 5 minute hard limit ) You cannot operate what you cannot see. LangSmith is the observability layer for LangGraph — think Datadog for agent workflows. Setup is two environment variables: export LANGCHAIN_TRACING_V2=true export LANGCHAIN_API_KEY=lsv2_... Every node execution, tool call, LLM invocation, and state transition is now traced automatically. No code changes required. What to monitor in production: # Custom metadata for filtering traces config = { "configurable": {"thread_id": "research-001"}, "metadata": { "user_id": "u_12345", "environment": "production", "agent_version": "2.1.0" }, "tags": ["research", "priority-high"] } Key metrics to track: Tokens per task: Set budgets. A research agent shouldn't exceed 50k tokens per run. Alert if it does. Iterations per completion: If your average is climbing, your prompts or critique logic are degrading. Tool call success rate: Below 95%? Your tool descriptions need work. Time to completion: Set SLOs. p50 under 30s, p99 under 120s. Human intervention rate: Track how often agents escalate. Trending up = model or prompt regression. Trending down = your agent is learning (or your thresholds are too loose). LangSmith also supports evaluation datasets — curated input/output pairs that you run nightly to catch regressions: from langsmith import Client client = Client() # Create a dataset of expected research outputs dataset = client.create_dataset("research-agent-evals") client.create_example( inputs={"question": "What is retrieval augmented generation?"}, outputs={"expected_sections": ["Executive Summary", "Key Findings"]}, dataset_id=dataset.id ) The framework landscape has matured significantly. Here's when to use what: Aspect LangGraph CrewAI AutoGen Architecture Graph-based, explicit control flow Role-based multi-agent Conversation-based multi-agent Best for Complex workflows, production systems Team simulation, parallel task delegation Research, multi-agent debate State management Built-in, typed, persistent Limited, via shared memory Conversation history Human-in-the-loop First-class (interrupt) Basic approval flows Chat-based intervention Observability LangSmith native Basic logging AutoGen Studio Learning curve Moderate (graph concepts) Low (intuitive role metaphor) Low-moderate Production readiness High Medium Medium Choose LangGraph when: You need fine-grained control over execution flow Persistence and checkpointing are requirements You're building a single agent with complex routing You need production-grade observability Choose CrewAI when: Your problem naturally decomposes into roles (researcher, writer, reviewer) You want rapid prototyping of multi-agent systems Team-based delegation is the core pattern Choose AutoGen when: You're building conversational multi-agent systems Agents need to debate or negotiate Research and experimentation are the primary goals Hybrid approach (what I recommend): Use LangGraph as the orchestration layer and implement individual "agents" within it as specialized nodes. You get the reliability of graph-based control flow with the flexibility to swap implementations. // langgraph.json { "graphs": { "research_agent": "./agent.py:research_agent" }, "dependencies": ["langchain-openai", "tavily-python"], "env": ".env" } langgraph dev # local development server with hot reload langgraph build # Docker image for deployment langgraph deploy # deploy to LangGraph Cloud The platform gives you a REST API, WebSocket streaming, cron triggers, and a built-in task queue — eliminating significant infrastructure work. Never make users stare at a spinner. Stream intermediate state: async for event in research_agent.astream_events(initial_state, config, version="v2"): if event["event"] == "on_chat_model_stream": # Token-level streaming for the writing step print(event["data"]["chunk"].content, end="", flush=True) elif event["event"] == "on_chain_end": # Node completion events node_name = event.get("name", "") print(f"\n[Completed: {node_name}]") import tiktoken class TokenBudget: def __init__(self, max_tokens: int = 50_000): self.max_tokens = max_tokens self.used = 0 self.encoder = tiktoken.encoding_for_model("gpt-4o") def check(self, text: str) -> bool: tokens = len(self.encoder.encode(text)) self.used += tokens if self.used > self.max_tokens: raise TokenBudgetExceeded(f"Used {self.used}/{self.max_tokens} tokens") return True Wire this into your LLM callbacks. When an agent hits its budget, force it to the finalize step with whatever it has. Never hardcode prompts in your node functions. Use a prompt registry: from langsmith import Client client = Client() # Pull versioned prompts from LangSmith Hub system_prompt = client.pull_prompt("research-agent/critique:v3") This lets you A/B test prompts, roll back bad deployments, and track which prompt version produced which outputs. Build fallback paths into your graph: def route_search_results(state: ResearchState) -> str: if not state["sources"]: return "fallback_generate" # LLM generates from knowledge if len(state["sources"]) < 3: return "search_again" # try different queries return "write_report" # proceed normally An agent that returns a partial result is infinitely more useful than one that throws a 500. The gap between an agent demo and a production agent is the same gap between a script and a service — error handling, observability, persistence, and operational controls. LangGraph gives you the primitives to bridge that gap: typed state, persistent checkpoints, conditional routing, human-in-the-loop interrupts, and native observability. It's opinionated enough to prevent common mistakes but flexible enough to model real workflows. Start with the simplest graph that solves your problem. Add checkpointing on day one — you'll thank yourself the first time a process crashes mid-run. Add human approval gates before any destructive action. Monitor token usage religiously. And version everything: prompts, tools, graph topology. The agents that succeed in production aren't the cleverest ones — they're the most predictable ones. If this article helped you, consider buying me a coffee on Ko-fi! Follow me for more AI engineering content.