Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right
If you're running an AI agent on Amazon Bedrock and injecting persistent memory into every conversation, where you put that memory in the request matters a lot — both for how well the agent uses it and for what it costs you.
I learned this the hard way while connecting agent-memory-daemon to OpenClaw running on Amazon Bedrock AgentCore Runtime. The setup works beautifully. My agent now remembers my preferences, my projects, and the weird Bedrock timeout I debugged three weeks ago. But along the way I hit a subtle interaction between memory injection and prompt caching that's worth documenting.
This post walks through the architecture, the Bedrock prompt caching rule that tripped me up, and the one-line fix that cut my cache-related costs dramatically.
OpenClaw lives in a container on AgentCore Runtime. AgentCore freezes the container when idle, which is great for cost (zero idle spend) but hostile to long-term memory (every wake is a blank slate). agent-memory-daemon solves this by running as a background process in the same container, doing two things:
Extraction — watches the session transcript directory and pulls out facts, decisions, and preferences worth remembering. Writes them as individual markdown files with YAML frontmatter.
Consolidation — periodically reorganizes the memory directory: merges duplicates, resolves contradictions, prunes stale content, and maintains a concise MEMORY.md index under a strict size budget.
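To make the extraction output concrete: a memory file is just Markdown with YAML frontmatter. The frontmatter fields and file name below are illustrative, not the daemon's actual schema:

```python
from pathlib import Path
from textwrap import dedent

# Illustrative memory file -- field names are examples, not the daemon's schema.
memory_file = dedent("""\
    ---
    type: preference
    created: 2026-04-20
    source: session-2026-04-20
    ---
    User prefers Haiku for routine tasks and Sonnet for code review.
    """)

path = Path("memories/user-model-preference.md")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(memory_file)
```

Because each memory is a small standalone file, consolidation can merge, rewrite, or delete them with ordinary file operations.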
Memory is synced to S3 between invocations. When a new conversation starts, the container restores the memory directory and reads MEMORY.md to bring the agent up to speed.
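The restore step can be sketched in a few lines. This is a minimal illustration, not OpenClaw's actual sync code, and the bucket layout (one object per memory file under a single prefix) is an assumption:

```python
import os

def restore_memory(s3, bucket, prefix, dest):
    """Download every object under `prefix` into `dest`, preserving relative
    paths. `s3` is a boto3 S3 client, injected so the function is testable."""
    paginator = s3.get_paginator("list_objects_v2")
    restored = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            rel = obj["Key"][len(prefix):].lstrip("/")
            local = os.path.join(dest, rel)
            os.makedirs(os.path.dirname(local) or ".", exist_ok=True)
            s3.download_file(bucket, obj["Key"], local)
            restored.append(rel)
    return restored
```

The upload direction is the mirror image: walk the local memory directory and put each changed file back under the same prefix.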
The daemon itself is cheap. It makes a few Haiku calls per day — my config targets about $0.25/month for the daemon's own LLM usage. The magic happens in what it produces: a curated, size-budgeted MEMORY.md that's always ~18KB regardless of how many sessions the agent has had.
Discord → EC2 bot → AgentCore Runtime → container
├── openclaw (the agent)
├── agent-memory-daemon (curator)
└── server.py (HTTP glue + S3 sync)
The daemon writes files. The agent reads files. The filesystem is the interface. No SDK, no API, no coupling.
On every invocation, I load MEMORY.md from S3 and pass it to OpenClaw as context. My first version looked like this:
memory_context = load_memory_from_s3()  # ~18KB of curated memory

effective_message = message
if memory_context:
    effective_message = (
        f"[LONG-TERM MEMORY - persisted memory from previous sessions]\n\n"
        f"{memory_context}\n\n"
        f"[END OF MEMORY]\n\n"
        f"User message: {message}"
    )

messages = [{"role": "user", "content": effective_message}]
I stuffed the memory into the user message. The agent saw it. It remembered my preferences. Everything worked.
I also had Bedrock prompt caching enabled through OpenClaw's config:
{
  "agents": {
    "defaults": {
      "models": {
        "amazon-bedrock/...claude-haiku-4-5...": {
          "params": { "cacheRetention": "short" }
        }
      }
    }
  }
}
Claude Haiku 4.5 supports prompt caching with a 5-minute TTL on the "short" retention mode. Cache reads are billed at ~10% of the regular input rate. On paper, my 18KB memory (~4,500 tokens) should have been getting served from cache at roughly a tenth of the price on every turn after the first.
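To put per-turn numbers on that, here is the arithmetic under an illustrative input price of $1 per million tokens (only the ~10% cache-read ratio comes from above; the absolute price is an assumption):

```python
MEMORY_TOKENS = 4_500     # ~18KB of curated memory
INPUT_PER_MTOK = 1.00     # illustrative full input price, $ per million tokens
CACHE_READ_RATIO = 0.10   # cache reads bill at ~10% of the input rate

full_price = MEMORY_TOKENS / 1e6 * INPUT_PER_MTOK
cached_price = full_price * CACHE_READ_RATIO

print(f"uncached: ${full_price:.5f} per turn")   # $0.00450 per turn
print(f"cached:   ${cached_price:.5f} per turn")  # $0.00045 per turn
```

Fractions of a cent per turn, but it compounds across every turn of every session.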
Then I looked at Cost Explorer.
Three days of usage, broken down by token type:
| Line item | Tokens (millions) | Cost |
| --- | --- | --- |
| Cache Read | 12.69 | $1.40 |
| Cache Write | 7.09 | $9.75 |
| Input (uncached) | 31.91 | $35.10 |
| Output | 4.72 | $25.96 |
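Those rows are internally consistent, and you can sanity-check the multipliers straight from them: dividing each line's cost by its tokens gives the implied per-million rates.

```python
# (tokens in millions, cost in $) straight from the Cost Explorer breakdown
bill = {
    "cache_read":  (12.69, 1.40),
    "cache_write": (7.09,  9.75),
    "input":       (31.91, 35.10),
}
rates = {k: cost / mtok for k, (mtok, cost) in bill.items()}  # $ per MTok

print(f"input rate:  ${rates['input']:.2f}/MTok")                   # $1.10/MTok
print(f"read/input:  {rates['cache_read'] / rates['input']:.2f}")   # 0.10
print(f"write/input: {rates['cache_write'] / rates['input']:.2f}")  # 1.25
```

The ~10% read and 1.25x write multipliers fall straight out of the bill, which confirms caching itself was priced as expected. The problem was what wasn't being cached.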
The "Input (uncached)" line is the one that doesn't make sense if caching is working. I had 12.69M cache reads, which meant something was being cached — OpenClaw's internal system prompt was getting cached fine. But 31.91M tokens were paying full input price. Where were they coming from?
Here's the rule that trips people up: Bedrock prompt caching caches a stable prefix. It looks at the beginning of the request, finds the longest chunk that's identical to a previously-cached request, and serves that from cache. Everything after the divergence point is recomputed and billed as regular input.
Now look at my code again:
messages = [{"role": "user", "content": effective_message}]
effective_message is "[LONG-TERM MEMORY]...18KB of memory...User message: {message}". The user's actual question is appended at the end.
What Bedrock sees:
Turn 1: messages[0].content = "[MEMORY]...same 18KB...User message: what time is it?"
Turn 2: messages[0].content = "[MEMORY]...same 18KB...User message: tell me a joke"
Those two strings share a stable 18KB prefix of memory content, but all of it lives inside messages[0].content, and cache boundaries fall on whole content blocks. The cacheable prefix is therefore only what OpenClaw builds on top: its own system content, its tool definitions, its skill metadata. Once the request stream reaches the user message block, Bedrock sees variance (the actual user question) and stops caching, and the identical memory text buried inside that varying block gets no credit.
So the memory was sitting in a position where it couldn't be cached. Every turn paid full price for those 4,500 tokens.
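A toy reproduction makes the divergence concrete (the repeated line stands in for the real 18KB of memory):

```python
import os.path

memory = "[LONG-TERM MEMORY]\n" + "fact: prefers Haiku\n" * 200 + "[END OF MEMORY]\n"
turn1 = memory + "User message: what time is it?"
turn2 = memory + "User message: tell me a joke"

# The strings are identical through the memory block and "User message: "...
shared = len(os.path.commonprefix([turn1, turn2]))
assert shared == len(memory) + len("User message: ")
# ...but that prefix sits inside the varying user-content block, so no cache
# checkpoint covers it and every byte of it bills as fresh input.
```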
The change is small. Move the memory to a system message, before the user message:
messages = []
if memory_context:
    messages.append({
        "role": "system",
        "content": (
            "You have access to long-term memory from previous sessions. "
            "Use this to answer questions about the user's preferences and history.\n\n"
            f"{memory_context}"
        ),
    })
messages.append({"role": "user", "content": message})
Now the memory is part of the stable system prefix. It sits alongside OpenClaw's own system prompt, tool definitions, and skills — the stuff that genuinely doesn't change between turns. Bedrock sees the same system block on every request and serves it from cache at 10% of the regular rate.
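OpenClaw assembles the actual request, but if you were calling Bedrock's Converse API directly, the equivalent layout would put the memory in the system blocks with an explicit cachePoint checkpoint after it. A sketch, with a placeholder model ID:

```python
def build_converse_request(memory_context: str, message: str) -> dict:
    """Sketch: memory in the system blocks, followed by an explicit
    cachePoint so Bedrock caches everything above it."""
    return {
        "modelId": "us.anthropic.claude-haiku-4-5...",  # placeholder ID
        "system": [
            {"text": "You have access to long-term memory from previous "
                     "sessions. Use it for the user's preferences and history."},
            {"text": memory_context},
            {"cachePoint": {"type": "default"}},  # checkpoint: cache the prefix
        ],
        "messages": [
            # Only this block varies turn to turn, so it never busts the cache.
            {"role": "user", "content": [{"text": message}]},
        ],
    }
```

Everything above the cachePoint is the stable, cacheable prefix; only the user content block below it changes between turns.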
A one-line architectural change. A 90% discount on the biggest line item in the bill.
After deploying, I asked OpenClaw for its usage stats via the /usage full chat command:
🦞 OpenClaw 2026.2.26
🧮 Tokens: 9 in / 516 out
🗄️ Cache: 99% hit · 67k cached, 715 new
📚 Context: 34k/200k (17%)
67K tokens served from cache, only 715 new tokens computed. Before the fix, the 4,500-token memory injection was in the "new" bucket every turn. Now it's in the 67K cached bucket.
The change to Cost Explorer followed. The "Input (uncached)" line dropped, and the "Cache Read" line absorbed that traffic at a tenth of the price.
1. Prompt caching only caches a stable prefix. Everything up to the first point of variance between requests is cacheable. Everything after is not. If you're injecting repeated context, put it early in the request — system prompt, tool definitions, or the first message of a consistent message sequence.
2. User content is almost always the wrong place for stable context. The user's actual question varies every turn. Anything you concatenate with it inherits that variance and becomes uncacheable. Pull it out into a system message.
3. Watch cache writes in your bill. Cache writes cost more than regular input (1.25x on Haiku 4.5). If you see high cache writes, it means your TTL is expiring between requests and the cache is being rewritten each time. Keep the cache warm — for cacheRetention: "short" (5-min TTL), a heartbeat every ~4 minutes avoids cold-cache rewrites.
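Whether a heartbeat pays for itself is just arithmetic. A rough model, using the ~67k-token cached prefix reported by /usage and an illustrative $1 per million token input price (the 0.10 and 1.25 multipliers are the ones discussed above):

```python
PREFIX_TOKENS = 67_000    # cached prefix size, from the /usage stats
INPUT_PER_MTOK = 1.00     # illustrative input price, $ per million tokens
READ, WRITE = 0.10, 1.25  # billing multipliers vs. the input rate

read_cost = PREFIX_TOKENS / 1e6 * INPUT_PER_MTOK * READ    # one warm heartbeat
write_cost = PREFIX_TOKENS / 1e6 * INPUT_PER_MTOK * WRITE  # one cold rewrite

heartbeats_per_hour = 60 / 4        # one ping every ~4 minutes
hourly = heartbeats_per_hour * read_cost
breakeven = hourly / write_cost     # cold rewrites avoided to break even
print(f"heartbeat ~${hourly:.2f}/hr; breaks even at {breakeven:.1f} cold rewrites/hr")
```

In other words, the heartbeat wins only in hours where you would otherwise pay for more than one or two cold rewrites; across long idle stretches it is cheaper to just let the cache expire.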
None of this is a critique of agent-memory-daemon — the daemon did exactly what it was supposed to do. It produced a stable, size-budgeted 18KB memory file. The integration code I wrote around it was putting that output in the wrong place.
In fact, the daemon's design (stable output size, consistent content, regular regeneration rhythm) is ideal for prompt caching. As long as you feed it into a system message, Bedrock can cache the whole thing for the TTL window, and the daemon's periodic consolidation doesn't bust the cache more often than necessary.
If you're running OpenClaw or any agent on Bedrock and want persistent memory without a managed memory service, the pattern works well:
1. Run agent-memory-daemon alongside your agent
2. Sync the memory directory to S3 between sessions (or use a mounted filesystem if available)
3. Load the curated MEMORY.md at the start of each conversation
4. Inject it as a system message, not user content
5. Enable cacheRetention on your model config
The daemon handles the hard part (curating memories without bloat). Bedrock handles the cheap part (caching the stable prefix). You just have to put the memory in the right place.
agent-memory-daemon
Open-source memory manager daemon for AI agents
Open-source memory consolidation and extraction daemon for AI agents. Filesystem-native, LLM-pluggable, framework-agnostic.
Agents feed it raw observations as markdown files; the daemon runs two complementary modes:
Consolidation — periodically reorganizes, deduplicates, and prunes existing memory files via a four-phase pass (orient → gather → consolidate → prune)
Extraction — watches for new session content and runs an LLM pass to identify facts, decisions, preferences, and error corrections worth remembering, writing them as individual memory files
The filesystem is the interface — no SDK, no API, no MCP required. The LLM backend is pluggable (OpenAI, Amazon Bedrock, or anything with a chat API).
memconsolidate is a standalone, agent-agnostic daemon — available to anyone building with OpenClaw, Strands, LangChain, or any custom agent framework.
How it works
Consolidation (reorganize existing memories)
Agents write markdown memory files (with YAML frontmatter) to a watched directory
A three-gate…
View on GitHub
openclaw-agentcore-personal
Deploy Your Personal OpenClaw on AWS AgentCore — Serverless, ~$9/month
Cost-optimized OpenClaw deployment using AWS Bedrock AgentCore Runtime. Connect via Discord, WhatsApp, Telegram, or Slack. ~$9-15/month infrastructure.
What Is This?
A single-user, serverless deployment of OpenClaw on AWS. Instead of running an EC2 instance 24/7, the AI runs on-demand via AgentCore Runtime — the container freezes between invocations, so you only pay when you use it.
All messaging plugins (WhatsApp, Telegram, Discord, Slack) are pre-installed in OpenClaw. This template includes a Discord bot by default, but you can connect any platform directly through the OpenClaw Web UI.
Architecture
You (Discord / WhatsApp / Telegram / Slack)
│
▼
┌──────────────────────────────────────────────────────────┐
│ AWS Cloud │
│ │
│ EC2 t4g.nano ──invoke──▶ AgentCore Runtime │
│ (Discord bot) (OpenClaw container) │
│ │ │
│ IAM Role │
│ │ │
│ Bedrock │
│ (Haiku/Sonnet/Nova) │
│ │
│ ┌─────────┐ ┌──────────┐ ┌─────────┐
…
View on GitHub
The repo contains the full AgentCore deployment, including the system-message fix and the S3 sync layer.
Part of the OpenClaw Challenge.