Architecting Agentic Systems Without Multiplying Costs: A Real Healthcare Story
The Message That Started It All
It was early on a Monday morning when a message appeared in a patient portal:
"I've had sharp lower back pain for 3 days. Should I be worried?"
At first glance, this looks like a simple request. But in a real healthcare system, answering it correctly requires layered reasoning. The system must interpret symptoms, consider prior medical history, evaluate risk, and apply clinical guidelines before making a recommendation.
At a national healthcare provider, thousands of these messages arrive every day. To handle this scale, the engineering team built an agentic AI system.
An agentic system is different from a simple AI response system. Instead of generating an answer in one step, it performs a sequence of reasoning actions. It plans what to do, retrieves information, analyzes context, validates decisions, and then produces an output. In many ways, it behaves like a workflow of intelligent steps rather than a single prediction.
The system worked extremely well.
Until the cost became impossible to ignore.
Each request triggered a structured reasoning workflow:
Plan → Retrieve → Analyze → Validate → Respond
Each of these steps required invoking a large language model.
When the team analyzed usage, they found that the system handled around 50,000 requests per day. Each request triggered about five reasoning steps, and each step processed roughly 1,000 tokens.
A token is a unit of text used by language models. It can represent a word, part of a word, or even punctuation. Model pricing is typically based on the number of tokens processed.
This meant the system was processing approximately 250 million tokens per day.
At flagship model pricing, this resulted in a monthly cost exceeding $60,000.
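As a sanity check, here is that arithmetic in plain Python (the $8 per million tokens figure is an assumed midpoint of the flagship range below, not a quoted price):

```python
# Back-of-the-envelope cost check for the workload described above.
requests_per_day = 50_000
steps_per_request = 5          # Plan, Retrieve, Analyze, Validate, Respond
tokens_per_step = 1_000
price_per_million = 8.0        # assumed midpoint of the $7-$9 flagship range

tokens_per_day = requests_per_day * steps_per_request * tokens_per_step
monthly_cost = tokens_per_day / 1_000_000 * price_per_million * 30

print(f"{tokens_per_day:,} tokens/day")  # 250,000,000 tokens/day
print(f"${monthly_cost:,.0f}/month")     # $60,000/month
```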
| Strategy | Model | Cost per 1M Tokens | Monthly Cost | Scaling Behavior |
| --- | --- | --- | --- | --- |
| Flagship API | Claude / Gemini Pro | $7–$9 | $50K–$70K | Linear |
| Managed Small Models | Haiku / Flash | ~$1 | ~$7,500 | Linear |
| Self-Hosted | Distilled 8B + vLLM | Fixed | ~$450 | Step-function |
The critical observation is not just the cost, but how it scales.
API-based systems charge per token, so costs increase linearly with usage. Self-hosted systems, on the other hand, rely on infrastructure. Once the hardware is provisioned, additional usage does not immediately increase cost.
The cost explosion is caused by how agentic workflows are executed in API-based systems.
Each reasoning step is stateless. A stateless system does not remember previous interactions. This means that every step must include all relevant context again.
This includes system instructions, intermediate reasoning, and retrieved data. As a result, the same information is processed repeatedly.
This leads to a compounding effect where one user request results in multiple expensive computations. The more steps the system performs, the more the cost multiplies.
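A minimal sketch of that compounding effect, with a stub standing in for the per-token-billed API (the names here are hypothetical, not the team's code):

```python
# Illustration only: why stateless steps compound cost.
SYSTEM_PROMPT = "You are a clinical triage assistant. " * 50  # large fixed prompt

def call_llm(context: str, task: str) -> str:
    """Hypothetical stand-in for a flagship-model API call."""
    return f" [{task} output derived from {len(context.split())} words of context]"

def run_workflow(user_message: str) -> int:
    context = SYSTEM_PROMPT + user_message
    billed = 0
    for step in ("plan", "retrieve", "analyze", "validate", "respond"):
        billed += len(context.split())      # stateless: full context re-billed
        context += call_llm(context, step)  # context only ever grows
    return billed

print(run_workflow("Sharp lower back pain for 3 days."),
      "words billed for a single request")
```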
At scale, the team realized they were not just building an AI feature. They were designing a system.
The problem was not the intelligence itself, but how that intelligence was being delivered.
They reframed the problem:
How do we make reasoning efficient instead of repeating it?
Large, state-of-the-art models, often called frontier models, are still essential.
A frontier model is a highly capable, large-scale AI model trained on vast amounts of data. These models are powerful but expensive to run.
Instead of using them for every request, the team used them in a different role.
During the training phase, these models acted as teachers. They generated high-quality reasoning examples, including step-by-step explanations. These examples were then used to train a smaller model.
This process is called distillation.
Distillation is the process of transferring knowledge from a large model (teacher) to a smaller model (student), allowing the smaller model to mimic the behavior of the larger one at a lower cost.
In production, the smaller model handled most requests. Only complex cases were escalated to a frontier model.
Distillation allows the system to retain the reasoning ability of large models while dramatically reducing cost. By training on high-quality examples, the smaller model learns how to think in a structured way.
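A common shape for such a distillation pipeline is sketched below; the prompts, file name, and `teacher_generate` helper are illustrative, not the team's actual tooling:

```python
import json

def teacher_generate(case: str) -> str:
    """Stand-in for a frontier-model call that returns a step-by-step
    reasoning trace (in practice: a Claude or Gemini API call)."""
    return f"Step 1: interpret symptoms in '{case}'. Step 2: ... Final: triage level."

# 1. Collect high-quality reasoning traces from the teacher model.
cases = ["sharp lower back pain for 3 days", "persistent dry cough, 2 weeks"]
dataset = [{"prompt": c, "completion": teacher_generate(c)} for c in cases]

# 2. Persist them as fine-tuning data for the small student model.
with open("distillation_set.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")

# 3. The student (e.g. an 8B model) is then fine-tuned on this file,
#    learning to reproduce the teacher's structured reasoning.
```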
QLoRA stands for Quantized Low-Rank Adaptation.
To understand this, it helps to break it down.
A model contains millions or billions of parameters. Training or modifying all of them is expensive.
LoRA (Low-Rank Adaptation) introduces small, trainable components that adjust the model’s behavior without modifying the entire model.
QLoRA goes further by compressing (quantizing) the base model into lower precision, which reduces memory usage and allows efficient training.
In practice, QLoRA enables the creation of small, specialized modules that guide the model for specific tasks, such as generating SQL queries or applying clinical reasoning.
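A minimal QLoRA setup sketch using the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack (the base model name and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative 8B base model
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)

# Attach small trainable low-rank adapters (the "LoRA" part).
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```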
vLLM is an inference engine designed to serve language models efficiently.
An inference engine is the system responsible for running the model and generating outputs.
vLLM introduces several important optimizations.
One key concept is the KV cache, which stores intermediate computations from the model’s attention mechanism. Instead of recomputing everything for each step, the system reuses these cached values.
Another concept is PagedAttention, which manages memory efficiently by treating it like virtual memory, similar to how operating systems manage RAM.
Together, these techniques allow vLLM to handle many requests efficiently on the same hardware.
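Serving a model with vLLM takes only a few lines; a sketch, assuming a distilled model at an illustrative local path:

```python
from vllm import LLM, SamplingParams

# vLLM batches concurrent requests and reuses KV-cache memory via PagedAttention.
llm = LLM(model="./distilled-8b-triage")  # illustrative local model path
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Patient reports sharp lower back pain for 3 days. Triage:",
    "Patient reports mild seasonal allergies. Triage:",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```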
Instead of using a single large model, the system was redesigned as a workflow.
A shared base model provides general reasoning ability. On top of this, specialized components guide the model’s behavior for specific tasks.
These components are lightweight and can be loaded dynamically, allowing the system to adapt to different steps in the workflow.
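vLLM can serve multiple LoRA adapters on a single shared base model, which is one way to implement this pattern; a sketch with illustrative adapter names and paths:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One shared base model; per-task adapters are selected per request.
llm = LLM(model="./distilled-8b-triage", enable_lora=True)
params = SamplingParams(max_tokens=128)

# Route a SQL-generation step through the SQL adapter...
sql_out = llm.generate(
    "Generate a query for this patient's last 3 visits:",
    params,
    lora_request=LoRARequest("sql_adapter", 1, "./adapters/sql"),
)

# ...and a clinical-reasoning step through a different adapter.
triage_out = llm.generate(
    "Assess risk for: sharp lower back pain, 3 days.",
    params,
    lora_request=LoRARequest("triage_adapter", 2, "./adapters/triage"),
)
```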
When a patient submits a request, it first passes through an API gateway, which handles authentication and routing.
The system then determines what information is needed. It uses a specialized component to generate a query that retrieves patient data from the medical database.
Once the data is retrieved, the system transitions to a reasoning phase. It analyzes symptoms, considers patient history, and applies clinical guidelines.
Throughout this process, the system maintains internal state. This allows it to reuse previous computations rather than starting from scratch each time.
Finally, the system produces a triage decision.
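Put together, the request path might look like the sketch below; every function here is a hypothetical stub, not the provider's actual service code:

```python
from dataclasses import dataclass, field

@dataclass
class TriageState:
    """State carried across steps so earlier work is reused, not recomputed."""
    message: str
    patient_data: dict = field(default_factory=dict)
    assessment: str = ""
    decision: str = ""

# Stubs standing in for the real components (all hypothetical).
def generate_query(msg: str, patient_id: str) -> str:
    return f"SELECT * FROM visits WHERE patient_id = '{patient_id}' LIMIT 3"

def fetch_records(query: str) -> dict:
    return {"history": ["2024: lumbar strain"], "query": query}

def assess(state: TriageState) -> str:
    return "low-risk musculoskeletal pain per guidelines"

def handle_request(message: str, patient_id: str) -> TriageState:
    state = TriageState(message=message)
    # 1. Query generation (specialized adapter) + retrieval.
    state.patient_data = fetch_records(generate_query(message, patient_id))
    # 2. Reasoning over symptoms, history, and clinical guidelines.
    state.assessment = assess(state)
    # 3. Triage decision, with a gate that escalates ambiguous cases.
    state.decision = "self-care advice" if "low-risk" in state.assessment else "escalate"
    return state

print(handle_request("sharp lower back pain for 3 days", "p-123").decision)
```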
With a workload of approximately 250 million tokens per day, the system must process around 3,000 tokens per second on average.
Modern GPU systems can handle significantly higher throughput when optimized correctly. Techniques such as batching and efficient memory management allow the system to process multiple requests simultaneously.
This means that a single GPU can handle the baseline load, with additional hardware added as needed.
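The capacity arithmetic, with an assumed per-GPU throughput figure for illustration rather than a measured benchmark:

```python
tokens_per_day = 250_000_000
required_tps = tokens_per_day / 86_400             # seconds in a day
print(f"required: ~{required_tps:,.0f} tokens/s")  # ~2,894 tokens/s

# Assumed sustained throughput for an optimized 8B model served with
# vLLM batching on a single modern GPU (illustrative figure only).
assumed_gpu_tps = 5_000
gpus_needed = -(-tokens_per_day // (assumed_gpu_tps * 86_400))  # ceiling division
print(f"GPUs needed at that rate: {gpus_needed}")               # 1
```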
The system can be deployed on platforms like AWS or GCP.
On AWS, the system uses GPU-enabled instances for inference, along with services for storage and data processing. AWS also provides access to frontier models through services like Bedrock.
On GCP, the system often uses Kubernetes for orchestration, allowing flexible scaling. GCP’s data tools are particularly strong for large-scale data processing.
In both cases, the inference engine remains self-hosted, ensuring control over cost and performance.
This architecture introduces complexity. Coordinating multiple reasoning components requires careful design.
There is also a balance between efficiency and capability. Smaller models are faster and cheaper but may struggle with edge cases. This is why fallback mechanisms are important.
Maintaining system state across multiple steps also requires careful engineering.
To ensure reliability, the system must be monitored continuously.
Metrics such as latency, throughput, and cost provide insight into performance. Equally important is evaluating the quality of decisions.
This often involves comparing outputs against benchmarks or involving human reviewers.
Over time, the system can be improved by retraining models and refining workflows.
After redesigning the system, the results were dramatic.
The monthly cost dropped from over $60,000 to under $500. Performance improved, and the system became more scalable and reliable.
Agentic systems are not expensive because of the intelligence they use.
They are expensive because of how that intelligence is orchestrated.
The goal is not to eliminate powerful models.
It is to use them strategically rather than operationally.
By doing so, organizations can build systems that are both powerful and sustainable.
👇 What part of your AI system is driving the most cost today?