dev_to April 25, 2026

The Hidden 43% — How Teams Waste Half Their LLM API Budget

llm, api-cost-optimization, openai, prompt-engineering, observability

The provider dashboards show you one number: your total bill. That's like getting an electricity bill with no breakdown. You just see the total and hope nobody left the AC on.

Tbh, if you look closely at your API logs, you are probably wasting around 43% of your budget. I spent the last few weeks analyzing LLM usage across different teams, and the same leaks show up everywhere. Here is where your money is actually going:

1. Your prompt fails to return valid JSON. The agent retries. It fails again. Next thing you know, your while-loop has fired 40 times. At 10k tokens a pop on Claude 3.5 Sonnet, that single user interaction just burned 400k tokens.
2. Users ask the same questions. Without semantic caching, you are paying OpenAI to generate the exact same answer 100 times a day.
3. Every single request sends the entire chat history without truncation. You only need the last few turns, but your wrapper is sending 50k tokens "just in case."
4. You are using GPT-4o for basic routing or classification tasks that a much smaller, cheaper model could do 10x faster.

You can't fix what you can't see. If you don't have per-tenant cost attribution, you are flying blind. You need to know exactly which user, model, and feature is burning tokens.

I built LLMeter (open source, AGPL-3.0) to solve this. It tracks costs per model, per user, per day. It connects directly to OpenAI, Anthropic, DeepSeek, and OpenRouter to give you the exact breakdown without needing to route your traffic through a proxy.

Stop guessing. Track your per-user costs: https://llmeter.org?utm_source=devto&utm_medium=article&utm_campaign=devto-hidden-43-percent
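
To make the retry leak concrete, here is a minimal sketch of a capped retry loop that also counts the tokens it spends. The function names, the `call_model` callable, and the price constant are all assumptions for illustration; the price is a placeholder you should replace with your provider's current rate, and only input tokens are counted.

```python
import json
from typing import Any, Callable, Optional

# Assumed price for illustration only; check your provider's current pricing.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 3.00


def call_with_retry_budget(
    call_model: Callable[[], str],
    tokens_per_attempt: int,
    max_attempts: int = 3,
) -> tuple[Optional[Any], int]:
    """Call the model until it returns valid JSON or the attempt cap is hit.

    Returns (parsed_json_or_None, total_input_tokens_spent).
    """
    tokens_spent = 0
    for _ in range(max_attempts):
        tokens_spent += tokens_per_attempt  # every attempt is billed
        raw = call_model()
        try:
            return json.loads(raw), tokens_spent
        except json.JSONDecodeError:
            continue  # malformed output: retry, but only up to the cap
    return None, tokens_spent


def input_cost_usd(tokens: int) -> float:
    """Input-token cost under the assumed price above."""
    return tokens * ASSUMED_USD_PER_MILLION_INPUT_TOKENS / 1_000_000
```

With the numbers from the article, 40 attempts at 10k tokens is 400k input tokens, roughly $1.20 under the assumed price, for one user interaction. With `max_attempts=3` the same failure stops at 30k tokens, and the caller gets `None` back so it can fall back or surface an error instead of looping.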
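
For the repeated-question leak, the article points at semantic caching, which needs an embedding model and a similarity threshold. The sketch below is a deliberately simpler stand-in: an exact-match cache keyed on a normalized prompt (lowercased, whitespace collapsed). It only catches near-verbatim repeats, and the class name and the `generate` callable are hypothetical.

```python
import hashlib
from typing import Callable


class NormalizedPromptCache:
    """Exact-match cache on a normalized prompt. Not semantic caching."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate  # hypothetical function that calls the LLM
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def answer(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._generate(prompt)  # only cache misses are billed
        self._store[key] = result
        return result
```

The `hits` and `misses` counters give a quick read on how much traffic is repeated before you invest in an embedding-based cache. A production version would also need expiry and a size limit.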
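
The chat-history leak can be bounded by keeping system messages plus only the most recent turns that fit a token budget. This is a sketch under assumptions: messages are dicts with `role` and `content` keys, and `rough_token_count` is a crude heuristic of about four characters per token that you would replace with a real tokenizer.

```python
from typing import Callable

Message = dict[str, str]


def rough_token_count(text: str) -> int:
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def truncate_history(
    messages: list[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[Message]:
    """Keep system messages plus the newest turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)

    kept: list[Message] = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break  # stop at the first turn that does not fit
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order
```

Dropping old turns is the bluntest option. Summarizing older turns into a short system note is a common alternative when early context still matters.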
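
Per-tenant cost attribution comes down to tagging each request with who made it and what it was for, then grouping. The sketch below is not LLMeter's implementation. The record fields, model names, and prices are hypothetical placeholders, and it assumes you already log input and output token counts per request.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical (input, output) prices in USD per million tokens. Placeholders only.
ASSUMED_PRICES = {
    "big-model": (3.00, 15.00),
    "small-model": (0.25, 1.25),
}


@dataclass(frozen=True)
class UsageRecord:
    user_id: str
    model: str
    feature: str
    input_tokens: int
    output_tokens: int


def record_cost_usd(record: UsageRecord) -> float:
    """Cost of one request under the assumed price table."""
    in_price, out_price = ASSUMED_PRICES[record.model]
    return (record.input_tokens * in_price + record.output_tokens * out_price) / 1_000_000


def cost_by(records: list[UsageRecord], key: str) -> dict[str, float]:
    """Total cost grouped by one attribute: 'user_id', 'model', or 'feature'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record_cost_usd(record)
    return dict(totals)
```

Calling `cost_by(records, "user_id")`, `cost_by(records, "model")`, and `cost_by(records, "feature")` on the same log gives the three breakdowns the article asks for. The feature view is also where the oversized-model leak tends to surface, for example a routing feature billed at big-model prices.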