arxiv_cs_ai April 24, 2026

Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways

api-gateways, large-language-models, black-box-testing, model-consistency, transparency-analysis

Original Content

arXiv:2604.21083v1 (Announce Type: cross)

Abstract: Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.
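To make two of the four audit dimensions concrete, the following is a minimal sketch of how a billing-accuracy check and a latency-stability check might be computed from black-box observations. It is not GateScope's implementation, which the abstract does not describe. The `Price` type, the function names, the per-million-token pricing unit, and the choice of coefficient of variation as the stability metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Price:
    """Hypothetical published price, in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def expected_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost implied by the published price for one request."""
    total = (input_tokens * price.input_per_mtok
             + output_tokens * price.output_per_mtok)
    return total / 1_000_000


def billing_deviation(price: Price, input_tokens: int,
                      output_tokens: int, billed_usd: float) -> float:
    """Relative gap between the billed amount and the expected cost.

    Positive values mean the gateway charged more than the published
    price implies; negative values mean it charged less.
    """
    expected = expected_cost(price, input_tokens, output_tokens)
    if expected == 0:
        raise ValueError("expected cost is zero; deviation is undefined")
    return (billed_usd - expected) / expected


def latency_cv(latencies_s: list[float]) -> float:
    """Coefficient of variation (population std / mean) of latencies.

    Assumed here as one simple stability metric: higher means less stable.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    avg = mean(latencies_s)
    if avg == 0:
        raise ValueError("mean latency is zero; CV is undefined")
    return pstdev(latencies_s) / avg
```

In use, an auditor would record token counts and the billed amount for each request, then flag requests whose `billing_deviation` exceeds a chosen tolerance, and compare `latency_cv` across gateways serving the same model. Both the tolerance and the token-counting method are left open here because the abstract does not specify them.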