arxiv_cs_lg April 24, 2026

Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

Tags: llama, few-shot-learning, machine-learning, natural-language-processing, parameter-efficient

Original Content

arXiv:2604.20148v1 Announce Type: cross

Abstract: Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms (few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search) across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at $10\times$ lower latency. Error analysis across 722 failure cases spanning all shot counts (0 to 5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors with remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.
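The abstract credits most of the gain to few-shot examples and tool documentation placed in the prompt. As a rough illustration of that idea, the sketch below assembles such a prompt from a documentation string and a list of demonstrations. It is not the paper's prompt format: the `ToolExample` type, the `build_prompt` function, the `### ...` section markers, and the shell-style tool calls are all hypothetical names chosen for this example. The `num_shots` parameter mirrors the 0 to 5 shot range mentioned in the abstract.

```python
from dataclasses import dataclass


@dataclass
class ToolExample:
    """One demonstration: a user request and the tool call that answers it."""
    request: str
    tool_call: str


def build_prompt(documentation: str, examples: list[ToolExample],
                 query: str, num_shots: int = 5) -> str:
    """Assemble documentation, up to num_shots demonstrations, and the query.

    The layout (section markers, ordering) is an assumption for illustration,
    not the format used in the paper.
    """
    if num_shots < 0:
        raise ValueError("num_shots must be non-negative")
    parts = []
    if documentation:
        parts.append("### Tool documentation\n" + documentation.strip())
    # Keep only the first num_shots demonstrations (0 gives a zero-shot prompt).
    for ex in examples[:num_shots]:
        parts.append(f"### Request\n{ex.request}\n### Tool call\n{ex.tool_call}")
    # The final block leaves the tool call open for the model to complete.
    parts.append(f"### Request\n{query}\n### Tool call\n")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        ToolExample("List files in /tmp", "ls /tmp"),
        ToolExample("Show the current directory", "pwd"),
    ]
    docs = "ls: list directory contents\npwd: print working directory"
    print(build_prompt(docs, demos, "Count lines in notes.txt", num_shots=2))
```

Under these assumptions, an ablation of the kind the abstract describes would amount to calling `build_prompt` with `num_shots=0`, with an empty `documentation` string, or with both, and comparing task accuracy across the resulting prompts.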