arxiv_cs_cv February 10, 2026

Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension

Translated: 2026/3/15 19:05:19

chain-of-caption, multimodal-large-language-model, referring-expression-comprehension, mlllm, tool-use

Original Content

arXiv:2602.08211v1 Announce Type: new

Abstract: Given a textual description, the task of referring expression comprehension (REC) is to localise the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks by scaling up model size and training data. Their performance can be improved further with techniques such as Chain-of-Thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse how various techniques for supplying additional visual and textual context to the MLLM via tool use affect performance on the REC task. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on the RefCOCO/RefCOCOg/RefCOCO+ and Ref-L4 datasets and show that individual textual or visual context can improve REC performance without any fine-tuning. By combining multiple contexts, our training-free framework achieves gains of between 5% and 30% over the baseline model in accuracy at various Intersection over Union (IoU) thresholds.
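The abstract reports accuracy at various IoU thresholds without spelling out the metric. The sketch below shows one common way such a number is computed for REC. It is not taken from the paper: the corner box format `(x1, y1, x2, y2)`, the `>=` comparison, the example thresholds, and the function names `iou` and `accuracy_at_iou` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold.

    Assumption: a prediction counts as correct when IoU >= threshold;
    some evaluation scripts use a strict > instead.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    for t in (0.5, 0.75, 0.9):  # illustrative thresholds
        print(f"Acc@{t}: {accuracy_at_iou(predictions, ground_truth, t):.2f}")
```

Raising the threshold demands tighter localisation, so reporting accuracy at several thresholds separates models that find roughly the right region from those that place the box precisely.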