arxiv_cs_cv April 24, 2026

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

Translated: 2026/4/24 19:51:43

automated-train-operation, railway-visual-cognition, large-multi-modal-models, visual-question-answering, transportation-safety

Original Content

arXiv:2603.27112v2 Announce Type: replace

Abstract: As Automatic Train Operation (ATO) advances toward GoA4 and beyond, it increasingly depends on efficient, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling more efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, improves efficiency, and strengthens cross-domain generalization in autonomous driving systems. Code and datasets will be available at https://cybereye-bjtu.github.io/RailVQA.html.