arxiv_cs_cv February 10, 2026

ForecastOcc: Vision-based Semantic Occupancy Forecasting

Translated: 2026/3/15 19:03:37

forecasting, autonomous-driving, vision-based-forecasting, semantic-occupancy, arxiv-2602-08006

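Japanese Translation (rendered in English)

arXiv:2602.08006v1 Announce Type: new Abstract: Autonomous driving requires forecasting both geometry and semantics over time in order to reason effectively about future states of the environment. Existing vision-based occupancy forecasting methods focus on motion-related categories, such as static and dynamic objects, and carry almost no semantic information. Recent semantic occupancy forecasting methods close this gap, but they depend on past occupancy predictions obtained from separate networks. As a result, current methods are sensitive to accumulated error and cannot learn spatio-temporal features directly from images. In this paper, we present ForecastOcc, the first vision-based semantic occupancy forecasting framework to jointly predict future occupancy states and semantic categories. Our framework generates semantic occupancy forecasts for multiple time horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on the SemanticKITTI dataset, where we establish the first benchmark for this task. We create the first baselines by adapting two 2D forecasting modules within our framework. In addition, we propose a novel architecture that integrates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head that produces voxel-level forecasts across multiple time horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms the baselines, producing semantically rich, future-aware forecasts that capture the scene dynamics and semantics essential to autonomous driving.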

Original Content

arXiv:2602.08006v1 Announce Type: new Abstract: Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
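
The abstract names a temporal cross-attention forecasting module but does not describe its internals. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in which one query vector per forecast horizon attends over feature vectors from past frames at a single spatial location. This is an assumption about what such a module might look like, not the authors' implementation: the function names, the query-per-horizon design, and the toy feature values are all hypothetical, and the 2D-to-3D view transformer, 3D encoder, and semantic occupancy head are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_cross_attention(horizon_queries, past_features):
    """Generic scaled dot-product cross-attention over time (hypothetical sketch).

    horizon_queries: list of H query vectors, one per forecast horizon.
    past_features:   list of T feature vectors, one per past frame.
    Returns (fused, weights): fused holds H vectors, each a weighted
    average of the past features; weights holds H rows of T attention
    weights that sum to 1.
    """
    dim = len(past_features[0])
    fused, weights = [], []
    for query in horizon_queries:
        # Similarity between this horizon's query and every past frame.
        scores = [
            sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
            for frame in past_features
        ]
        row = softmax(scores)
        # Weighted average of past-frame features for this horizon.
        fused.append([
            sum(w * frame[i] for w, frame in zip(row, past_features))
            for i in range(dim)
        ])
        weights.append(row)
    return fused, weights


# Toy inputs (assumed values): T = 3 past frames, H = 2 horizons, D = 2.
past = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 2.0]]
fused, weights = temporal_cross_attention(queries, past)
```

In this toy setup each horizon ends up with its own mixture of past-frame features, which is the property that lets a single set of past observations feed forecasts at several horizons. A real module would apply this per spatial location with learned query, key, and value projections.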