arxiv_cs_lg April 24, 2026

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Translated: 2026/4/24 20:14:10

ray-tracing, computer-vision, video-generation, diffusion-models, camera-trajectory

Japanese Translation

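arXiv:2604.09429v3 Announce Type: replace-cross Abstract: Recovering camera parameters from images and re-rendering scenes from novel viewpoints have so far been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, because each task depends on the results produced by the other. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution of videos and camera trajectories. To our knowledge, it is the first model to perform camera pose prediction and camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as the video frames, and denoise both simultaneously through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a predefined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses agree with its renderings conditioned on those poses. Ablations against Plücker embeddings confirm that representing cameras in a latent space shared with video is substantially more effective.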

Original Content

arXiv:2604.09429v3 Announce Type: replace-cross Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is substantially more effective.
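
For readers unfamiliar with pixel-aligned camera encodings, the following is a minimal sketch of the standard Plücker ray map that the abstract names as the ablation baseline. It is not the paper's raxel encoding, whose exact form the abstract does not specify. It assumes a pinhole camera with intrinsics K and a 4x4 camera-to-world pose; the function names `pixel_rays` and `plucker_map` are hypothetical and chosen only for illustration.

```python
import numpy as np

def pixel_rays(K, cam_to_world, height, width):
    """Return per-pixel ray origins and unit directions in world coordinates."""
    # Pixel centers on a (height, width) grid, in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project through the pinhole intrinsics (assumed model).
    dirs_cam = pixels @ np.linalg.inv(K).T
    # Rotate directions into the world frame and normalize them.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Every ray of one camera starts at the camera center.
    origins = np.broadcast_to(t, dirs_world.shape)
    return origins, dirs_world

def plucker_map(origins, directions):
    """Stack (direction, moment) into a 6-channel, pixel-aligned ray image."""
    moments = np.cross(origins, directions)  # moment m = o x d
    return np.concatenate([directions, moments], axis=-1)  # (H, W, 6)

# Illustrative values only: a 64x48 image and a camera translated from the origin.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
ray_image = plucker_map(*pixel_rays(K, pose, height=48, width=64))
```

The result has one 6-vector per pixel, so it can be treated like an extra image aligned with the video frame. How such a map is brought into the same latent space as the video is the paper's contribution and is not reproduced here.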