arxiv_cs_cv February 10, 2026

ReRoPE: Repurposing RoPE for Relative Camera Control

Translated: 2026/3/15 19:04:22

arxiv-2602-08068, video-generation, camera-control, rotary-positional-embeddings, diffusion-models

Japanese Translation

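arXiv:2602.08068v1 Announce Type: new

Abstract: Video generation with controllable camera angles is essential for applications such as interactive content creation, games, and simulation. Existing methods adapt pre-trained video models using camera poses defined relative to a fixed reference (e.g., the first frame). However, these encodings lack shift invariance, so they generalize poorly or accumulate drift. Relative camera pose embeddings defined between arbitrary view pairs are a more robust alternative, but integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains a challenge. ReRoPE introduces a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without impairing their generation capability. Our approach builds on the insight that rotary positional embeddings (RoPE) in existing models do not fully use their spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underused bands, ReRoPE achieves high-precision control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results show that ReRoPE offers a training-efficient path to controllable, high-fidelity video generation. See the project page for detailed results: https://sisyphe-lee.github.io/ReRoPE/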

Original Content

arXiv:2602.08068v1 Announce Type: new

Abstract: Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
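
The two properties the abstract relies on, shift invariance and lightly used low-frequency bands, can be illustrated with standard one-dimensional RoPE. The following is a minimal NumPy sketch of generic RoPE, not the ReRoPE method itself; the channel count (64), the frequency base (10000), the sequence length (32), and all function names are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE frequencies: one per channel pair, ordered high to low."""
    return base ** (-np.arange(0, dim, 2) / dim)

def apply_rope(x: np.ndarray, pos: float, freqs: np.ndarray) -> np.ndarray:
    """Rotate each consecutive channel pair of x by the angle pos * freq."""
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
dim = 64  # assumed head dimension, for illustration only
freqs = rope_frequencies(dim)
q, k = rng.standard_normal(dim), rng.standard_normal(dim)

# Shift invariance: the attention score depends only on the offset (here 4),
# not on the absolute positions of the query and key.
score_a = apply_rope(q, 3, freqs) @ apply_rope(k, 7, freqs)
score_b = apply_rope(q, 103, freqs) @ apply_rope(k, 107, freqs)

# Total rotation (in radians) each band accumulates over a short sequence.
# The lowest-frequency bands barely rotate, so they carry little positional signal.
seq_len = 32  # assumed sequence length, for illustration only
max_angle = seq_len * freqs

print("scores match:", np.isclose(score_a, score_b))
print("highest-frequency band rotates by", max_angle[0], "rad")
print("lowest-frequency band rotates by", max_angle[-1], "rad")
```

An encoding anchored to a fixed reference frame lacks the first property, which is the drift problem the abstract describes. The second observation gives a rough sense of what underused low-frequency bands look like in plain RoPE.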