arxiv_cs_cv April 24, 2026

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Translated: 2026/4/24 19:46:30

reshooting, self-supervised, video-processing, monocular-estimation, diffusion-transformer

Original Content

arXiv:2604.21776v1 Announce Type: new

Abstract: Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
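The triplet construction described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`smooth_random_walk`, `crop_video`, `forward_warp`, `make_triplet`), the step size, the moving-average smoothing, the nearest-neighbor splatting, and the choice to express the anchor in target-crop coordinates are all assumptions made for illustration. The dense tracking field is taken as a given input; the abstract does not say how it is computed.

```python
import numpy as np


def smooth_random_walk(num_frames, max_offset, step_std=2.0, kernel=9, rng=None):
    """Smoothed 2D random walk of integer (y, x) crop offsets, one per frame."""
    rng = np.random.default_rng() if rng is None else rng
    max_offset = np.asarray(max_offset, dtype=float)
    walk = np.cumsum(rng.normal(0.0, step_std, size=(num_frames, 2)), axis=0)
    # Moving average (assumes an odd kernel); edge padding keeps the length.
    pad = kernel // 2
    padded = np.pad(walk, ((pad, pad), (0, 0)), mode="edge")
    window = np.ones(kernel) / kernel
    smooth = np.stack(
        [np.convolve(padded[:, d], window, mode="valid") for d in range(2)], axis=1
    )
    # Random start position, then clip so every crop stays inside the frame.
    start = rng.uniform(0.0, 1.0, size=2) * max_offset
    offsets = np.clip(start + smooth - smooth[0], 0.0, max_offset)
    return np.rint(offsets).astype(int)


def crop_video(video, offsets, crop_hw):
    """Crop each frame of a (T, H, W, C) video at its own (y, x) offset."""
    h, w = crop_hw
    return np.stack([f[y:y + h, x:x + w] for f, (y, x) in zip(video, offsets)])


def forward_warp(frame, positions):
    """Splat each pixel of `frame` to its rounded (y, x) entry in `positions`.

    Unhit output pixels stay zero (holes); collisions are resolved arbitrarily.
    Returns the warped image and a boolean mask of the pixels that were hit.
    """
    h, w = frame.shape[:2]
    ys = np.rint(positions[..., 0]).astype(int)
    xs = np.rint(positions[..., 1]).astype(int)
    valid = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    out = np.zeros_like(frame)
    hit = np.zeros((h, w), dtype=bool)
    out[ys[valid], xs[valid]] = frame[valid]
    hit[ys[valid], xs[valid]] = True
    return out, hit


def make_triplet(video, tracks, crop_hw, rng=None):
    """Build (source, anchor, target, mask) from one monocular video.

    `tracks[t, i, j]` is the assumed full-frame (y, x) position at time t of
    the pixel that sits at (i, j) in frame 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames, height, width = video.shape[:3]
    h, w = crop_hw
    max_offset = (height - h, width - w)
    src_off = smooth_random_walk(num_frames, max_offset, rng=rng)
    tgt_off = smooth_random_walk(num_frames, max_offset, rng=rng)
    source = crop_video(video, src_off, crop_hw)
    target = crop_video(video, tgt_off, crop_hw)
    y0, x0 = src_off[0]
    anchors, masks = [], []
    for t in range(num_frames):
        # Tracks of the first source crop, shifted into target-crop coordinates.
        pos = tracks[t, y0:y0 + h, x0:x0 + w] - tgt_off[t]
        anchor, hit = forward_warp(source[0], pos)
        anchors.append(anchor)
        masks.append(hit)
    return source, np.stack(anchors), target, np.stack(masks)
```

For a perfectly static scene with identity tracks, the filled pixels of the anchor coincide with the target crop, and the holes mark regions the first source frame never saw. Those holes are the kind of gap the abstract says the model must fill from other times and viewpoints of the source video.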