arxiv_cs_cv April 20, 2026

GenHSI: Controllable Generation of Human-Scene Interaction Videos

Translated: 2026/4/20 10:49:49

diffusion-models, video-generation, human-scene-interaction, computer-vision, ai-animation

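Japanese Translation (rendered in English)

arXiv:2506.19840v2
Announce Type: replace
Abstract: Large-scale pre-trained video diffusion models have shown remarkable capabilities across diverse video generation tasks. However, existing solutions face several challenges when generating long videos that involve human-scene interaction (HSI): their dynamics and affordances are unrealistic, they fail to preserve subject identity, and they require expensive training. In response, we propose GenHSI, a training-free method for the controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we divide video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene and a character with a user description, we use these three stages to produce long videos that preserve human identity and provide rich, plausible HSI. Script writing converts a complex text prompt containing a chain of HSI into simple atomic actions, which the pre-visualization stage uses to generate 3D keyframes. To synthesize plausible human interaction poses in the 3D keyframes, we use pre-trained 2D inpainting diffusion models that generate plausible 2D human interactions based on view canonicalization, which removes the need for the multi-view fitting used in previous work. We then extend these interactions to 3D using robust iterative optimization informed by contact cues and reasoning from VLMs. Prompted by these 3D keyframes, the pre-trained video diffusion models can better generate consistent long videos with plausible dynamics and affordance in a 3D-aware manner. We are the first to synthesize a long video sequence containing a chain of HSI actions, without training, from image references of the scene and character. Experiments show that our method can generate HSI videos from a single scene image that effectively preserve scene content and character identity and exhibit plausible human-scene interaction.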

Original Content

arXiv:2506.19840v2
Announce Type: replace
Abstract: Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in generating long videos with rich human-scene interactions (HSI), including unrealistic dynamics and affordance, lack of subject identity preservation, and the need for expensive training. To this end, we propose GenHSI, a training-free method for controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we subdivide the video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene and a character with a user description, we use these three stages to generate long videos that preserve human identity and provide rich and plausible HSI. Script writing converts a complex text prompt involving a chain of HSI into simple atomic actions that are used in the pre-visualization stage to generate 3D keyframes. To synthesize plausible human interaction poses in 3D keyframes, we utilize pre-trained 2D inpainting diffusion models to generate plausible 2D human interactions based on view canonicalization, which eliminates the need for multi-view fitting in previous works. We then extend these interactions to 3D using robust iterative optimization, informed by contact cues and reasoning from VLMs. Prompted by these 3D keyframes, the pretrained video diffusion models can better generate consistent long videos with plausible dynamics and affordance in a 3D-aware manner. We are the first to synthesize a long video sequence with a chain of HSI actions without training based on the image references of the scene and character. Experiments demonstrate that our method can generate HSI videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene.
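
The three-stage structure described in the abstract (script writing, pre-visualization, animation) can be pictured as a simple data flow. The following is only a minimal sketch of that flow, not the authors' implementation: every name in it (`AtomicAction`, `Keyframe3D`, `Segment`, `write_script`, `previsualize`, `animate`, `generate_hsi_video`) is hypothetical, the prompt splitting is a naive stand-in for whatever language-model reasoning the real system uses, and the pose and video outputs are string placeholders rather than real 3D or video data. Producing one video segment per keyframe is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class AtomicAction:
    """One simple human-scene interaction, e.g. 'sit on the sofa'."""
    text: str


@dataclass(frozen=True)
class Keyframe3D:
    """Placeholder for a 3D keyframe: the action it depicts plus a pose stub."""
    action: AtomicAction
    pose: str  # stand-in for real 3D pose parameters (assumption)


@dataclass(frozen=True)
class Segment:
    """Placeholder for a video segment conditioned on one keyframe."""
    keyframe: Keyframe3D


def write_script(prompt: str) -> list[AtomicAction]:
    """Stage 1 (script writing): split a chained prompt into atomic actions.

    Assumption: splitting on commas and the word 'then' is a toy substitute
    for the real decomposition step.
    """
    parts = re.split(r"\s*(?:,|\bthen\b)\s*", prompt)
    cleaned = (p.strip(" .") for p in parts)
    return [AtomicAction(p) for p in cleaned if p]


def previsualize(actions: list[AtomicAction], scene_image: str) -> list[Keyframe3D]:
    """Stage 2 (pre-visualization): one placeholder 3D keyframe per action."""
    return [Keyframe3D(a, pose=f"pose<{a.text} @ {scene_image}>") for a in actions]


def animate(keyframes: list[Keyframe3D]) -> list[Segment]:
    """Stage 3 (animation): one placeholder segment per keyframe (assumption)."""
    return [Segment(k) for k in keyframes]


def generate_hsi_video(prompt: str, scene_image: str) -> list[Segment]:
    """Chain the three stages in the order the abstract lists them."""
    actions = write_script(prompt)
    keyframes = previsualize(actions, scene_image)
    return animate(keyframes)


if __name__ == "__main__":
    demo = generate_hsi_video(
        "walk to the sofa, then sit on the sofa, then pick up the book.",
        "living_room.png",  # hypothetical file name
    )
    for segment in demo:
        print(segment.keyframe.action.text)
```

Keeping each stage as a separate function with typed inputs and outputs mirrors the abstract's point that the prompt is first reduced to atomic actions, which then drive keyframe generation, which in turn conditions the video model.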