arxiv_cs_cv April 24, 2026

Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

Translated: 2026/4/24 19:45:58

3d-human-mesh, visual-transformer, diffusion-model, occlusion-robustness, computer-vision

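Translated Abstract (from Japanese)

arXiv:2604.21712v1
Announce Type: new

Abstract: Recovering a 3D human mesh from a monocular RGB image aims to estimate an anatomically plausible 3D human model for downstream applications, but it remains difficult under partial or severe occlusion. Regression-based methods are efficient, yet in unconstrained scenarios they often produce implausible or inaccurate results. Diffusion-based methods provide strong generative priors for occluded regions, but over-reliance on generation can weaken their fidelity to rare poses. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To bridge the two pathways effectively, we design a diverse-consistent feature learning module that aligns discriminative features with generative priors, and a cross-attention multi-level fusion mechanism that enables bidirectional interaction across semantic levels. Experiments on standard benchmarks show that our method achieves superior performance on key metrics and strong robustness in complex real-world scenarios.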

Original Content

arXiv:2604.21712v1
Announce Type: new

Abstract: 3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.
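
The abstract names a cross-attention multi-level fusion mechanism with bidirectional interaction but gives no architectural details. The following is a minimal sketch of what generic bidirectional cross-attention over several feature levels can look like, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`cross_attend`, `fuse_level`, `fuse_multi_level`), the absence of learned query/key/value projections, the residual addition, and the requirement that both pathways share a feature dimension at each level. It uses only the Python standard library so that the arithmetic stays visible.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product cross-attention without learned projections (assumption).

    Each query token is replaced by a softmax-weighted average of the context tokens.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return attended


def add_tokens(tokens, updates):
    """Element-wise residual addition of two equally shaped token lists."""
    return [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, updates)]


def fuse_level(disc_tokens, gen_tokens):
    """Bidirectional exchange at one level (hypothetical design).

    Discriminative tokens attend to generative tokens and vice versa; both
    updates are computed from the unfused inputs, then added back residually.
    """
    disc_update = cross_attend(disc_tokens, gen_tokens)
    gen_update = cross_attend(gen_tokens, disc_tokens)
    return add_tokens(disc_tokens, disc_update), add_tokens(gen_tokens, gen_update)


def fuse_multi_level(disc_levels, gen_levels):
    """Apply the bidirectional exchange independently at every level."""
    return [fuse_level(d, g) for d, g in zip(disc_levels, gen_levels)]


# Toy input: two levels, feature dimension 2, different token counts per pathway.
disc_levels = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
gen_levels = [[[1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
fused = fuse_multi_level(disc_levels, gen_levels)
```

Each entry of `fused` is a pair of token lists, one per pathway, with the same shapes as the inputs. A practical model would more likely use learned projections and multiple heads; this sketch only shows the direction of information flow that "bidirectional" implies.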