arxiv_cs_cv April 24, 2026

LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

Translated: 2026/4/24 19:42:04

diffusion-models, style-transfer, face-editing, gan, computer-vision

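Japanese Translation (rendered in English)

arXiv:2604.21279v1 Announce Type: new
Abstract: Facial attribute editing and style manipulation are essential for applications such as virtual avatars and photo editing. However, precisely controlling facial attributes without changing other features is difficult because of the complexity of facial structure and the strong correlations between attributes. Conditional GANs have made progress but are constrained by accuracy problems and training instability. Diffusion models are promising but struggle with style manipulation because semantic directions have limited expressive power. This paper proposes LatRef-Diff, a new diffusion-based framework that addresses these limitations. It replaces the traditional semantic directions with style codes and proposes two ways to generate them: latent guidance and reference guidance. Based on these style codes, it designs a style modulation module that supports both random and customized manipulation. The module combines learnable vectors, a cross-attention mechanism, and a hierarchical design to improve accuracy and image quality. In addition, to remove the need for paired images (e.g., before and after editing) while improving training stability, it proposes a forward-backward consistency training strategy. This strategy first roughly removes the target attribute using image-specific semantic directions, then restores it through style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ show that LatRef-Diff achieves the best performance in both qualitative and quantitative evaluations. Ablation studies verify the effectiveness of the model's design choices.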

Original Content

arXiv:2604.21279v1 Announce Type: new
Abstract: Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model's design choices.
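
The abstract says the style modulation module combines a style code, learnable vectors, and cross-attention, but it does not give the architecture. The following is therefore only a rough illustration of how such a combination could look, using generic scaled dot-product cross-attention with a residual connection. It is not the authors' implementation. Every name here (`style_cross_attention`, `w_q`, `w_k`, `w_v`, the token counts and dimensions) is a hypothetical choice made for this sketch, and the hierarchical design mentioned in the abstract is not modeled.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def style_cross_attention(features, style_code, learnable_vectors, w_q, w_k, w_v):
    """Let image feature tokens attend to style tokens (hypothetical layout).

    features:          n tokens of size d (flattened image features)
    style_code:        one vector of size d
    learnable_vectors: m vectors of size d (assumed extra context tokens)
    """
    # Context = the style code followed by the learnable vectors.
    context = [style_code] + learnable_vectors
    q = matmul(features, w_q)  # queries come from the image features
    k = matmul(context, w_k)   # keys come from the style context
    v = matmul(context, w_v)   # values come from the style context
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    update = matmul(attn, v)
    # Residual connection: features are modulated, not replaced.
    out = [[f + u for f, u in zip(f_row, u_row)]
           for f_row, u_row in zip(features, update)]
    return out, attn


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, n_vectors, dim = 6, 3, 4
features = random_matrix(n_tokens, dim, rng)
style_code = random_matrix(1, dim, rng)[0]
learnable_vectors = random_matrix(n_vectors, dim, rng)
w_q = random_matrix(dim, dim, rng, 0.5)
w_k = random_matrix(dim, dim, rng, 0.5)
w_v = random_matrix(dim, dim, rng, 0.5)

out, attn = style_cross_attention(
    features, style_code, learnable_vectors, w_q, w_k, w_v
)
print(len(out), len(out[0]), len(attn[0]))
```

In this sketch each image token receives a weighted mix of the style tokens, so swapping `style_code` (for example, one sampled from a latent versus one encoded from a reference image) changes the modulation while the feature tensor keeps its shape.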