arxiv_cs_cv April 24, 2026

StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Translated: 2026/4/24 19:40:50

variational-autoencoder, autoregressive-modeling, image-stylization, transfer-learning, reinforcement-learning

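Japanese Translation (rendered in English)

arXiv:2604.21052v1 Announce Type: new Abstract: Building on the Visual Autoregressive Modeling (VAR) framework, we formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE. A transformer, conditioned on style and content tokens, then autoregressively models the distribution of target tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the target representation attends to its own history while style and content features, acting as queries, determine which parts of that history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, guiding the synthesized representation to match both the content structure and the style texture without breaking VAR's autoregressive continuity. We train StyleVAR in two stages: supervised fine-tuning on a large triplet dataset (content, style, and target images), followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward. Per-action normalization weighting rebalances credit across VAR's multi-scale hierarchy. On three benchmarks covering in-distribution, near-distribution, and out-of-distribution settings, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further improvement over the SFT checkpoint, most notably on the perceptual metrics aligned with the reward. Qualitatively, the method transfers texture while preserving semantic structure and performs especially well on landscapes and architectural scenes. However, a generalization gap on internet images and difficulty with human faces indicate a need for greater content diversity and stronger structural priors.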
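The blended cross-attention described in the abstract can be illustrated with a small sketch. This is one plausible reading, not the paper's implementation: the target history supplies keys and values, style and content features each act as queries, and the two read-outs are mixed by a per-scale coefficient. The function names, the convex-combination form, and the linear `alpha_schedule` (coarse scales favoring content, fine scales favoring style) are all assumptions made for illustration. Plain Python lists keep the example dependency-free; a real model would use a tensor library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, history):
    """Scaled dot-product attention; `history` supplies both keys and values."""
    d = len(history[0])
    out = []
    for q in queries:
        scores = [sum(qi * hi for qi, hi in zip(q, h)) / math.sqrt(d)
                  for h in history]
        weights = softmax(scores)
        out.append([sum(w * h[j] for w, h in zip(weights, history))
                    for j in range(d)])
    return out

def blended_cross_attention(style_q, content_q, history, alpha):
    """Mix style-driven and content-driven read-outs of the target history.

    `alpha` is a hypothetical blending coefficient for the current scale:
    alpha = 1 uses only the style queries, alpha = 0 only the content queries.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    s_out = attend(style_q, history)
    c_out = attend(content_q, history)
    return [[alpha * s + (1 - alpha) * c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(s_out, c_out)]

def alpha_schedule(num_scales, start=0.2, end=0.8):
    """Assumed linear schedule from `start` (coarsest) to `end` (finest)."""
    if num_scales == 1:
        return [start]
    step = (end - start) / (num_scales - 1)
    return [start + k * step for k in range(num_scales)]
```

Because both branches read from the same history, the blend only changes which parts of that history are emphasized, which is consistent with the abstract's claim that the autoregressive continuity of VAR is preserved.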

Original Content

arXiv:2604.21052v1 Announce Type: new Abstract: We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content-style-target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
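The reinforcement stage can be sketched in the same spirit. GRPO standardizes rewards within a group of samples drawn for the same input, so no learned value function is needed. The abstract does not give the formula for its per-action normalization weighting, so the version below is an assumption: each token in scale k is weighted by 1 / (K * n_k), giving each of the K scales equal total weight regardless of how many tokens it contains. The rewards are hypothetical floats standing in for the DreamSim-based reward, and the clipped probability ratio and KL penalty of a full GRPO objective are omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_scale_weights(tokens_per_scale):
    """Assumed weighting: each token in scale k gets weight 1 / (K * n_k),
    so every scale carries total weight 1 / K whatever its token count."""
    num_scales = len(tokens_per_scale)
    return [1.0 / (num_scales * n) for n in tokens_per_scale]

def weighted_policy_objective(logprobs_per_scale, advantage, tokens_per_scale):
    """Advantage-weighted log-likelihood of one sample, rebalanced by scale.

    logprobs_per_scale[k] holds the log-probabilities of the tokens that the
    policy emitted at scale k.
    """
    weights = per_scale_weights(tokens_per_scale)
    total = 0.0
    for w, logprobs in zip(weights, logprobs_per_scale):
        total += w * sum(logprobs)
    return advantage * total
```

Without such a weighting, a scale with 256 tokens would contribute 256 times as many terms to the objective as the single-token coarsest scale, so the fine scales would dominate the gradient. That imbalance is the credit-assignment issue the abstract says the weighting is meant to address.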