arxiv_cs_cv April 24, 2026

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

Translated: 2026/4/24 19:52:39

multimodal-large-language-models, svg-generation, visual-self-feedback, render-in-the-loop, text-to-svg

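Translation (rendered from the Japanese)

arXiv:2604.20730v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) have shown promise in generating Scalable Vector Graphics (SVG) through direct code synthesis. Existing paradigms, however, tend to adopt an open-loop "blind drawing" approach, in which the model emits symbolic code sequences without perceiving any intermediate visual result. This leaves the strong visual priors embedded in the MLLM's vision encoder badly underused, and treats SVG generation as a fragmented text-sequence modeling task rather than an integrated visuo-spatial one. As a result, models have difficulty reasoning about partial canvas states and about implicit occlusion relationships, which are explicit visually but ambiguous in text. To close this gap, we propose Render-in-the-Loop, a new generation paradigm that recasts SVG synthesis as a step-wise, visual-context-aware process. Rendering intermediate code states onto a cumulative canvas lets the model observe the evolving visual context at every step and use that on-the-fly feedback to guide what it generates next. We show, however, that applying this visual loop naively to off-the-shelf models is suboptimal, because they cannot exploit incremental visual-code mappings. To address this, we first use fine-grained path decomposition to build dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy that conditions generation of the next primitive on intermediate visual states. We further propose a Render-and-Verify (RaV) inference mechanism that filters out degenerate and redundant primitives. Instantiated on a multimodal foundation model, our framework outperforms strong open-weight baselines on the standard MMSVGBench. This result indicates that the Render-in-the-Loop paradigm is data-efficient and generalizes well on both Text-to-SVG and Image-to-SVG tasks.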

Original Content

arXiv:2604.20730v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
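
The closed loop described in the abstract (render the accepted code, observe the cumulative canvas, propose the next primitive, then verify it) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: `Rect` stands in for an SVG primitive, `render` is a tiny grid rasterizer rather than a real SVG renderer, `scripted_proposer` replaces the MLLM, and the acceptance rule in `verify` (reject zero-area primitives and primitives that change no pixel) is only one plausible reading of "degenerate and redundant"; the abstract does not state the actual RaV criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Canvas = List[List[int]]  # toy raster: 0 = empty, any other value = a color id


@dataclass(frozen=True)
class Rect:
    """Hypothetical stand-in for an SVG primitive (real SVG paths are far richer)."""
    x: int
    y: int
    w: int
    h: int
    color: int


def render(prims: List[Rect], size: int) -> Canvas:
    """Paint primitives in order on a size x size grid; later ones occlude earlier ones."""
    canvas = [[0] * size for _ in range(size)]
    for p in prims:
        for row in range(max(p.y, 0), min(p.y + p.h, size)):
            for col in range(max(p.x, 0), min(p.x + p.w, size)):
                canvas[row][col] = p.color
    return canvas


def verify(before: Canvas, after: Canvas, prim: Rect) -> bool:
    """Assumed acceptance rule, not taken from the paper."""
    if prim.w <= 0 or prim.h <= 0:
        return False          # degenerate: zero area
    return before != after    # redundant if the canvas does not change


def render_in_the_loop(
    propose: Callable[[Canvas, int], Optional[Rect]],
    size: int,
    max_steps: int,
) -> List[Rect]:
    """Generate primitives one at a time, re-rendering the canvas after each accepted step."""
    accepted: List[Rect] = []
    canvas = render(accepted, size)
    for step in range(max_steps):
        prim = propose(canvas, step)   # a real system would condition an MLLM on the canvas
        if prim is None:               # proposer signals that the drawing is finished
            break
        candidate = render(accepted + [prim], size)
        if verify(canvas, candidate, prim):
            accepted.append(prim)
            canvas = candidate
    return accepted


def scripted_proposer(canvas: Canvas, step: int) -> Optional[Rect]:
    """Fixed script standing in for the model; it ignores the canvas it is given."""
    script = [
        Rect(0, 0, 4, 4, 1),  # background
        Rect(1, 1, 2, 2, 2),  # foreground square, occludes part of the background
        Rect(1, 1, 2, 2, 2),  # redundant: paints pixels that already have this color
        Rect(3, 3, 0, 5, 3),  # degenerate: zero width
    ]
    return script[step] if step < len(script) else None


accepted = render_in_the_loop(scripted_proposer, size=4, max_steps=10)
```

In this sketch the verification step compares the canvas before and after a candidate primitive, so a primitive that is fully hidden or that repaints identical pixels is dropped without any inspection of the code text. That is the sense in which occlusion is "visually explicit but textually ambiguous": the check is trivial on the raster and awkward on the code.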