arxiv_cs_cv February 10, 2026

InternSVG: マルチモーダル大規模言語モデルによる統合的な SVG タスクに向けて

InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Translated: 2026/3/15 14:47:02

internsvg, mllm, svg, computer-vision, multi-modal

Japanese Translation

arXiv:2510.11341v4 Announce Type: replace Abstract: 汎用的な SVG モデリングは、データセットの断片化、タスク間での手法の転移性の低さ、構造的複雑性の扱いの難しさのため、依然として困難な課題です。これに対し、私たちはマルチモーダル大規模言語モデル（MLLM）の高い転移能力と汎化能力を活用し、SVG の理解・編集・生成を統合的にモデリングします。本研究では、データ・ベンチマーク・モデルを統合したスイートである InternSVG ファミリを提示します。その中核となるのが SAgoge であり、SVG タスク向けとして最大かつ最も包括的なマルチモーダルデータセットで、静的グラフィックと動的アニメーションの両方を網羅しています。SAgoge はアイコン、長系列のイラスト、科学図表、動的アニメーションをカバーし、さまざまな難易度のタスクをサポートするとともに、従来のデータセットと比べてより深い階層構造とより豊富な属性を提供します。このリソースに基づき、付随するベンチマークとして SArena を導入します。SArena は包括的なタスク定義と標準化された評価を備え、SAgoge がカバーするドメインおよび難易度の範囲に対応しています。これらの基盤の上に、SVG の理解・編集・生成のための統合 MLLM である InternSVG を提案します。InternSVG は、SVG 固有の特殊トークン、サブワードベースの埋め込み初期化、そして短い静的 SVG から長系列のイラストや複雑なアニメーションへと進む 2 段階の学習戦略を備えています。この統合的な定式化は正の転移をもたらし、全体的な性能を向上させます。SArena および既存のベンチマークでの実験により、InternSVG が大幅な性能向上を達成し、主要なオープンモデルおよびプロプライエタリモデルを一貫して上回ることが確認されました。

Original Content

arXiv:2510.11341v4 Announce Type: replace Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
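
The abstract names "SVG-specific special tokens" with "subword-based embedding initialization" but does not say how the initialization is computed. The sketch below illustrates one common reading, offered only as an assumption: a new special token such as `<lineargradient>` starts from the mean of the embeddings of the subword pieces its name would otherwise be split into. The toy vocabulary, the greedy tokenizer, and the function names `greedy_subwords` and `init_special_token` are all hypothetical and are not taken from the InternSVG code.

```python
import random


def greedy_subwords(text, vocab):
    """Split text into the longest matching vocabulary pieces, left to right (toy tokenizer)."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return pieces


def init_special_token(name, vocab, embeddings):
    """Append `name` to the vocabulary, initialized to the mean of its subword embeddings.

    Assumption: mean pooling is used here for illustration; the paper may differ.
    """
    pieces = greedy_subwords(name.strip("<>"), vocab)
    dim = len(embeddings[0])
    mean = [
        sum(embeddings[vocab[p]][d] for p in pieces) / len(pieces)
        for d in range(dim)
    ]
    vocab[name] = len(embeddings)  # new token id is the next free row
    embeddings.append(mean)
    return pieces


# Toy setup: 7 existing subwords with random 4-dimensional embeddings.
random.seed(0)
vocab = {p: i for i, p in enumerate(["lin", "ear", "grad", "ient", "path", "rect", "g"])}
embeddings = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in vocab]

pieces = init_special_token("<lineargradient>", vocab, embeddings)
print(pieces)                      # ['lin', 'ear', 'grad', 'ient']
print(vocab["<lineargradient>"])   # 7
```

Compared with random initialization, starting a new token near the subwords it replaces means the model's first forward passes see an input close to what it already knows, which is the usual motivation for this kind of scheme.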