arxiv_cs_cv February 10, 2026

From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding

google-slides, infographic, vision-language-model, image-processing, text-transcription


Original Content

arXiv:2602.07645v1 Announce Type: new

Abstract: Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe \textsc{Images2Slides}, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, \textsc{Images2Slides} achieves an overall element recovery rate of $0.989\pm0.057$ (text: $0.985\pm0.083$, images: $1.000\pm0.000$), with mean text transcription error $\mathrm{CER}=0.033\pm0.149$ and mean layout fidelity $\mathrm{IoU}=0.364\pm0.161$ for text regions and $0.644\pm0.131$ for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
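The abstract names three mechanical steps that are easy to picture in code: mapping a pixel-space region into slide coordinates, turning that region into batch update requests, and scoring the result with IoU and CER. The Python sketch below illustrates them under stated assumptions and is not the authors' implementation. The region dictionary (`id`, `bbox`, `text`), the function names, and the fit-and-centre scaling rule are hypothetical. CER is assumed to be Levenshtein distance divided by reference length, which is the common definition but is not confirmed by the abstract. The `createShape` and `insertText` request shapes follow the public Google Slides API, and the page size is the default 16:9 size of 10 in by 5.625 in.

```python
from typing import Dict, List, Tuple

# Default 16:9 Google Slides page in EMU (914400 EMU per inch).
SLIDE_W_EMU = 9_144_000
SLIDE_H_EMU = 5_143_500

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def px_box_to_emu(box: Box, img_w: int, img_h: int,
                  slide_w: int = SLIDE_W_EMU, slide_h: int = SLIDE_H_EMU) -> Box:
    """Map a pixel box to EMU with one uniform scale, centring the image.

    Assumption: the image is fitted inside the slide with its aspect ratio kept.
    """
    scale = min(slide_w / img_w, slide_h / img_h)
    off_x = (slide_w - img_w * scale) / 2
    off_y = (slide_h - img_h * scale) / 2
    x, y, w, h = box
    return (off_x + x * scale, off_y + y * scale, w * scale, h * scale)


def text_region_to_requests(region: Dict, page_id: str,
                            img_w: int, img_h: int) -> List[Dict]:
    """Build createShape + insertText requests for one text region.

    `region` uses a hypothetical schema: {"id": str, "bbox": [x, y, w, h], "text": str}.
    Slides object IDs must be 5 to 50 characters long.
    """
    x, y, w, h = px_box_to_emu(tuple(region["bbox"]), img_w, img_h)
    return [
        {"createShape": {
            "objectId": region["id"],
            "shapeType": "TEXT_BOX",
            "elementProperties": {
                "pageObjectId": page_id,
                "size": {"width": {"magnitude": w, "unit": "EMU"},
                         "height": {"magnitude": h, "unit": "EMU"}},
                "transform": {"scaleX": 1, "scaleY": 1,
                              "translateX": x, "translateY": y, "unit": "EMU"},
            }}},
        {"insertText": {"objectId": region["id"], "text": region["text"]}},
    ]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length (assumed definition)."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, c in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != c)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(reference), 1)
```

With the google-api-python-client package, a list built this way would be sent as `body={"requests": requests}` to `presentations().batchUpdate`. Image regions would need a `createImage` request with a reachable image URL, which this sketch leaves out.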