arxiv_cs_cv February 10, 2026

WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

visual-tokenization, vision-generation, image-compression, rfid-scor, machine-learning

Original Content

arXiv:2508.05599v3 Announce Type: replace

Abstract: The visual tokenizer is a critical component of visual generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization on each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoder (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. GD can therefore probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. On the ImageNet 50k validation set, in a high-fidelity setting, WeTok achieves a record-low zero-shot rFID of 0.12, outperforming leading continuous tokenizers such as FLUX-VAE (0.18) and SD-VAE 3.5 (0.19) at 400% of their compression ratio. Furthermore, in a high-compression regime, WeTok achieves a zero-shot rFID of 3.49 at a 768$\times$ compression ratio, substantially surpassing Cosmos, which scores 4.57 at only 50% of our compression ratio. Code and models are available: https://github.com/zhuangshaobin/WeTok.
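
The group-wise lookup-free quantization described in (1) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common sign-based form of lookup-free quantization, in which each latent channel is binarized to -1 or +1 and the token index is read off the resulting bits. The function names, the 8-channel latent, and the choice of 2 groups are hypothetical, and training details such as gradient estimation and any entropy regularization are left out.

```python
from typing import List, Tuple


def lookup_free_quantize(group: List[float]) -> Tuple[List[int], int]:
    """Binarize one group of latent channels by sign and derive its token index.

    Assumption: sign-based lookup-free quantization. Each channel becomes
    -1 or +1, and the index is the integer whose set bits mark the positive
    channels, so no codebook table is stored or searched.
    """
    codes = [1 if value > 0 else -1 for value in group]
    index = sum(1 << bit for bit, code in enumerate(codes) if code > 0)
    return codes, index


def groupwise_lfq(latent: List[float], num_groups: int) -> Tuple[List[int], List[int]]:
    """Split a latent vector into equal groups and quantize each one independently."""
    if len(latent) % num_groups != 0:
        raise ValueError("latent length must be divisible by num_groups")
    size = len(latent) // num_groups
    codes: List[int] = []
    indices: List[int] = []
    for g in range(num_groups):
        group_codes, group_index = lookup_free_quantize(latent[g * size:(g + 1) * size])
        codes.extend(group_codes)
        indices.append(group_index)
    return codes, indices


# Hypothetical 8-channel latent at one spatial position, split into 2 groups of 4.
latent = [0.3, -1.2, 0.7, 0.1, -0.4, -0.9, 2.0, -0.1]
codes, indices = groupwise_lfq(latent, num_groups=2)
print(codes)    # [1, -1, 1, 1, -1, -1, 1, -1]
print(indices)  # [13, 4]
```

In this toy setting each group indexes only 2^4 = 16 codes, while the pair of indices jointly distinguishes 16^2 = 256 combinations. Under the assumptions above, this is one way to read the abstract's claim: the effective codebook grows with the number of groups, while any per-codebook computation stays at the size of a single group.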