arxiv_cs_cv February 10, 2026

Generating Metamers That Match Human Scene Understanding

Generating metamers of human scene understanding

Translated: 2026/3/15 16:07:12

metamer-gen, latent-diffusion, human-vision, dino-v2, perceptual-alignment

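Translation

arXiv:2601.11675v2 Announce Type: replace Abstract: Human vision builds a unified understanding of a visual scene by combining low-resolution "gist" information from the visual periphery with high-resolution but sparse information from fixated locations. This paper introduces MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines scene gist obtained from the periphery with information obtained from fixations during scene viewing, and it generates metamers (images that look different but whose perceived content is the same) of what people understand after viewing a scene. Generating images from both high-resolution and low-resolution (that is, foveated) inputs poses a novel image-to-image synthesis problem. To solve it, the authors introduce a dual-stream representation of foveated scenes, built from DINOv2 tokens, that fuses detailed features from fixated areas with degraded peripheral features that capture scene context. To evaluate how well MetamerGen-generated images align perceptually with latent human scene representations, the authors ran a behavioral experiment in which participants judged whether a generated image and the original image were the "same" or "different". This identified scene generations that are in fact metamers for the viewers' latent scene representations. MetamerGen is a powerful tool for studying scene understanding. The proof-of-concept analyses revealed specific features, at multiple levels of visual processing, that contributed to human judgments. Although metamers can be generated even when conditioning on random fixations, high-level semantic alignment was the strongest predictor of metamerism when the generated scenes were conditioned on the viewers' own fixated regions.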

Original Content

arXiv:2601.11675v2 Announce Type: replace Abstract: Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a "same" or "different" response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for understanding scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.
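
The abstract describes a dual-stream input: detailed features from fixated areas plus degraded features from the periphery. The sketch below illustrates only the input-preparation idea, under several assumptions that are not taken from the paper: fixations are given as (x, y) pixel coordinates, a patch counts as "fixated" when its centre lies within a fixed pixel radius of a fixation, and peripheral degradation is approximated by block averaging. The function names, `radius_px`, and `factor` are hypothetical. The patch size of 14 matches the DINOv2 ViT models, but no DINOv2 model is called here, and this is not the authors' implementation.

```python
import numpy as np

PATCH = 14  # DINOv2 ViT patch size; radius and blur factor below are assumptions


def fixated_patch_mask(height, width, fixations, radius_px, patch=PATCH):
    """Boolean (height//patch, width//patch) grid: True where a patch centre
    lies within radius_px of any fixation given as (x, y) pixel coordinates."""
    rows, cols = height // patch, width // patch
    cy = (np.arange(rows) + 0.5) * patch  # patch-centre y coordinates
    cx = (np.arange(cols) + 0.5) * patch  # patch-centre x coordinates
    mask = np.zeros((rows, cols), dtype=bool)
    for fx, fy in fixations:
        dist2 = (cy[:, None] - fy) ** 2 + (cx[None, :] - fx) ** 2
        mask |= dist2 <= radius_px ** 2
    return mask


def degrade_periphery(image, factor):
    """Crude stand-in for low-resolution peripheral input: average over
    factor x factor blocks, then repeat each block back to pixel size."""
    h, w, c = image.shape
    h2, w2 = h - h % factor, w - w % factor  # crop to a multiple of factor
    blocks = image[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, c)
    coarse = blocks.mean(axis=(1, 3))
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)


def two_stream_inputs(image, fixations, radius_px=28, factor=8):
    """Return (foveal, peripheral, mask).

    foveal keeps original pixels only inside fixated patches (zeros elsewhere);
    peripheral is the block-averaged version of the whole image."""
    h, w, _ = image.shape
    mask = fixated_patch_mask(h, w, fixations, radius_px)
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(mask, PATCH, axis=0), PATCH, axis=1)
    cropped = image[: pixel_mask.shape[0], : pixel_mask.shape[1]]
    foveal = cropped * pixel_mask[..., None]
    peripheral = degrade_periphery(image, factor)
    return foveal, peripheral, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3))  # placeholder image, values in [0, 1)
    foveal, peripheral, mask = two_stream_inputs(img, [(112, 112)])
    print(mask.shape, int(mask.sum()), peripheral.shape)
```

For a 224 x 224 image the patch grid is 16 x 16. In a fuller pipeline, each stream would presumably be passed through an image encoder and the resulting tokens combined before conditioning the diffusion model. How MetamerGen performs that fusion is not specified in the abstract, so it is left out of this sketch.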