arxiv_cs_cv April 24, 2026

Linear Image Generation by Synthesizing Exposure Brackets

Translated: 2026/4/24 19:40:31

linear-image-generation, exposure-brackets, image-processing, diffusion-models, text-to-image

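Translation (from Japanese)

arXiv:2604.21008v1 Announce Type: new
Abstract: A photo begins when photons strike the sensor; the resulting signals pass through an image signal processing (ISP) pipeline and are output as a display-referred image. Such images are no longer faithful to the incident light: their dynamic range is compressed and they are stylized according to subjective preferences. RAW images, by contrast, record the direct sensor signal before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Because image sensors offer better dynamic range and bit depth, linear images carry richer information than display-referred images and leave more room for editing in post-processing. Current generative models, however, mainly synthesize display-referred images, which inherently limits downstream editing. This paper addresses the task of text-to-linear-image generation: synthesizing, conditioned on a text prompt, a high-quality scene-referred linear image that preserves the full dynamic range for professional post-processing. Linear image generation is difficult because the VAEs pre-trained for latent diffusion models struggle to preserve extreme highlights and shadows simultaneously, owing to the higher dynamic range and bit depth. To address this, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We also demonstrate downstream applications such as text-guided linear image editing and structure-conditioned generation via ControlNet.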
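
To make the distinction between scene-referred and display-referred values concrete, the Python sketch below clips a few linear values to [0, 1], applies the standard sRGB transfer function, and quantizes to 8 bits. It is a deliberate simplification and not the ISP pipeline the abstract refers to: real pipelines also apply tone mapping and stylization. The function names and the sample irradiance values are hypothetical, chosen only for illustration.

```python
def srgb_encode(x: float) -> float:
    """Standard sRGB transfer function for a linear value in [0, 1]."""
    if x <= 0.0031308:
        return 12.92 * x
    return 1.055 * x ** (1 / 2.4) - 0.055


def to_display_referred(linear: float, bits: int = 8) -> int:
    """Clip a linear value to [0, 1], encode it, and quantize to `bits` bits."""
    clipped = min(max(linear, 0.0), 1.0)
    levels = (1 << bits) - 1
    return round(srgb_encode(clipped) * levels)


# Hypothetical relative irradiance values; 1.0 stands for display white.
scene = [0.00001, 0.00002, 0.18, 1.0, 4.0, 16.0]
codes = [to_display_referred(v) for v in scene]

for value, code in zip(scene, codes):
    print(f"linear {value:>8} -> 8-bit code {code}")
```

In this toy setup, the values 1.0, 4.0, and 16.0 all land on the same maximum code, and the two darkest values both quantize to zero, so differences that exist in the linear signal are no longer recoverable from the encoded output. That is the kind of information loss that motivates generating linear images directly.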

Original Content

arXiv:2604.21008v1 Announce Type: new
Abstract: The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.
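
The idea of representing a linear image as a sequence of exposure brackets can be sketched in a few lines of Python. The abstract does not specify how the brackets are defined or merged, so everything below is an assumption for illustration: each bracket is taken to be the linear signal scaled by a power of two (an EV offset) and clipped to [0, 1], and merging averages the well-exposed samples per pixel. The function names, the EV offsets, and the thresholds `lo` and `hi` are hypothetical.

```python
def make_brackets(linear, evs):
    """Return one clipped exposure per EV offset: clip(value * 2**ev, 0, 1)."""
    return [[min(max(v * 2.0 ** ev, 0.0), 1.0) for v in linear] for ev in evs]


def merge_brackets(brackets, evs, lo=0.05, hi=0.95):
    """Estimate linear values by averaging the well-exposed samples per pixel."""
    merged = []
    for i in range(len(brackets[0])):
        # Undo each bracket's exposure scale, keeping only samples that are
        # neither near-black nor near-saturated (assumed thresholds lo, hi).
        estimates = [
            b[i] / 2.0 ** ev
            for b, ev in zip(brackets, evs)
            if lo <= b[i] <= hi
        ]
        if not estimates:
            # No well-exposed sample: fall back to the one closest to mid-gray.
            j = min(range(len(evs)), key=lambda k: abs(brackets[k][i] - 0.5))
            estimates = [brackets[j][i] / 2.0 ** evs[j]]
        merged.append(sum(estimates) / len(estimates))
    return merged


# Hypothetical linear pixel values spanning deep shadow to bright highlight.
linear = [0.001, 0.18, 1.0, 12.0]
evs = [-4, -2, 0, 2, 4]

brackets = make_brackets(linear, evs)
merged = merge_brackets(brackets, evs)
```

Each individual bracket stays within [0, 1], which is the range a VAE pre-trained on display-referred images is used to, while the set of brackets together covers the wider range of the linear signal. In this example a single exposure at EV 0 would clip the value 12.0 to 1.0, whereas the EV -4 bracket still holds it unclipped.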