arxiv_cs_cv April 20, 2026

Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment

aesthetic-quality-assessment, visual-cognition, gaze-alignment, clip, arxiv

Original Content

arXiv:2604.15853v1 Announce Type: new Abstract: Automated Aesthetic Quality Assessment (AQA) treats images primarily as static pixel vectors, aligning predictions with human-rating scores largely through semantic perception. However, this paradigm diverges from human aesthetic cognition, which arises from dynamic visual exploration shaped by scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention. We introduce AestheticNet, a novel cognitive-inspired AQA paradigm that integrates human-like visual cognition and semantic perception with a two-pathway architecture. The visual attention pathway, implemented as a gaze-aligned visual encoder (GAVE) pre-trained offline on eye-tracking data using resource-efficient contrast gaze alignment, models attention from the human visual system. This pathway augments the semantic pathway, which uses a fixed semantic encoder such as CLIP, through cross-attention fusion. Visual attention provides a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all of which are determinants of aesthetic perception beyond semantics. Experiments validated by hypothesis testing show a consistent improvement over the semantic-only baselines, and demonstrate the gaze module as a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA. Our code is available at https://github.com/keepgallop/AestheticNet.
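
The abstract does not specify how the cross-attention fusion is wired, so the following is only a minimal NumPy sketch of one plausible reading: tokens from the semantic pathway act as queries over tokens from the gaze pathway, and the attended gaze context is added back residually. The function name `cross_attention_fuse`, the single-head design, the residual connection, the token counts, and the random stand-in features are all assumptions made for illustration. They are not taken from the AestheticNet code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fuse(semantic_tokens, gaze_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical fusion layout).

    semantic_tokens: (n_sem, d) features from a fixed semantic encoder.
    gaze_tokens:     (n_gaze, d) features from a gaze-aligned encoder.
    Returns the fused tokens (n_sem, d) and the attention weights
    (n_sem, n_gaze).
    """
    q = semantic_tokens @ w_q  # queries come from the semantic pathway
    k = gaze_tokens @ w_k      # keys come from the gaze pathway
    v = gaze_tokens @ w_v      # values come from the gaze pathway
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    fused = semantic_tokens + weights @ v    # residual: semantic + gaze context
    return fused, weights


# Random stand-ins for encoder outputs (assumed shapes, not real features).
rng = np.random.default_rng(0)
d = 8
semantic_tokens = rng.normal(size=(5, d))  # placeholder for CLIP-like tokens
gaze_tokens = rng.normal(size=(3, d))      # placeholder for GAVE-like tokens
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

fused, weights = cross_attention_fuse(semantic_tokens, gaze_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this layout the gaze pathway only contributes a correction term to the semantic tokens, which is one way a gaze module could be attached to different AQA backbones without changing the semantic encoder itself.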