arxiv_cs_cv February 10, 2026

Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency

Translated: 2026/2/11 13:35:59

Original Content

arXiv:2602.07016v1 Announce Type: new

Abstract: Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (Joint Embedding Predictive Architecture) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.
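
To make "isotropic Gaussian constraints on learned image embeddings" concrete, the sketch below shows one simple way such a constraint could be scored on a batch of embeddings. It is an illustration under stated assumptions, not the paper's method: the abstract does not specify the regularizer, so this uses a plain moment-matching penalty (batch mean near zero, covariance near identity), which checks only the first two moments rather than full Gaussianity. The function name `isotropic_gaussian_penalty` and the synthetic data are hypothetical.

```python
import numpy as np


def isotropic_gaussian_penalty(embeddings: np.ndarray) -> float:
    """Hypothetical moment-matching penalty (assumption, not the paper's loss).

    Returns the squared norm of the batch mean plus the squared Frobenius
    distance between the sample covariance and the identity matrix. The value
    is zero only when the batch has zero mean and identity covariance.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    n, d = z.shape
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / (n - 1)  # unbiased sample covariance, shape (d, d)
    mean_term = float(np.sum(mean ** 2))
    cov_term = float(np.sum((cov - np.eye(d)) ** 2))
    return mean_term + cov_term


# Synthetic stand-ins for image embeddings (not data from the paper).
rng = np.random.default_rng(0)
gaussian_like = rng.standard_normal((2000, 8))
# Rescale each dimension differently to break isotropy.
stretched = gaussian_like * np.linspace(0.2, 3.0, 8)

penalty_gaussian = isotropic_gaussian_penalty(gaussian_like)
penalty_stretched = isotropic_gaussian_penalty(stretched)
print(f"near-isotropic batch: {penalty_gaussian:.4f}")
print(f"anisotropic batch:    {penalty_stretched:.4f}")
```

In a training loop, a term of this kind would typically be added to the main objective with a weighting coefficient, so that embeddings used for scene clustering stay close to a well-conditioned, roughly spherical distribution.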