arxiv_cs_cv February 10, 2026

EgoLife: Towards Egocentric Life Assistant

Translated: 2026/3/15 4:02:39

egocentric-ai, multimodal-learning, wearable-tech, long-context-qa, video-understanding

Original Content

arXiv:2503.03803v3
Announce Type: replace

Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities (including discussions, shopping, cooking, socializing, and entertainment) using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.