arxiv_cs_cv April 20, 2026

EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

Translated: 2026/4/20 10:48:53

event-based-action-recognition, event-crab, spatiotemporal-features, arxiv-2411.18328, action-recognition

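Translation (from the Japanese)

arXiv:2411.18328v2 Announce Type: replace

Abstract: Compared with conventional action recognition, event-based action recognition (EAR) offers high-temporal-resolution capture and privacy preservation. Current state-of-the-art EAR solutions generally follow one of two regimes: they either project unstructured event streams into dense, structured event frames and apply powerful frame-specific networks, or they apply lightweight point-specific networks directly to sparse, unstructured event points. Both regimes, however, overlook a fundamental issue: neither accommodates the temporally dense and spatially sparse nature that is unique to asynchronous event data. This article presents EventCrab, a synergy-aware framework that takes these properties into account. EventCrab integrates "lighter" frame-specific networks for dense event frames with "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. It also establishes a joint frame-text-point representation space that bridges the distinct event frames and event points. Specifically, to better exploit the spatiotemporal relationships inherent in asynchronous event points, two strategies are designed for the "heavier" point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams; and ii) an Event Point Encoder (EPE) that further explores the long spatiotemporal features of event points in a Hilbert-scan manner. Experiments on four datasets demonstrate the strong performance of the proposed EventCrab, with improvements of 5.17% on SeAct and 7.01% on HARDVS in particular.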
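The two regimes named in the abstract start from different views of the same event stream. The following is a minimal Python sketch of those two views, not the paper's preprocessing. The event layout `(x, y, timestamp, polarity)`, the function names `events_to_frame` and `events_to_points`, and the even-stride subsampling are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp_seconds, polarity), polarity in {+1, -1}.
Event = Tuple[int, int, float, int]


def events_to_frame(events: List[Event], width: int, height: int) -> List[List[int]]:
    """Dense view: sum signed polarities per pixel into a height x width grid.

    Timestamps inside the window are discarded, which is the temporal
    information a frame-based pipeline gives up.
    """
    frame = [[0] * width for _ in range(height)]
    for x, y, _t, polarity in events:
        frame[y][x] += 1 if polarity > 0 else -1
    return frame


def events_to_points(events: List[Event], max_points: int) -> List[Event]:
    """Sparse view: time-ordered events, subsampled at an even stride.

    Every kept point retains its timestamp, but only a few pixels are covered.
    max_points must be positive.
    """
    ordered = sorted(events, key=lambda e: e[2])
    if len(ordered) <= max_points:
        return ordered
    step = len(ordered) / max_points
    return [ordered[int(i * step)] for i in range(max_points)]


if __name__ == "__main__":
    stream = [(0, 0, 0.3, 1), (1, 0, 0.1, -1), (0, 0, 0.2, 1), (2, 1, 0.4, 1)]
    print(events_to_frame(stream, width=3, height=2))
    print(events_to_points(stream, max_points=2))
```

The frame has one cell per pixel regardless of how many events fired, while the point list grows with the event count and keeps timing. This is the dense-versus-sparse contrast the abstract refers to.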

Original Content

arXiv:2411.18328v2 Announce Type: replace

Abstract: Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse unconstructed event points directly. However, such two regimes are blind to a fundamental issue: failing to accommodate the unique dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, i.e., EventCrab, that adeptly integrates the "lighter" frame-specific networks for dense event frames with the "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. In specific, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the "heavier" point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams. ii) an Event Point Encoder (EPE) that further explores event-point long spatiotemporal features in a Hilbert-scan way. Experiments on four datasets demonstrate the significant performance of our proposed EventCrab, particularly gaining improvements of 5.17% on SeAct and 7.01% on HARDVS.
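The abstract does not say how the Hilbert scan in the EPE is implemented. As a rough illustration of the general idea only, the sketch below orders event points along a 2D Hilbert curve over pixel coordinates, using the timestamp as a tie-breaker. The event layout `(x, y, timestamp, polarity)`, the names `hilbert_index` and `hilbert_scan`, and the choice of a 2D curve over `(x, y)` are assumptions; the paper's encoder may serialize points differently.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp_seconds, polarity).
Event = Tuple[int, int, float, int]


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_scan(events: List[Event], n: int) -> List[Event]:
    """Serialize events by Hilbert position of their pixel, then by timestamp."""
    return sorted(events, key=lambda e: (hilbert_index(n, e[0], e[1]), e[2]))


if __name__ == "__main__":
    demo = [(3, 0, 0.10, 1), (0, 0, 0.30, -1), (1, 1, 0.20, 1), (0, 0, 0.05, 1)]
    for event in hilbert_scan(demo, n=4):
        print(event)
```

Consecutive positions on a Hilbert curve are always neighbouring pixels, so a sequence model reading the serialized points sees spatially close events next to each other. Raster order does not have this property: it jumps across the image at the end of every row.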