arxiv_cs_cv April 24, 2026

Seeing Fast and Slow: Learning the Flow of Time in Videos

computer-vision, video-processing, temporal-reasoning, machine-learning, self-supervised-learning

Original Content

arXiv:2604.21931v1 Announce Type: new

Abstract: How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.
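The abstract does not spell out how the self-supervised speed signal is built. One common way to obtain playback-speed labels without annotation is to subsample frames at a known stride and ask a model to predict that stride. The sketch below illustrates only that label-construction step; it is not the authors' pipeline. The names `SPEEDS`, `resample_clip`, and `make_training_example` are hypothetical, the source video is assumed to play at normal speed, and the multimodal cues mentioned in the abstract are not modeled.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical speed classes, expressed as frame strides (1 = original speed).
SPEEDS = (1, 2, 4, 8)


def resample_clip(frames: Sequence, speed: int, clip_len: int, start: int = 0) -> List:
    """Simulate `speed`x playback by keeping every `speed`-th frame."""
    needed = start + (clip_len - 1) * speed + 1  # frames consumed from the source
    if speed < 1 or start < 0 or needed > len(frames):
        raise ValueError("video too short for this speed/clip length")
    return [frames[start + i * speed] for i in range(clip_len)]


def make_training_example(
    frames: Sequence, clip_len: int, rng: random.Random
) -> Tuple[List, int]:
    """Return (clip, label), where label indexes SPEEDS; no human labels needed."""
    # Keep only the speed classes the video is long enough to support.
    valid = [k for k, s in enumerate(SPEEDS) if (clip_len - 1) * s + 1 <= len(frames)]
    if not valid:
        raise ValueError("video too short for any speed class")
    label = rng.choice(valid)
    speed = SPEEDS[label]
    # Pick a random start so the clip still fits inside the video.
    max_start = len(frames) - ((clip_len - 1) * speed + 1)
    start = rng.randint(0, max_start)
    return resample_clip(frames, speed, clip_len, start), label


# Usage: integers stand in for decoded frames, so the stride is easy to inspect.
video = list(range(100))
clip, label = make_training_example(video, clip_len=8, rng=random.Random(0))
```

In this toy setup the gap between consecutive entries of `clip` equals `SPEEDS[label]`, which is the target a speed classifier would be trained to recover. In-the-wild footage breaks the normal-speed assumption, which is presumably why the abstract emphasizes learned detection of speed changes rather than synthetic labels alone.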