dev_to April 25, 2026

Seeing Fast and Slow: Learning the Flow of Time in Videos

video-processing, self-supervised-learning, temporal-augmentation, computer-vision, deep-learning

Time is everywhere in video — yet most computer vision models treat it as an afterthought. We compress temporal information into feature vectors, shuffle frames during training, and generally act like order doesn't matter. A new paper from researchers at the University of Washington and Google challenges that assumption head-on, treating time itself as a learnable visual concept. The core insight is deceptively simple: if you can tell whether a video has been sped up or slowed down, you fundamentally understand something about how motion unfolds in the real world.

The paper frames temporal perception as a self-supervised learning problem — no manual labels needed. Rather than annotating playback speed by hand, the authors exploit a signal that's already baked into videos: natural multimodal cues. Audio pitch, optical flow magnitude, and the statistical texture of motion all shift predictably when you change playback speed. The model learns to detect these signatures and estimate absolute playback speed from raw video. This framing is elegant because it sidesteps the annotation bottleneck that plagues so many video understanding tasks. The supervision comes from the data itself.
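
To make that pretext task concrete, here is a minimal sketch of a visual-only version: apply a known temporal resampling to a clip, then train a small classifier to recover which speed was applied. The architecture, clip shapes, and speed set are illustrative assumptions (the paper also leans on audio cues), not the authors' implementation.

```python
# Minimal sketch of a self-supervised playback-speed pretext task (illustrative,
# not the paper's code). The "label" is simply the speed factor we applied.
import torch
import torch.nn as nn

SPEEDS = [0.5, 1.0, 2.0, 4.0]  # assumed set of playback multipliers

def resample_clip(frames: torch.Tensor, speed: float, out_len: int = 16) -> torch.Tensor:
    """frames: (T, C, H, W). Simulate playback at `speed` by index resampling."""
    idx = (torch.arange(out_len) * speed).long().clamp(max=frames.shape[0] - 1)
    return frames[idx]

class SpeedClassifier(nn.Module):
    """Tiny 3D-conv network that predicts which speed class a clip was played at."""
    def __init__(self, n_classes: int = len(SPEEDS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        return self.head(self.backbone(clip).flatten(1))

# One self-supervised training step on a stand-in for decoded video frames.
raw = torch.randn(64, 3, 112, 112)                 # 64 source frames
label = torch.randint(len(SPEEDS), (1,))           # pick a speed; this IS the label
clip = resample_clip(raw, SPEEDS[label.item()])    # (16, 3, 112, 112)
batch = clip.permute(1, 0, 2, 3).unsqueeze(0)      # -> (1, C, T, H, W)

model = SpeedClassifier()
loss = nn.functional.cross_entropy(model(batch), label)
loss.backward()  # supervision came entirely from the resampling we applied
```
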
The paper delivers four concrete things:

1. Speed detection and estimation models. Trained self-supervised on in-the-wild video, these models learn to classify whether a clip has been temporally altered and estimate its approximate playback multiplier. The temporal reasoning transfers well to downstream tasks.

2. A large-scale slow-motion dataset. Using the speed estimation models as a filter, the authors mine the largest slow-motion video dataset assembled to date from noisy web sources (a mining sketch follows this list). High-speed camera footage is normally expensive to collect — this pipeline extracts it cheaply at scale. Slow-motion clips contain substantially denser temporal information per second of real time, making them valuable training data for any model that needs to reason about fine-grained motion.

3. Speed-conditioned video generation. Built on the curated slow-motion data, this model generates video at a specified playback speed. You give it a motion description and a speed multiplier; it produces plausible footage at that temporal rate. This is a meaningful step beyond current video diffusion models, which produce motion at whatever speed the training distribution happened to encode.

4. Temporal super-resolution. Given a low-FPS, motion-blurred clip, the model synthesizes the missing high-frequency temporal detail, producing a smooth high-FPS output. This is harder than spatial super-resolution because you're hallucinating events that occurred between frames, not just pixels.
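
As a concrete illustration of that mining step, the sketch below scores a pool of web clips with a speed estimator and keeps the ones whose estimated multiplier suggests high-speed capture. The predict_speed stub, the 0.6 threshold, and the directory layout are assumptions for illustration, not details from the paper.

```python
# Sketch of mining slow-motion footage from a noisy clip pool using a trained
# speed estimator (illustrative; thresholds and I/O layout are made up).
from pathlib import Path
import torch

def predict_speed(clip: torch.Tensor) -> float:
    """Stand-in for a trained estimator returning an estimated playback
    multiplier for a clip of shape (C, T, H, W). Plug in a real model here."""
    return float(torch.rand(()).item() * 4.0)

def load_clip(path: Path) -> torch.Tensor:
    """Stand-in decoder; real code would use e.g. torchvision.io.read_video."""
    return torch.randn(3, 16, 112, 112)

def mine_slow_motion(pool_dir: str, keep_below: float = 0.6) -> list[Path]:
    """Keep clips whose content looks slower than real time, i.e. likely
    recorded with a high-speed camera and uploaded as slow motion."""
    kept = []
    for path in sorted(Path(pool_dir).glob("*.mp4")):
        speed = predict_speed(load_clip(path))
        if speed < keep_below:          # estimated multiplier well under 1.0x
            kept.append(path)
    return kept

if __name__ == "__main__":
    print(mine_slow_motion("web_clips/"))
```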

For ML engineers building video systems, the implications branch in several directions:

Data curation pipelines. The speed estimation model is a ready-made filter for finding temporally rich content at scale. If you're training any motion-aware model, mining slow-motion footage programmatically is now feasible.

Controllable generation. Speed conditioning adds a new axis of control to video generation. Product demos, sports replays, scientific visualization — anything where you want to say "show me this motion at 0.25x" without manually interpolating frames.
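
One plausible way to wire such a control, sketched below under assumptions of the sketch itself rather than the paper's architecture: embed the log of the requested speed multiplier and fold it into whatever conditioning vector the generator already consumes, much like diffusion models embed timesteps.

```python
# Hypothetical speed-conditioning module: embeds a playback multiplier so a
# video generator can be told "produce this motion at 0.25x" (sketch only).
import torch
import torch.nn as nn

class SpeedEmbedding(nn.Module):
    """Maps a speed multiplier (e.g. 0.25, 1.0, 4.0) to a conditioning vector
    that can be added to a generator's existing text/timestep conditioning."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # log scale so 0.25x and 4x sit symmetrically around real time (1.0x)
        return self.mlp(torch.log(speed).unsqueeze(-1))

# Usage: fold the speed signal into whatever conditioning the generator uses.
text_cond = torch.randn(2, 256)                    # stand-in for text embeddings
speed = torch.tensor([0.25, 1.0])                  # per-sample playback multiplier
cond = text_cond + SpeedEmbedding(256)(speed)      # combined conditioning, (2, 256)
print(cond.shape)
```
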
Temporal forensics. Detecting whether a video has been sped up, slowed down, or had frames dropped is directly useful for media authenticity workflows. The same self-supervised signal that trains the generative models can serve as a manipulation detector.
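
For the forensics use case, the same estimator can gate an authenticity check: if a clip's estimated native speed sits far from 1.0x, flag it for review. The stub estimator and the 0.25 log-tolerance below are illustrative assumptions, not values from the paper.

```python
# Sketch of a playback-tampering check: flag clips whose estimated native speed
# deviates from real time by more than a tolerance (thresholds are assumptions).
import math
import torch

def predict_speed(clip: torch.Tensor) -> float:
    """Stand-in for a trained speed estimator; returns an estimated multiplier."""
    return float(torch.rand(()).item() * 2.0 + 0.1)

def looks_retimed(clip: torch.Tensor, tolerance: float = 0.25) -> bool:
    """True if the clip appears sped up or slowed down relative to 1.0x."""
    return abs(math.log2(predict_speed(clip))) > tolerance

clip = torch.randn(3, 16, 112, 112)   # stand-in for a decoded clip (C, T, H, W)
print("possible retiming detected" if looks_retimed(clip) else "looks real-time")
```
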
World models. The authors gesture toward a longer-term payoff: models that understand how events unfold over time rather than just recognizing static patterns. Temporal super-resolution and speed conditioning are early building blocks for that.

Limitations

A few things worth watching before you reach for this in production:

Distribution shift. The self-supervised training signal relies on videos where speed changes are detectable via audio and optical flow. Silent clips, purely static scenes, or heavily compressed web video may degrade estimation accuracy.

Hallucination risk in temporal super-resolution. Synthesizing missing frames is fundamentally generative — the model is making educated guesses about what happened between observations. For safety-critical or forensic use cases, those guesses need to be treated with appropriate skepticism.

Scale of the slow-motion dataset. While described as the largest to date, "largest" in slow-motion video is still a narrower domain than general video. Generalization to long-tail motion types (industrial machinery, micro-scale biology) remains an open question.

Compute. Speed-conditioned generation and temporal super-resolution both sit on top of diffusion-based architectures. Inference cost is non-trivial for real-time applications.

Overall this is a well-scoped paper that turns a gap in video understanding — temporal perception — into a practical engineering pipeline. The self-supervised framing is the key unlock: it makes the whole thing trainable at scale without human annotators.

Paper: https://arxiv.org/abs/2604.21931v1

tags: machinelearning, computervision, videogeneration, deeplearning

🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/834nq0l6