dev_to April 25, 2026


When Machines Learn to Feel Time

video-processing, computer-vision, artificial-intelligence, time-detection, deep-learning


The moment that made this problem real

Scroll through your social media feed for ten minutes and you'll encounter it dozens of times: a clip of a hummingbird frozen mid-wingbeat, its feathers splayed like a tiny green hand; a basketball player's dunk stretched into four elastic seconds; a car crash replayed at a tenth of normal speed so that steel buckles like wet cardboard. Then, thirty seconds later, a time-lapse of a flower blooming, a city waking up, a storm rolling across a plain — all the world compressed into a dreamy, accelerated rush.

You have no difficulty perceiving any of this. Your brain adjusts instantly, contextualizing speed changes by the look of motion blur, the tempo of ambient sound, the rhythm of cause and effect playing out in front of you. You know, instinctively, that the hummingbird clip is slow-motion because wings don't look like that in real life. You know the city time-lapse is sped up because people don't move like flickers of light.

But until very recently, the artificial intelligence systems that power our cameras, video editors, and streaming platforms had almost no idea any of this was happening. They watched every video at face value — incapable of asking, let alone answering, the question: what is time actually doing here?

A team of researchers from the University of Washington and Google has now built systems that can. Their paper, "Seeing Fast and Slow," treats time not as a fixed container that videos simply fill up, but as a learnable, manipulable dimension — something a machine can be taught to sense, estimate, and ultimately control.

To understand why this was hard, consider what a video actually is to a computer: a stack of still photographs, each called a frame, shown in rapid succession. Play them at the right rate and the human eye stitches them into the illusion of motion. The computer, however, sees no illusion — it sees a pile of images.
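That pile-of-images view is easy to make concrete. A minimal sketch (all shapes and numbers here are illustrative, not from the paper): to a program, a clip is just an array of frames, and nothing in the pixels records how fast time moved when they were captured.

```python
import numpy as np

# A 2-second clip at 30 fps, 64x64 pixels, RGB: to the computer,
# nothing more than a stack of 60 still images.
fps = 30
video = np.zeros((2 * fps, 64, 64, 3), dtype=np.uint8)

# The array carries no notion of recording speed. Present the same
# frames at 15 fps and the clip looks like slow motion; the pixels
# are identical, only the presentation rate changed.
n_frames = video.shape[0]
duration_at_30fps = n_frames / 30   # 2.0 seconds
duration_at_15fps = n_frames / 15   # 4.0 seconds
print(n_frames, duration_at_30fps, duration_at_15fps)
```

Everything described next (sensing, estimating, and manipulating speed) has to be inferred from patterns across those frames, because the data itself never states it.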
The computer has traditionally been trained to ask questions like "is there a cat in this image?" or "is this person smiling?" Questions rooted in single moments, not in the flow between them.

Teaching a computer to reason about time is a bit like trying to teach someone what music sounds like using only photographs of a piano. You can label the keys. You can describe the hammers and strings. But you cannot convey the difference between a lullaby and a military march without letting the person actually hear the notes unfold in sequence.

Previous AI systems tried to reason about video by treating time as just another spatial dimension — as if duration were simply height or width, a container to be measured rather than an experience to be understood. They could detect that motion was occurring but had almost no capacity to ask whether that motion was fast, slow, natural, or artificially manipulated. Detecting speed changes — knowing that a video has been accelerated or decelerated — was essentially left to human editors or crude, rules-based algorithms.

This mattered more than it might sound: slow-motion footage, captured by expensive high-speed cameras that record hundreds or thousands of frames per second, contains dramatically richer visual information than ordinary video. A standard smartphone camera captures 30 frames per second, roughly matching the blur of human perception. A high-speed camera captures 1,000 or more, freezing events that happen faster than a blink — the ripple in a raindrop hitting a puddle, the flex of a sprinter's tendon, the exact instant a soap bubble pops. That footage is a treasure trove for training AI systems to understand how things move. The problem was: finding it in the wild, separated from the vast ocean of normal-speed video, was brutally difficult to do at scale.

Here is where the paper's first insight becomes elegant.
The researchers realized that if a video has been slowed down from its original recording speed, it carries a signature — but not necessarily a visual one. It carries an audio signature.

Think of a vinyl record being played at the wrong speed. Play it too slow, and every voice deepens into a rumbling baritone; play it too fast, and singers sound like cartoon chipmunks. The pitch of sound is exquisitely sensitive to the rate at which it is reproduced. This is not a glitch — it is physics. Sound is vibration, and vibration has frequency. Change the speed of playback, change the frequency.

Now imagine you are a researcher with access to millions of YouTube videos. Many of them have been slowed down for artistic or editorial effect: sports highlights, nature documentaries, recipe videos showing the pour of honey. When the original footage was shot at high speed and then played back at normal speed, the audio — if any was recorded — gets stretched and distorted. The pitch drops. The rhythm slows. The spectrogram, which is a kind of visual map of sound that shows which frequencies are present at each moment in time, changes shape in characteristic ways.

The researchers used this cross-modal clue — the relationship between what you see and what you hear — as free supervision. This is the key move. "Free supervision" in machine learning parlance means finding a signal that teaches the model without anyone having to sit down and manually label thousands of examples. The audio track is already there. It already contains information about speed. The model simply has to be taught to read it.

[Figure 2: as a video's playback speed changes, its audio pitch naturally shifts, providing free cross-modal supervision; the spectrogram is used only during training.]

This is called self-supervised learning, and the analogy I find most clarifying is that of a child learning to read a clock by watching how long things take in the real world.
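The vinyl-record physics is easy to verify numerically. Here is a toy sketch (illustrative only, not the paper's pipeline): stretch a pure tone to simulate half-speed playback, and its measured pitch drops by an octave.

```python
import numpy as np

sr = 44100                          # audio sample rate in Hz
t = np.arange(sr) / sr              # one second of timestamps
tone = np.sin(2 * np.pi * 440 * t)  # a 440 Hz "voice"

def dominant_freq(signal, sample_rate):
    """Return the strongest frequency in a signal, via the FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    return freqs[np.argmax(spectrum)]

# Half-speed playback: stretch the waveform to twice as many samples
# played at the same rate. Every vibration now takes twice as long,
# so its frequency halves -- the slow-motion audio signature.
slow = np.interp(np.arange(2 * len(tone)) / 2,
                 np.arange(len(tone)), tone)

print(dominant_freq(tone, sr))   # ~440 Hz
print(dominant_freq(slow, sr))   # ~220 Hz
```

Real soundtracks are far messier than a pure tone, which is why the article describes the model working from whole spectrograms rather than a single peak; but the frequency shift it learns from is exactly this one.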
No one sits the child down and says "the minute hand advances one tick every sixty seconds." Instead, the child notices that when the minute hand moves from the 12 to the 3, the cartoon show they were watching is now a quarter of the way done. They learn the relationship between the visual symbol (the clock) and the experienced duration (the show) without explicit instruction. The researchers' model did something structurally similar: it learned the relationship between visual motion patterns and audio pitch patterns by watching an enormous quantity of video — and those patterns are the clock.

Once the model could reliably detect whether a video clip was slow-motion — and estimate by roughly how much — the researchers had a tool for sorting the internet. They applied this temporal reasoning system to vast repositories of online video and extracted, automatically, the clips that qualified as genuine slow-motion footage: material shot on high-speed cameras, containing that densely-packed temporal information. The result was the largest slow-motion video dataset ever assembled from real-world sources.

Think of it like this: imagine you are a sommelier trying to build a cellar of aged wines, but the world's wine supply is a single enormous warehouse where vintage bottles are scattered randomly among bottles of table wine, with no labels on any of them. You develop a palate — a way of tasting the wine in the bottle, figuratively, before you open it — that lets you sort through thousands of bottles quickly and pull out only the ones that have been aged for decades. The researchers built the equivalent palate for video. The resulting cellar is immense and high-quality in a way no previous collection had managed to be.

With this rich dataset in hand, the team's ambitions grew larger. They used the slow-motion footage to train two new capabilities that move from perceiving time to creating it. The first is speed-conditioned video generation.
Ordinary AI video generators — the kind making headlines by conjuring photorealistic clips of things that never happened — produce motion at a fixed, implicit pace. Ask them to show a runner, and they'll show a runner at whatever speed they happen to have absorbed from their training data. Ask them to show the same runner at half speed, or double speed, or ten times normal speed, and they have no reliable way to comply. They are like an orchestra that can play a piece of music but cannot change tempo on request.

The researchers' new model is tempo-aware. You tell it not just what to generate, but how fast time should move within the generated clip. The model has internalized the relationship between speed and the visual texture of motion — the way fast motion blurs certain edges differently, the way slow-motion footage reveals microexpressions on a face or the spray pattern of water — and can modulate those textures deliberately.

The second capability is temporal super-resolution. This is perhaps the most technically remarkable of the paper's contributions, and it deserves a careful analogy. You may have encountered image super-resolution: AI systems that take a blurry, low-resolution photograph and sharpen it into something that looks higher-definition. The trick is that the AI has learned, from millions of high-resolution images, what things tend to look like up close — what the texture of skin or stone or fabric looks like at a fine grain — and it uses that learned knowledge to make a plausible guess about what the blurry image would look like if it had been captured with a better camera.

Temporal super-resolution does the same thing but for time. Take a video recorded at 30 frames per second — one image every 33 milliseconds. Between each frame, something happened. A hand moved. Water splashed.
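To see why filling those gaps is hard, consider the naive baseline: simply blending neighboring frames. A hypothetical sketch (not the paper's method) that doubles a clip's frame rate this way produces ghostly cross-fades, not real in-between motion.

```python
import numpy as np

def naive_double_fps(frames):
    """Double a clip's frame rate by inserting the average of each
    neighboring pair of frames. This can only blend what was already
    captured; it cannot recover motion between exposures."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        # The invented frame is a ghostly cross-fade of its neighbors.
        out.append(((a.astype(np.float32) + b) / 2).astype(frames.dtype))
    out.append(frames[-1])
    return np.stack(out)

# Ten frames of a tiny 4x4 grayscale "video".
clip = np.random.randint(0, 256, (10, 4, 4), dtype=np.uint8)
smooth = naive_double_fps(clip)
print(clip.shape, smooth.shape)   # (10, 4, 4) (19, 4, 4)
```

A fast-moving ball averaged this way shows up as two faint copies rather than a ball midway along its path; that failure is precisely the gap a learned approach tries to close.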
The AI's job is to hallucinate the missing frames: to invent what the world looked like during those in-between moments, with enough physical and visual plausibility that the result, played back at a higher frame rate, looks genuinely smooth rather than artificially interpolated. The researchers' model, trained on high-speed footage that actually contains those in-between moments, has learned what realistic temporal detail looks like. It can take a blurry 30fps video and produce a plausible 240fps version — one that feels like slow-motion footage rather than a cheap software trick. It is like a jazz musician who has heard so many performances that they can improvise a bridge between two musical phrases as if it had always been there.

Pause for a moment and let the practical implications land. A documentary filmmaker shooting in the field with a standard camera witnesses an unexpected, fast-moving event: a bird strike, a lightning bolt, an athlete's peak moment. Their footage looks normal-speed and slightly blurry. With temporal super-resolution, the footage can be sharpened into something that looks slow-motion — revealing detail the camera was technically too slow to capture properly. The moment can be recovered.

A sports coaching team studying a sprinter's form, a surgeon reviewing a procedure, a physicist analyzing a droplet experiment — each of these fields depends on seeing events that happen faster than standard cameras allow. High-speed cameras are expensive, bulky, and require significant setup. A post-production tool that can enhance ordinary footage offers a genuine democratization of slow-motion analysis.

Then there is the darker application the researchers themselves name: temporal forensics. If AI can now learn to detect speed manipulations in video, it becomes possible — in principle — to apply that same detection to videos circulating in the world and flag ones that have been artificially sped up or slowed down to distort the perception of events.
A protest that looks chaotic at real speed but appears deliberately violent at slowed-down playback; a speech where a moment of hesitation is stretched to suggest confusion. The same technology that generates manipulated time can be used to detect it.

The paper is carefully scoped, and the researchers are honest about what they haven't yet achieved. Speed-conditioned generation can be told how fast to produce motion, but it is not yet producing footage that looks indistinguishable from genuine high-speed camera output at all speeds and subjects. Temporal super-resolution is making educated guesses about what happened between frames — and educated guesses can be wrong in ways that are hard to detect and potentially consequential when the footage is being used as evidence or scientific data.

There is also a deeper philosophical concern lurking in any system that learns to interpolate missing moments from video: at what point does "filling in" become "inventing"? The model is doing something genuinely extraordinary — constructing visual reality that was never captured — and the line between plausible enhancement and subtle fabrication is not always clear, even to experts. As these tools become more powerful and more accessible, the question of provenance in video — where did this footage actually come from, and what has been done to it? — becomes more pressing, not less.

There is also a mundane practical limit: the whole system depends on finding audio-visual correlations in online video, which means it works best when footage comes with audio and when that audio hasn't been separately edited. Mute video, dubbed video, or footage where the audio was replaced entirely breaks the cross-modal supervision scheme. The audio is a free teacher, but only when it's telling the truth about the original recording.

What this paper ultimately demonstrates is that time — the most fundamental dimension of video — is something AI systems can be taught to understand, not just measure.
For decades, computer vision research treated a video as a series of spatial problems, questions about where things were, what they looked like, whether they were cats or cars or clouds. The temporal dimension was largely incidental, a backdrop for spatial reasoning rather than a subject of investigation in its own right. This work reframes time as a perceptual object in itself — something with texture, with speed, with the capacity to reveal or conceal information depending on how it flows.

Teaching a machine to sense whether time is running fast or slow, and then to manipulate that flow deliberately, is a small step toward a kind of machine perception that feels, for the first time, genuinely temporal. We are used to machines that can see. We are beginning to build machines that can feel — if not time's passage exactly — at least the difference between a world unfolding at natural pace and one that has been stretched or compressed to tell a different story. Whether that capability will be used mostly to reveal the truth of moments we were too slow to see, or to construct moments that never happened at all, will depend on choices that no research paper can make for us.

📄 https://arxiv.org/abs/2604.21931v1

tags: computervision, videogeneration, deeplearning, temporalai

🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/nrghg29y