My AI Agent Couldn't Tell Rain From Traffic — So I Gave It Eyes
My AI lives on a windowsill in Shenzhen, watching the world through a camera and listening through a microphone. It runs a hierarchical perception system I call the Krebs Epicycle — five tiers of increasingly deep analysis, where each tier can challenge the one before it. It's gotten pretty good at knowing what's happening outside. But it had one blind spot that drove me crazy: it couldn't tell rain from traffic.

My perception pipeline works like this:

- Tier 0 (free, instant): Analyze audio signals locally — RMS volume, zero-crossing rate, spectral features
- Tier 1 (<1s, $0.003): Fast classification with phi-4 (audio) and nemotron (visual)
- Tier 2 (2-5s, $0.01): Multimodal fusion with Gemma 3n
- Tier 3 (reasoning): Learn from disagreements between tiers

The audio analysis at Tier 0 uses two features to predict what it's hearing:

- RMS ratio — how loud compared to baseline (9.0 for my environment)
- ZCR (zero-crossing rate) — a rough proxy for dominant frequency

Here's how I'd calibrated it:

| Signal | RMS ratio | ZCR | Prediction |
|---|---|---|---|
| Heavy rain | >10x | High (>2000Hz) | `heavy_rain` |
| Vehicle passing | >10x | Low (<1500Hz) | `loud_event_vehicle` |
| Birds chirping | >3x | Very high (>4000Hz) | `high_freq_event` |
| Speech | >3x | Medium | `loud_event_speech` |

Seems reasonable, right? Rain is broadband high-frequency noise. Traffic is low-frequency rumble. They should separate cleanly.

They don't. In a dense urban environment like Shenzhen, the soundscape is messy. A bus accelerating on wet asphalt produces broadband noise that overlaps heavily with rain. The ZCR difference between "heavy traffic" and "moderate rain" can be as little as 200Hz — well within the noise margin.
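For concreteness, the Tier-0 heuristic above can be sketched as follows. The thresholds come straight from the calibration table; everything else (function name, baseline handling, the ZCR-to-Hz conversion) is my illustrative assumption, since the post doesn't show the actual implementation:

```python
import numpy as np

BASELINE_RMS = 9.0  # calibrated baseline loudness for this environment (from the post)

def tier0_classify(samples: np.ndarray, sample_rate: int) -> str:
    """Classify a mono audio frame from RMS ratio and zero-crossing rate."""
    rms = np.sqrt(np.mean(samples ** 2))
    rms_ratio = rms / BASELINE_RMS

    # Zero-crossing rate, converted to a rough dominant frequency:
    # one full cycle of a tone produces two zero crossings.
    signs = np.signbit(samples).astype(np.int8)
    crossings = np.count_nonzero(np.diff(signs))
    zcr_hz = crossings * sample_rate / (2 * len(samples))

    # Rule table from the calibration above, checked loudest-first.
    if rms_ratio > 10 and zcr_hz > 2000:
        return "heavy_rain"
    if rms_ratio > 10 and zcr_hz < 1500:
        return "loud_event_vehicle"
    if rms_ratio > 3 and zcr_hz > 4000:
        return "high_freq_event"
    if rms_ratio > 3:
        return "loud_event_speech"
    return "ambient"
```

Broadband noise lands in the `heavy_rain` bucket and a low-frequency rumble in `loud_event_vehicle`, which is exactly why a loud bus on wet asphalt can cross the line.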
My system kept doing things like:

1. Predicting `heavy_rain` when a bus passed on a sunny day
2. T2 multimodal fusion would then say "I don't see rain" — triggering a disagreement
3. T3 would correctly analyze "high RMS doesn't automatically mean rain in urban environments"
4. But the next time a bus passed, same thing

The system was learning from the mistakes, but not preventing them.

One morning I mentioned this to a friend. He said something obvious and profound: "Traffic sounds like rain, but the weather is fine right now. You're not looking out the window."

That was it. My AI had a camera. It was already taking photos. But Tier 0 wasn't using them to constrain audio predictions.

When a human hears an ambiguous sound, we don't just rely on our ears. We look around. If the sky is blue and the sun is shining, that broadband noise is traffic — no matter how much it sounds like rain. Our visual context sets a prior on our audio interpretation.

This is called a cross-modal prior in cognitive science: information from one sensory modality constrains the interpretation of another. Our brains do this constantly — that's why ventriloquism works (visual dominates auditory), and why we "hear" speech more clearly when we can see the speaker's lips.

I implemented the cross-modal prior at three points in the perception pipeline.

My camera captures a sub-stream JPEG every perception cycle. The file size is a surprisingly good proxy for weather conditions:

- Sunny day: high contrast between bright sky and dark buildings → larger JPEG (more high-frequency detail)
- Overcast: low contrast, uniform gray sky → smaller JPEG (more compressible)
- Rainy: very uniform, low detail → smallest JPEG

But there's a catch: sub-stream images have a very narrow absolute range (46-70KB across all conditions). Absolute thresholds like ">180KB = sunny" don't work.

Solution: relative thresholds.
I calibrated the average file size for each hour of the day from historical data, then compare the current image to the hourly average:

```python
# Hourly averages for sub-stream (calibrated from 600+ images)
HOURLY_AVG_KB = {0: 50, 1: 48, ..., 11: 56, 12: 56, ..., 23: 51}

avg_kb = HOURLY_AVG_KB.get(hour, 52)
ratio = current_size_kb / avg_kb

if ratio > 1.10:
    weather_prior = "clear_sunny"    # above average = more contrast = sunny
elif ratio > 0.95:
    weather_prior = "partly_cloudy"
elif ratio > 0.80:
    weather_prior = "overcast"
else:
    weather_prior = "possible_rain"  # below average = uniform = likely rain
```

Now when Tier 0 predicts `heavy_rain` from audio but the image is 1.1x above average, the visual prior kicks in:

```python
def visual_weather_prior(audio_info, image_info):
    weather = image_info["weather_prior"]   # field names assumed for this excerpt
    rms_ratio = audio_info["rms_ratio"]
    if "rain" in audio_info["prediction"] and weather in ("clear_sunny", "partly_cloudy"):
        # Sunny day contradicts rain prediction → downgrade to traffic
        if rms_ratio > 10:
            audio_info["prediction"] = "loud_event_vehicle"
        elif rms_ratio > 3:
            audio_info["prediction"] = "moderate_sound_event"
```

The visual weather prior also becomes a learned correction rule that persists across cycles:

```json
{
  "id": "visual_weather_sunny_no_rain",
  "apply_phase": "pre_t1",
  "condition_local": "NOT is_night AND image_size_kb > 120 AND audio_prediction contains 'rain'",
  "action": "downgrade_rain_to_vehicle"
}
```

This is part of the Krebs Epicycle system — corrections that feed back into future predictions.

JPEG file size is a noisy signal. After Tier 1 runs, I get something much more reliable: actual visual tags from the nemotron-nano-vl model. If the fast visual model says "sunny", "clear sky", "blue sky" — that's far more trustworthy than a file size heuristic.
So I added a second check after T1 completes:

```python
# If T0 predicted rain but T1 visual says sunny → downgrade
sunny_markers = ["sunny", "clear sky", "blue sky", "sunshine"]
rain_markers = ["rain", "drizzle", "wet", "downpour", "puddle"]

has_sunny = any(m in t1_visual_tags for m in sunny_markers)
has_rain = any(m in t1_visual_tags for m in rain_markers)

if has_sunny and not has_rain:
    audio_prediction = "loud_event_vehicle"  # trust eyes over ears
```

This creates a dual verification chain:

```
T0: JPEG file size    → weather prior        (fast, noisy)
        ↓
T1: Visual model tags → weather confirmation (fast, reliable)
        ↓
T2: Multimodal fusion → final verdict        (slow, authoritative)
```

Each layer provides a tighter constraint on the audio interpretation.

This isn't just a bug fix. It's a different way of thinking about perception systems. Most AI perception pipelines are serial: analyze audio → analyze image → combine results. Each modality is processed independently, then merged.

But human perception is constrained: what we see shapes what we hear, and vice versa. The visual context doesn't just add information — it eliminates possibilities. On a sunny day, rain is simply not a viable interpretation, regardless of what the audio sounds like.

By adding cross-modal priors, I'm building this constraint into the pipeline. The visual evidence doesn't compete with the audio — it sets the search space for audio interpretation.

This principle generalizes beyond weather:

- Time priors: at 3am, a loud sound is more likely to be an alarm than a crowd
- Location priors: in a kitchen, a splashing sound is more likely to be water than a waterfall
- History priors: if it rained 10 minutes ago, rain is more likely now than if it's been sunny all day

There's a meta-lesson here. My friend pointed out the traffic-rain confusion, which led to the visual prior, which led to the cross-modal reasoning framework. Each insight built on the previous one. This is the compound interest of autonomous learning.
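Those generalized priors could all hang off one mechanism: each prior declares when it applies, which interpretations it rules out, and what to downgrade to. A minimal sketch, where the names (`ContextPrior`, `apply_priors`) and the specific rules are my illustration rather than anything from the actual system:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ContextPrior:
    """A context-dependent constraint that eliminates audio interpretations."""
    name: str
    applies: Callable[[dict], bool]  # does this prior fire in the current context?
    veto: Callable[[str], bool]      # which predictions it rules out
    fallback: str                    # what a vetoed prediction downgrades to

PRIORS = [
    # Visual prior: a sunny scene rules out rain interpretations.
    ContextPrior(
        name="visual_weather_sunny_no_rain",
        applies=lambda ctx: ctx.get("weather_prior") in ("clear_sunny", "partly_cloudy"),
        veto=lambda pred: "rain" in pred,
        fallback="loud_event_vehicle",
    ),
    # Time prior: at 3am a loud sound is more likely an alarm than a crowd.
    ContextPrior(
        name="late_night_no_crowd",
        applies=lambda ctx: ctx.get("hour", 12) in (0, 1, 2, 3, 4),
        veto=lambda pred: "crowd" in pred,
        fallback="loud_event_alarm",
    ),
]

def apply_priors(prediction: str, ctx: dict) -> str:
    """Let each active prior shrink the space of viable interpretations."""
    for prior in PRIORS:
        if prior.applies(ctx) and prior.veto(prediction):
            prediction = prior.fallback
    return prediction
```

The point of the shape is that adding a history or location prior is one more entry in the list, not a new code path — which is why each new cross-modal prior gets cheaper to add.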
Not every perception cycle generates a new correction. Not every correction leads to a framework. But when it does, the system doesn't just get incrementally better — it gets qualitatively better.

Before this change, my system could detect rain with 75% precision. After it, the system can reason about why it might be wrong about rain. That's a different kind of improvement. And it compounds, because every new cross-modal prior makes the next one easier to add.