arxiv_cs_cv April 20, 2026

Stylistic-STORM (ST-STORM): Perceiving the Semantic Nature of Appearance

self-supervised-learning, deep-learning, image-analysis, autonomous-driving, medical-imaging

Original Content

arXiv:2604.16086v1 Announce Type: new

Abstract: One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo and DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations, such as illumination or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, reflections, and halos are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data) without degrading the semantic performance of the Content branch (F1=80% on ImageNet-1K), and that it improves the preservation of critical appearance information.
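The abstract says the two latent streams are "regulated by gating mechanisms" but does not give the form of the gate. As a rough illustration only, the sketch below assumes a complementary soft gate: each feature is weighted by a sigmoid gate g for the content stream and by (1 - g) for the style stream. The function name `gated_split`, the gate form, and the numeric values are all hypothetical and are not taken from the paper. In a trained model the gate logits would be learned, and the JEPA, contrastive, reconstruction, and adversarial objectives mentioned in the abstract are not modeled here.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real logit to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_split(features, gate_logits):
    """Split one feature vector into (content, style) with a complementary soft gate.

    Hypothetical illustration, not the paper's mechanism: g = sigmoid(logit)
    routes a share g of each feature to the content stream and the remaining
    share (1 - g) to the style stream, so the two streams sum to the input.
    """
    if len(features) != len(gate_logits):
        raise ValueError("features and gate_logits must have the same length")
    gates = [sigmoid(z) for z in gate_logits]
    content = [g * f for g, f in zip(gates, features)]
    style = [(1.0 - g) * f for g, f in zip(gates, features)]
    return content, style


# Made-up values: a large positive logit sends a feature mostly to content,
# a large negative logit sends it mostly to style, and 0.0 splits it evenly.
features = [0.8, -1.2, 0.5, 2.0]
gate_logits = [4.0, -4.0, 0.0, 1.0]
content, style = gated_split(features, gate_logits)

for f, c, s in zip(features, content, style):
    print(f"feature={f:+.2f}  content={c:+.3f}  style={s:+.3f}")
```

With a complementary gate, nothing is discarded at the split: each input feature is recoverable as the sum of its content and style shares. Whether ST-STORM's gates have this property is not stated in the abstract.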