arxiv_cs_cv February 10, 2026

How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

Translated: 2026/3/15 19:01:51

deepfake-detection, zero-shot-learning, image-generation, machine-learning-benchmark, misinformation-combat

Original Content

arXiv:2602.07814v1 Announce Type: new

Abstract: As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance, which is the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6 million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1) no universal winner exists, with detector rankings exhibiting substantial instability (Spearman ρ: 0.01 to 0.87 across dataset pairs); (2) a 37 percentage-point performance gap separates the best detector (75.0% mean accuracy) from the worst (37.5%); (3) training data alignment critically impacts generalization, causing up to 20-60% performance variance within architecturally identical detector families; (4) modern commercial generators (Flux Dev, Firefly v4, Midjourney v7) defeat most detectors, which achieve only 18-30% average accuracy on their images; and (5) we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: χ² = 121.01, p < 10⁻¹⁶, Kendall W = 0.524). Our findings challenge the "one-size-fits-all" detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.
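The abstract summarizes ranking instability with Spearman's ρ between dataset pairs and overall detector differences with a Friedman test and Kendall's W. The sketch below shows one way such statistics could be computed from a table of per-dataset detector accuracies. It is a minimal illustration, not the authors' code: the accuracy values, dataset names, and function names are all hypothetical, and the Friedman statistic is computed without a tie correction, which the paper may or may not apply.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def friedman_and_kendall_w(acc):
    """Friedman chi-square and Kendall's W (no tie correction).

    acc maps each dataset name to a list of accuracies, one per detector,
    with detectors in the same order for every dataset.
    """
    n = len(acc)                          # datasets act as blocks
    k = len(next(iter(acc.values())))     # detectors act as treatments
    rank_sums = [0.0] * k
    for scores in acc.values():
        for detector, rank in enumerate(average_ranks(scores)):
            rank_sums[detector] += rank
    chi2 = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    w = chi2 / (n * (k - 1))              # W rescales chi-square into [0, 1]
    return chi2, w


# Hypothetical accuracies for four detectors on three datasets (not from the paper).
acc = {
    "dataset_a": [0.90, 0.70, 0.50, 0.60],
    "dataset_b": [0.55, 0.80, 0.45, 0.65],
    "dataset_c": [0.85, 0.75, 0.40, 0.60],
}

for name_1, name_2 in combinations(acc, 2):
    rho = spearman_rho(acc[name_1], acc[name_2])
    print(f"{name_1} vs {name_2}: rho = {rho:.2f}")

chi2, w = friedman_and_kendall_w(acc)
print(f"Friedman chi-square = {chi2:.2f}, Kendall W = {w:.3f}")
```

A ρ near 1 for a dataset pair means the detectors are ordered almost identically on both datasets, while a ρ near 0 means the ranking on one dataset says little about the ranking on the other. Kendall's W summarizes the same agreement across all datasets at once, with 0 meaning no agreement and 1 meaning identical rankings.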