arxiv_cs_cv April 24, 2026

Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

face-forgery-detection, cross-domain-evaluation, clip, mixture-of-experts, ai-security

Original Content

arXiv:2604.21478v1 Announce Type: new

Abstract: With the rapid development of generative models, visual data forgery detection plays an increasingly important role in social and economic security. Existing face forgery detectors still fall short of satisfactory performance because they generalize poorly across datasets. The key factor behind this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal that detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). Evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. We also propose a novel framework, Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts and a facial region mixture-of-experts module that routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on public datasets demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods under various suitable metrics.
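The Cross-AUC idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `auc` and `cross_auc`, the convention that a higher score means "more likely fake", the choice to report the two directions separately, and the toy scores are all assumptions made for illustration. The abstract does not say how the two directions are aggregated.

```python
from typing import Dict, Sequence


def auc(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """AUC as the probability that a fake sample outscores a real one.

    Uses the pairwise (Mann-Whitney) formulation; ties count as half.
    Assumption: a higher score means "more likely fake".
    """
    if not real_scores or not fake_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def cross_auc(
    real_a: Sequence[float],
    fake_a: Sequence[float],
    real_b: Sequence[float],
    fake_b: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical Cross-AUC sketch for one dataset pair (A, B).

    Contrasts real samples from one dataset with fake samples from the
    other, in both directions. Reporting the two directions separately
    is an assumption, not something the abstract specifies.
    """
    return {
        "real_A_vs_fake_B": auc(real_a, fake_b),
        "real_B_vs_fake_A": auc(real_b, fake_a),
    }


# Toy detector scores, made up for illustration. The detector separates
# real from fake perfectly inside each dataset, but every score on
# dataset B sits higher than its counterpart on dataset A.
real_a, fake_a = [0.1, 0.2, 0.3], [0.6, 0.7, 0.8]
real_b, fake_b = [0.5, 0.65, 0.75], [0.9, 0.95, 0.99]

within_a = auc(real_a, fake_a)  # 1.0 on this toy data
within_b = auc(real_b, fake_b)  # 1.0 on this toy data
cross = cross_auc(real_a, fake_a, real_b, fake_b)

print(within_a, within_b, cross)
```

On this toy data both within-dataset AUCs are 1.0, while the real-B versus fake-A direction falls to about 0.67, because some real samples from B score higher than fake samples from A. That is the kind of score shift across domains that a per-dataset AUC cannot show.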