arxiv_cs_cv April 24, 2026

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Translated: 2026/4/24 19:46:20

deepfakes, facial-dynamics, interpretable-ai, machine-learning, behavioral-bio-signals

Original Content

arXiv:2604.21760v1 Announce Type: new Abstract: Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.
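
The abstract does not say which temporal features were derived from the low-dimensional movement patterns, so the following is only a minimal sketch of the general idea: summarizing a one-dimensional facial-movement trajectory with a few interpretable temporal statistics that a traditional classifier could consume. The function names, the choice of statistics (velocity spread, lag-1 autocorrelation, excess kurtosis), and the synthetic trajectories are all illustrative assumptions, not the authors' feature set or data.

```python
import math
import random
from statistics import fmean, pstdev


def first_differences(series):
    """Frame-to-frame change ("velocity") of a 1-D movement trajectory."""
    return [b - a for a, b in zip(series, series[1:])]


def lag1_autocorr(values):
    """Lag-1 autocorrelation; values near 1 mean smooth, predictable change."""
    mean = fmean(values)
    den = sum((v - mean) ** 2 for v in values)
    if den == 0:
        return 0.0
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    return num / den


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; a simple higher-order statistic."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    if m2 == 0:
        return 0.0
    m4 = fmean([(v - mean) ** 4 for v in values])
    return m4 / (m2 ** 2) - 3.0


def temporal_features(trajectory):
    """Hypothetical feature vector for one trajectory (assumed, not from the paper)."""
    velocity = first_differences(trajectory)
    return {
        "velocity_sd": pstdev(velocity),
        "lag1_autocorr": lag1_autocorr(velocity),
        "velocity_kurtosis": excess_kurtosis(velocity),
    }


if __name__ == "__main__":
    # Synthetic toy data only: a smooth oscillation and a jittered copy of it.
    rng = random.Random(0)
    smooth = [math.sin(2 * math.pi * t / 50) for t in range(200)]
    jittered = [x + rng.gauss(0.0, 0.1) for x in smooth]
    print("smooth  :", temporal_features(smooth))
    print("jittered:", temporal_features(jittered))
```

Each feature has a direct reading (how variable, how predictable, how heavy-tailed the frame-to-frame movement is), which is the kind of interpretability the abstract contrasts with deep learning detectors. Which trajectory type, if either, resembles real or manipulated faces is an empirical question the toy data cannot answer.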