arxiv_cs_cv April 24, 2026

What's Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews

Translated: 2026/4/24 19:51:10

multimodal, news-preview, llm, misinformation, image-text-pair

Original Content

arXiv:2601.05563v3 Announce Type: replace

Abstract: Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article supports. This covert harm is subtler than explicit misinformation, yet remains underexplored. To address this gap, we develop a multi-stage pipeline that simulates preview-based and context-based understanding, enabling construction of the MM-Misleading benchmark. Using MM-Misleading, we systematically evaluate open-source LVLMs and uncover pronounced blind spots in omission-based misleadingness detection. We further propose OMGuard, which combines (1) Interpretation-Aware Fine-Tuning for misleadingness detection and (2) Rationale-Guided Misleading Content Correction, where explicit rationales guide headline rewriting to reduce misleading impressions. Experiments show that OMGuard lifts an 8B model's detection accuracy to the level of a 235B LVLM while delivering markedly stronger end-to-end correction. Further analysis shows that misleadingness usually arises from local narrative shifts, such as missing background, instead of global frame changes, and identifies image-driven cases where text-only correction fails, underscoring the need for visual interventions.
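
To make the notion of interpretation drift concrete, here is a minimal Python sketch. It is not the paper's pipeline: the abstract does not say how preview-based and context-based understanding are compared, so everything here is an assumption made for illustration. The `Judgment` class, the stance labels, the use of total variation distance, and the `threshold` of 0.3 are all hypothetical. The two judgments are supplied by hand rather than produced by a simulated reader.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Judgment:
    """Hypothetical reader judgment: probability mass over stance labels."""
    stance: Dict[str, float]


def interpretation_drift(preview: Judgment, context: Judgment) -> float:
    """Total variation distance between two stance distributions.

    Returns 0.0 when the preview-based and context-based judgments agree
    and 1.0 when they place all mass on different labels. This metric is
    an illustrative choice, not one taken from the paper.
    """
    labels = set(preview.stance) | set(context.stance)
    return 0.5 * sum(
        abs(preview.stance.get(label, 0.0) - context.stance.get(label, 0.0))
        for label in labels
    )


def is_misleading(preview: Judgment, context: Judgment,
                  threshold: float = 0.3) -> bool:
    """Flag a preview when drift exceeds an assumed, arbitrary threshold."""
    return interpretation_drift(preview, context) > threshold


if __name__ == "__main__":
    # Judgment formed from the image-headline pair alone (made-up numbers).
    from_preview = Judgment({"negative": 0.8, "neutral": 0.2})
    # Judgment formed after reading the full article (made-up numbers).
    from_article = Judgment({"negative": 0.3, "neutral": 0.7})
    print(interpretation_drift(from_preview, from_article))
    print(is_misleading(from_preview, from_article))
```

In this toy setup, a factually correct preview is still flagged when the judgment it supports differs enough from the judgment the full article supports, which is the kind of omission-based harm the abstract describes.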