arxiv_cs_cv April 24, 2026

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

Translated: 2026/4/24 19:46:35

vision-language-models, social-media-analysis, climate-change, computer-vision, prompt-engineering

Original Content

arXiv:2604.21786v1 Announce Type: new

Abstract: Social media platforms have become primary arenas for climate communication, generating millions of images and posts that, if systematically analysed, can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models (VLMs) and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter), a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images with 50,000 labels manually validated, spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population-level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation-dimension-specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.
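The contrast between instance-level and distributional evaluation can be illustrated with a small sketch. It is not taken from the paper or its repository. The class names, the toy labels, and the choice of total variation distance as the distributional measure are all assumptions made for illustration, and the authors may use different metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Instance-level metric: fraction of images whose predicted label is correct."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_distribution(labels, classes):
    """Share of each class in a list of labels (population-level view)."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def total_variation(p, q):
    """Total variation distance between two distributions over the same classes."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


# Hypothetical "image type" labels for ten images (toy data, not from the paper).
classes = ["photo", "illustration", "chart"]
y_true = ["photo"] * 5 + ["illustration"] * 3 + ["chart"] * 2
y_pred = [
    "photo", "photo", "photo", "illustration", "chart",  # true label: photo
    "illustration", "illustration", "photo",             # true label: illustration
    "chart", "photo",                                    # true label: chart
]

acc = accuracy(y_true, y_pred)
tv = total_variation(
    label_distribution(y_true, classes),
    label_distribution(y_pred, classes),
)
print(f"per-image accuracy: {acc:.2f}")        # 4 of 10 images are mislabelled
print(f"total variation distance: {tv:.2f}")   # class shares still match exactly
```

In this toy case the per-image errors cancel out across classes, so the predicted class shares equal the true ones even though only six of ten images are labelled correctly. Real predictions will rarely cancel this cleanly, which is why the distributional measure is best reported alongside per-image metrics and not in place of them.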