arxiv_cs_cv April 24, 2026

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Translated: 2026/4/24 19:40:26

chain-of-thought, multimodal-ai, plant-pathology, visual-reasoning, diagnostic-accuracy

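Translated Abstract

arXiv:2604.20983v1
Announce Type: new
Abstract: Vision evaluation is usually carried out as a multi-step process. In many contemporary fields, experts analyze images through structured, evidence-based, adaptive questioning. In plant pathology, a botanist examines a leaf image, identifies visual cues, infers the diagnostic intent, and then digs deeper with targeted questions adapted to species, symptoms, and severity. This structured inquiry is essential for accurate disease diagnosis and treatment planning. Current vision-language models, however, are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in plant diagnosis. We formalize a Chain of Inquiry framework that models a diagnostic trajectory as an ordered sequence of question-answer pairs conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs, annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier multimodal large language models show that, although they can describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured, question-guided inquiry substantially improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA will serve as a foundational benchmark that advances research on training diagnostic agents that reason like expert botanists rather than like static classifiers.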

Original Content

arXiv:2604.20983v1
Announce Type: new
Abstract: Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
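
The abstract describes a diagnostic trajectory as an ordered question-answer sequence conditioned on visual cues and an explicit intent. One way to picture such a record is the minimal Python sketch below. It is an illustration only, not the authors' schema: the class names, field names, intent labels, and the sample record are all hypothetical, and the actual PlantInquiryVQA format may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InquiryStep:
    intent: str                 # hypothetical epistemic-intent label
    question: str
    answer: str
    visual_cues: List[str] = field(default_factory=list)  # hypothetical cue tags


@dataclass
class InquiryChain:
    image_id: str
    severity: str               # hypothetical severity label
    steps: List[InquiryStep] = field(default_factory=list)


def build_prompt(chain: InquiryChain, k: int) -> str:
    """Format the prompt for step k: earlier Q/A turns as history, then the
    current intent, cues, and question. The answer to step k is withheld."""
    lines = [f"Image: {chain.image_id}"]
    for step in chain.steps[:k]:
        lines.append(f"Q: {step.question}")
        lines.append(f"A: {step.answer}")
    current = chain.steps[k]
    lines.append(f"Intent: {current.intent}")
    lines.append(f"Cues: {', '.join(current.visual_cues)}")
    lines.append(f"Q: {current.question}")
    return "\n".join(lines)


# Invented sample record, for illustration only (not taken from the dataset).
example = InquiryChain(
    image_id="leaf_0001",
    severity="moderate",
    steps=[
        InquiryStep("identify_species", "Which plant is shown?",
                    "Tomato", ["leaf shape"]),
        InquiryStep("describe_symptom", "What lesions are visible?",
                    "Brown concentric spots", ["lesion"]),
        InquiryStep("diagnose", "What is the most likely disease?",
                    "Early blight", ["lesion", "yellow halo"]),
    ],
)

print(build_prompt(example, 2))
```

Calling `build_prompt(example, 2)` yields the first two question-answer turns as history followed by the third question, which mirrors the multi-step setting the abstract contrasts with single-turn question answering. For scale, the reported counts (138,068 question-answer pairs over 24,950 images) work out to roughly 5.5 pairs per image on average.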