arxiv_cs_cv April 24, 2026

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

Translated: 2026/4/24 19:43:47

s1-vl, multimodal-reasoning, chain-of-thought, image-manipulation, scientific-data

Original Content

arXiv:2604.21409v1 (Announce Type: new)

Abstract: We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks (HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*) and outperforms the compared systems on scientific reasoning benchmarks such as Physics and VRSBench.
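The adaptive data routing idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how visual information gain is measured or how a trajectory is converted. The `visual_gain` score, the `0.3` threshold, the `[code]` step prefix, and the names `Trajectory`, `is_image_op`, and `route` are all hypothetical choices made for illustration. Dropping the image-operation steps is only the simplest possible stand-in for the conversion into Reasoning-mode data.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Trajectory:
    """One training sample: a question, its reasoning steps, and a mode label."""
    question: str
    steps: List[str]
    visual_gain: float  # hypothetical score in [0, 1]; the metric is not defined in the abstract
    mode: str = "thinking-with-images"


def is_image_op(step: str) -> bool:
    # Hypothetical convention: steps that run image-processing code start with "[code]".
    return step.startswith("[code]")


def route(sample: Trajectory, threshold: float = 0.3) -> Trajectory:
    """Keep image operations only when they add enough visual information.

    Samples below the (assumed) threshold lose their image-operation steps
    and are relabeled as pure Reasoning-mode data.
    """
    if sample.visual_gain >= threshold:
        return sample
    text_only = [s for s in sample.steps if not is_image_op(s)]
    return replace(sample, steps=text_only, mode="reasoning")


samples = [
    Trajectory(
        "Read the peak value from the chart.",
        ["[code] crop(region)", "The peak is near 4.2."],
        visual_gain=0.8,
    ),
    Trajectory(
        "What number appears in the caption?",
        ["[code] rotate(90)", "The answer is 4."],
        visual_gain=0.1,
    ),
]
routed = [route(s) for s in samples]
```

In this sketch the first sample keeps its crop step and stays in Thinking-with-Images mode. The second loses its low-value rotation and becomes a Reasoning-mode sample. This mirrors the stated goal of teaching the model when image operations are truly necessary.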