arxiv_cs_ai April 24, 2026

ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Translated: 2026/4/24 20:32:39

artificial-intelligence, large-language-models, strategic-reasoning, chess, evaluation-framework

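Translated Abstract

arXiv:2509.24239v4 Announce Type: replace-cross Recent large language models (LLMs) have shown strong reasoning capabilities, but one critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this question, we propose ChessArena, a chess-based testbed for evaluating LLMs. Chess requires strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework in which LLMs play against each other under four play modes. We evaluate 13 LLMs across more than 800 games, testing basic understanding, move selection, and puzzle solving. The results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some models even lose to random play. In addition, our fine-tuned Qwen3-8B substantially improves performance, approaching much larger frontier reasoning models.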

Original Content

arXiv:2509.24239v4 Announce Type: replace-cross Abstract: Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs across over 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.