arxiv_cs_lg April 24, 2026

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

thermoqa, large-language-models, thermodynamics, llm-benchmarking, coolprop

Original Content

arXiv:2604.19758v1 Announce Type: cross

Abstract: We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from ±0.1% to ±2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa
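The abstract reports three kinds of summary numbers: a composite score, a cross-tier degradation in percentage points, and a multi-run sigma. It does not spell out how each is computed, so the following is only a minimal sketch under stated assumptions: the composite is taken as a question-count-weighted average of tier accuracies, degradation as the mean tier 1 accuracy minus the mean tier 3 accuracy, and sigma as the sample standard deviation of the composite across runs. The per-run accuracies are hypothetical placeholders, not results from the paper, and all function names are illustrative.

```python
from statistics import mean, stdev

# Tier sizes as stated in the abstract (110 + 101 + 82 = 293 questions).
TIER_SIZES = {"tier1": 110, "tier2": 101, "tier3": 82}

# Hypothetical per-run tier accuracies (%) for one model.
# These are placeholder values, NOT figures from the paper.
RUNS = [
    {"tier1": 96.0, "tier2": 90.0, "tier3": 84.0},
    {"tier1": 95.0, "tier2": 91.0, "tier3": 86.0},
    {"tier1": 97.0, "tier2": 89.0, "tier3": 85.0},
]


def composite(run):
    """Question-count-weighted accuracy (%) for one run (assumed weighting)."""
    total = sum(TIER_SIZES.values())
    return sum(run[t] * n for t, n in TIER_SIZES.items()) / total


def cross_tier_degradation(runs):
    """Mean tier 1 accuracy minus mean tier 3 accuracy, in percentage points
    (assumed definition)."""
    tier1 = mean([r["tier1"] for r in runs])
    tier3 = mean([r["tier3"] for r in runs])
    return tier1 - tier3


def multi_run_sigma(runs):
    """Sample standard deviation of the composite score across runs
    (assumed definition; the paper may use a different estimator)."""
    return stdev([composite(r) for r in runs])


if __name__ == "__main__":
    print(f"composite per run: {[round(composite(r), 2) for r in RUNS]}")
    print(f"cross-tier degradation: {cross_tier_degradation(RUNS):.1f} pp")
    print(f"multi-run sigma: {multi_run_sigma(RUNS):.2f}")
```

Keeping degradation and sigma as separate functions mirrors the abstract's point that run-to-run consistency is a different evaluation axis from accuracy: a model can have a small sigma and a large cross-tier drop, or the reverse.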