arxiv_cs_ai February 10, 2026

RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?

Translated: 2026/3/7 11:38:23

transformer, natural-language-processing, machine-learning-evaluation

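Translation (from Japanese)

Reliable financial reasoning requires not only the ability to answer, but also an understanding of when an answer cannot be justified. Real-world financial problems often rest on underlying assumptions that are never stated explicitly, so a problem can look solvable even though it lacks the information needed for a definite answer. We introduce REALFIN, a bilingual benchmark that evaluates financial reasoning by systematically removing essential premises from exam-style questions while keeping them linguistically plausible. We then evaluate models under three formulations, which test answering, recognizing missing information, and rejecting options that cannot be justified, and find that performance drops consistently when key conditions are absent. General-purpose models tend to over-commit and guess, while most finance-specialized models fail to clearly identify missing premises. These results reveal a critical gap in current evaluations and show that a reliable financial model must know when a question should not be answered.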

Original Content

arXiv:2602.07096v1 Announce Type: cross

Abstract: Reliable financial reasoning requires knowing not only how to answer, but also when an answer cannot be justified. In real financial practice, problems often rely on implicit assumptions that are taken for granted rather than stated explicitly, causing problems to appear solvable while lacking enough information for a definite answer. We introduce REALFIN, a bilingual benchmark that evaluates financial reasoning by systematically removing essential premises from exam-style questions while keeping them linguistically plausible. Based on this, we evaluate models under three formulations that test answering, recognizing missing information, and rejecting unjustified options, and find consistent performance drops when key conditions are absent. General-purpose models tend to over-commit and guess, while most finance-specialized models fail to clearly identify missing premises. These results highlight a critical gap in current evaluations and show that reliable financial models must know when a question should not be answered.
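
The abstract names three formulations (answering, recognizing missing information, rejecting unjustified options) but does not say how they are constructed. The following is only a minimal sketch of one way such an evaluation could be set up, assuming multiple-choice items. The `Item` class, the `ABSTAIN` wording, the helper functions, and the bond question are all hypothetical illustrations and are not taken from REALFIN.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical wording for an explicit "not enough information" choice.
ABSTAIN = "Cannot be determined from the information given"


@dataclass
class Item:
    question: str
    options: List[str]
    # Index of the justified option; None when an essential premise
    # has been removed and no option can be justified.
    gold: Optional[int]


def with_abstain_option(item: Item) -> Item:
    """Assumed 'recognize missing information' setup: add an abstain choice."""
    options = item.options + [ABSTAIN]
    gold = item.gold if item.gold is not None else len(options) - 1
    return Item(item.question, options, gold)


def option_judgments(item: Item) -> List[bool]:
    """Assumed 'reject unjustified options' setup: one accept/reject label per option."""
    return [i == item.gold for i in range(len(item.options))]


def accuracy(items: List[Item], predictions: List[int]) -> float:
    """Fraction of items whose predicted option index matches the gold index."""
    correct = sum(pred == item.gold for item, pred in zip(items, predictions))
    return correct / len(items)


# Made-up example: the coupon rate is the essential premise.
full = Item(
    "A bond pays a 5% annual coupon on a face value of 1,000. "
    "What is the annual coupon payment?",
    ["40", "50", "60"],
    gold=1,
)
stripped = Item(
    "A bond pays an annual coupon on a face value of 1,000. "
    "What is the annual coupon payment?",
    ["40", "50", "60"],
    gold=None,
)

items = [with_abstain_option(full), with_abstain_option(stripped)]
# A model that always guesses "50" is right on the full item only.
over_committing_predictions = [1, 1]
score = accuracy(items, over_committing_predictions)
```

In this sketch, a model that picks "50" for both questions is correct on the complete item but wrong on the premise-removed one, which illustrates the kind of over-commitment the abstract attributes to general-purpose models.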