arxiv_cs_ai February 10, 2026

Evaluating Semantic and Syntactic Understanding in Large Language Models for Payroll Systems

Translated: 2026/2/14 8:11:57

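Translated Summary

Large language models are now used every day for writing, search, and analysis, and their natural language understanding continues to improve. They nevertheless remain unreliable at exact numerical calculation and at producing outputs that are easy to audit. We take a synthetic payroll system as a focused, high-stakes case and examine whether large language models can understand a payroll schema, apply rules in the correct order, and produce results that are accurate to the cent. Our experiments cover a tiered dataset ranging from basic to complex scenarios, a range of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families (for example GPT, Claude, Perplexity, Grok, and Gemini). The results show clear regimes in which careful prompting is sufficient and regimes in which explicit computation is required. The paper offers a compact, reproducible framework and practical guidelines for using language models in settings that demand both accuracy and reliability.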

Original Content

arXiv:2601.18012v2 (announce type: replace-cross)

Abstract: Large language models are now used daily for writing, search, and analysis, and their natural language understanding continues to improve. However, they remain unreliable on exact numerical calculation and on producing outputs that are straightforward to audit. We study a synthetic payroll system as a focused, high-stakes example and evaluate whether models can understand a payroll schema, apply rules in the right order, and deliver cent-accurate results. Our experiments span a tiered dataset from basic to complex cases, a spectrum of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families including GPT, Claude, Perplexity, Grok and Gemini. Results indicate clear regimes where careful prompting is sufficient and regimes where explicit computation is required. The work offers a compact, reproducible framework and practical guidance for deploying LLMs in settings that demand both accuracy and assurance.
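To make "apply rules in the right order" and "cent-accurate" concrete, here is a minimal Python sketch of what explicit computation could look like. It is not the paper's schema or dataset: the rule set (gross pay, then a pre-tax deduction, then a flat tax), the function names, and the sample figures are all hypothetical and chosen only for illustration. It uses the standard-library `decimal` module so that rounding to the cent is explicit.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount: Decimal) -> Decimal:
    """Round to the nearest cent, with exact halves rounded up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def net_pay(hours: str, rate: str, pretax_deduction: str, tax_rate: str) -> Decimal:
    """Hypothetical rule order: gross -> pre-tax deduction -> tax -> net."""
    gross = to_cents(Decimal(hours) * Decimal(rate))
    taxable = gross - Decimal(pretax_deduction)
    tax = to_cents(taxable * Decimal(tax_rate))
    return taxable - tax


def net_pay_tax_first(hours: str, rate: str, pretax_deduction: str, tax_rate: str) -> Decimal:
    """Same hypothetical rules applied in the wrong order: tax is taken on gross."""
    gross = to_cents(Decimal(hours) * Decimal(rate))
    tax = to_cents(gross * Decimal(tax_rate))
    return gross - tax - Decimal(pretax_deduction)


if __name__ == "__main__":
    args = ("37.5", "21.37", "50.00", "0.2")
    print(net_pay(*args))            # 601.10
    print(net_pay_tax_first(*args))  # 591.10
```

With these made-up inputs, the two rule orderings differ by 10.00, which shows why rule order has to be checked as well as the arithmetic. Passing amounts as strings keeps binary floating-point values out of the calculation entirely.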