arxiv_cs_ai February 10, 2026

LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation

Translated: 2026/3/7 7:43:38

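Translation

Finite-state reasoning, the ability to understand and implement state-dependent behavior, plays a central role in hardware design. This paper presents LLM-FSM, a benchmark that evaluates whether large language models (LLMs) can recover the behavior of a finite-state machine (FSM) from a natural-language specification and translate it into a correct register-transfer-level (RTL) implementation. Unlike prior specification-to-RTL benchmarks, which rely on manually written examples, LLM-FSM is built entirely by an automated pipeline. The pipeline first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context and to convert that YAML into a natural-language specification. From the same YAML, the reference RTL and testbench are synthesized in a correct-by-construction manner. All 1,000 problems are verified with LLM-based and SAT-solver-based checks, and a subset is reviewed by humans. The experiments show that even the strongest LLMs lose accuracy sharply as FSM complexity grows. Training-time scaling through supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while additional test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible: its FSM complexity can be raised as model capabilities improve.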

Original Content

arXiv:2602.07032v1 Announce Type: new

Abstract: Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register-transfer-level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
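
The first and last pipeline stages named in the abstract (constructing an FSM with a configurable state count, then deriving reference RTL and expected outputs from the same description) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The names `FSM`, `random_fsm`, `simulate`, and `to_verilog` are hypothetical, the Moore-style 1-bit output is an assumption, and the "chain" reachability rule merely stands in for the paper's unspecified transition constraints. The YAML and natural-language stages, which rely on LLM prompting, are omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class FSM:
    n_states: int
    n_inputs: int      # number of distinct input symbols
    next_state: list   # next_state[s][i] is the successor of state s on input i
    output: list       # assumed Moore-style 1-bit output per state


def random_fsm(n_states, n_inputs=2, seed=0):
    """Build a random FSM in which every state is reachable from state 0."""
    rng = random.Random(seed)
    next_state = [[rng.randrange(n_states) for _ in range(n_inputs)]
                  for _ in range(n_states)]
    # Assumed constraint: force a chain 0 -> 1 -> ... -> n-1 on one random
    # input per state, so that no state is unreachable.
    for s in range(n_states - 1):
        next_state[s][rng.randrange(n_inputs)] = s + 1
    output = [rng.randrange(2) for _ in range(n_states)]
    return FSM(n_states, n_inputs, next_state, output)


def simulate(fsm, inputs, state=0):
    """Golden model: the output observed after each clock edge."""
    outs = []
    for i in inputs:
        state = fsm.next_state[state][i]
        outs.append(fsm.output[state])
    return outs


def to_verilog(fsm, name="fsm"):
    """Emit Verilog text directly from the transition table."""
    sw = max(1, (fsm.n_states - 1).bit_length())   # state register width
    iw = max(1, (fsm.n_inputs - 1).bit_length())   # input port width
    v = [f"module {name}(input clk, input rst, "
         f"input [{iw - 1}:0] din, output reg out);",
         f"  reg [{sw - 1}:0] state, next_state;",
         f"  always @(posedge clk) state <= rst ? {sw}'d0 : next_state;",
         "  always @(*) begin",
         "    next_state = state;  // hold on unlisted encodings",
         "    out = 1'b0;",
         "    case (state)"]
    for s, row in enumerate(fsm.next_state):
        v.append(f"      {sw}'d{s}: begin")
        v.append(f"        out = 1'b{fsm.output[s]};")
        v.append("        case (din)")
        for i, t in enumerate(row):
            v.append(f"          {iw}'d{i}: next_state = {sw}'d{t};")
        v.append("        endcase")
        v.append("      end")
    v += ["    endcase", "  end", "endmodule"]
    return "\n".join(v)
```

Calling `print(to_verilog(random_fsm(4)))` prints a module for a four-state machine, and `simulate` gives the output sequence a testbench could compare against. Because both the RTL text and the expected outputs come from the same transition table, they agree by construction, which is the property the abstract attributes to its reference RTL and testbench.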