arxiv_cs_ai February 10, 2026

Fin-RATE: A New Benchmark for Evaluating Large Language Models (LLMs) on the Analysis of SEC Filings

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Translated: 2026/3/7 12:27:30

llm, financial-analysis, regulatory-disclosures, performance-evaluation

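Japanese Translation (rendered in English)

As Large Language Models (LLMs) see growing deployment in the financial industry, they are increasingly expected to interpret complex regulatory disclosures. However, existing benchmarks tend to focus on isolated details and do not capture the complexity of professional analysis, which requires synthesizing information across multiple documents, reporting periods, and companies. They also do not distinguish whether errors stem from retrieval failures, generation flaws, finance-specific reasoning mistakes, or a misunderstanding of the query or context, which makes performance bottlenecks hard to pinpoint. To address these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings that mirrors the workflow of professional analysts through three pathways: detail-oriented reasoning within individual disclosures, comparison of different companies under a shared topic, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. The results show substantial performance degradation: accuracy drops by 18.60% as tasks move from single-document reasoning to tracking the same company over time, and by 14.35% when they move to comparing different firms under a shared topic. The decline is accompanied by more comparison hallucinations and more time and entity mismatches, problems that prior evaluation benchmarks had not formalized or quantified.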

Original Content

arXiv:2602.07294v1 Announce Type: cross Abstract: With increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. They do not distinguish whether errors stem from retrieval failures, generation flaws, finance-specific reasoning mistakes, or misunderstanding of the query or context. This makes it difficult to pinpoint performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark that is built on U.S. Securities and Exchange Commission (SEC) filings and mirrors financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This is driven by rising comparison hallucinations and time and entity mismatches, and is mirrored by declines in reasoning and factuality, limitations that prior benchmarks have yet to formally categorize or quantify.
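
To make the reported accuracy drops concrete, the following minimal sketch shows one way per-pathway accuracy and the drop from the single-document baseline could be computed. It is not the authors' evaluation code. The names `Pathway`, `accuracy_by_pathway`, and `drop_from_baseline` are hypothetical, the judgments are made-up toy data, and the sketch assumes the drops are absolute percentage points, which the abstract does not state.

```python
from collections import defaultdict
from enum import Enum


class Pathway(Enum):
    # The three pathways named in the abstract; identifiers are hypothetical.
    DETAIL = "detail-oriented reasoning"
    CROSS_ENTITY = "cross-entity comparison"
    LONGITUDINAL = "longitudinal tracking"


def accuracy_by_pathway(records):
    """Return {pathway: accuracy} from (Pathway, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for pathway, is_correct in records:
        totals[pathway] += 1
        correct[pathway] += int(is_correct)
    return {p: correct[p] / totals[p] for p in totals}


def drop_from_baseline(acc, baseline=Pathway.DETAIL):
    """Absolute drop, in percentage points, relative to the baseline pathway.

    Assumption: drops are measured in percentage points, not relative change.
    """
    base = acc[baseline]
    return {p: (base - a) * 100 for p, a in acc.items() if p is not baseline}


# Toy, made-up judgments for illustration only; NOT results from the paper.
toy = (
    [(Pathway.DETAIL, True)] * 8 + [(Pathway.DETAIL, False)] * 2
    + [(Pathway.CROSS_ENTITY, True)] * 6 + [(Pathway.CROSS_ENTITY, False)] * 4
    + [(Pathway.LONGITUDINAL, True)] * 5 + [(Pathway.LONGITUDINAL, False)] * 5
)

acc = accuracy_by_pathway(toy)
drops = drop_from_baseline(acc)

for pathway, drop in drops.items():
    print(f"{pathway.value}: {drop:.2f} points below the single-document baseline")
```

Keeping the pathway label on every record means the same list of judgments could also be grouped by error type (retrieval, generation, finance-specific reasoning, query or context misunderstanding) if such labels were available.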