arxiv_cs_cv April 24, 2026

ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

Translated: 2026/4/24 19:48:11

arfbench, timeseriesqa, datadog, foundation-models, llm-evaluation

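Translation (from Japanese)

arXiv:2604.21199v1 Announce Type: cross Abstract: Time series question answering (TSQA), the task of posing natural language questions to infer and reason about the properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates how well multimodal foundation models (FMs) understand the time series anomalies prevalent in software incident data. ARFBench comprises 750 questions covering 142 time series and 5.38M data points, collected from 63 production incidents drawn exclusively from Datadog's internal telemetry. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs, and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves 62.7% accuracy and 51.9% F1. Next, we demonstrate the promise of specialized multimodal approaches: we develop a novel TSFM + VLM hybrid prototype, post-trained on a small set of synthetic and real data, that attains overall F1 and accuracy comparable to frontier models. Finally, we find that models and human domain experts have complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, which yields 82.8% F1 and 87.2% accuracy and establishes a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.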

Original Content

arXiv:2604.21199v1 Announce Type: cross

Abstract: Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.
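
The best-of-2 oracle selector described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (`oracle_select`, `accuracy`, `macro_f1`), the single-label answer format, the toy answers, and the choice of macro-averaged F1 are all assumptions made for illustration, since the abstract does not say how F1 is averaged or how answers are encoded.

```python
from typing import Sequence


def oracle_select(model: Sequence[str], expert: Sequence[str],
                  truth: Sequence[str]) -> list[str]:
    """Best-of-2 oracle: per question, keep whichever answer matches the
    ground truth; if neither does, fall back to the model's answer."""
    if not (len(model) == len(expert) == len(truth)):
        raise ValueError("all answer lists must have the same length")
    return [e if (e == t and m != t) else m
            for m, e, t in zip(model, expert, truth)]


def accuracy(pred: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def macro_f1(pred: Sequence[str], truth: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 (an assumed averaging scheme)."""
    labels = sorted(set(truth) | set(pred))
    scores = []
    for label in labels:
        tp = sum(p == label and t == label for p, t in zip(pred, truth))
        fp = sum(p == label and t != label for p, t in zip(pred, truth))
        fn = sum(p != label and t == label for p, t in zip(pred, truth))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy answers for five multiple-choice questions.
truth = ["A", "B", "C", "A", "B"]
model = ["A", "C", "C", "B", "A"]   # correct on questions 0 and 2
expert = ["B", "B", "A", "A", "A"]  # correct on questions 1 and 3

oracle = oracle_select(model, expert, truth)
print("model accuracy: ", accuracy(model, truth))
print("expert accuracy:", accuracy(expert, truth))
print("oracle accuracy:", accuracy(oracle, truth))
print("oracle macro F1:", round(macro_f1(oracle, truth), 3))
```

In this toy case the model and the expert are each right on different questions, so the oracle scores higher than either alone. That is the sense in which complementary strengths raise the oracle's ceiling; the numbers here are illustrative only and unrelated to the paper's reported results.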