arxiv_cs_lg April 24, 2026

Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach

arXiv:2604.20145v1 Announce Type: cross

Translated: 2026/4/24 20:03:48

machine-learning, cloud-computing, bigquery, data-warehouse, cost-estimation

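Translated Abstract

Cloud data warehouses bill for compute according to the slot-time consumed. In shared multi-tenant environments, query cost varies widely and is difficult to estimate accurately before execution, which leads to budget overruns and degraded scheduling. Static query-planner heuristics cannot capture complex SQL structure, data skew, or workload contention. This paper presents a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only signals observable prior to execution. These signals comprise a structured query complexity score derived from SQL operator costs, data volume features based on planner estimates and workload metadata, and textual features derived from the query text. Runtime factors that cannot be known at submission time (slot-pool utilization, cache state, realized skew) are deliberately excluded. The model uses a HistGradientBoostingRegressor with log-transformed slot-time as the target variable, and fuses a TF-IDF + TruncatedSVD-512 text pipeline with numeric and categorical features. Trained on 749 queries across seven deployment environments and evaluated out-of-distribution on 746 queries from two held-out environments, the model achieves an MAE of 1.17 slot-minutes, an RMSE of 4.71, and 74% explained variance on the full workload. On cost-significant queries (slot-time >= 0.01 min, N=282), the model achieves an MAE of 3.10, a 30-37% reduction relative to the predict-mean baseline (4.95) and the predict-median baseline (4.54). On long-tail queries (>= 20 min, N=22), however, the model does not outperform the trivial baselines, which is consistent with the hypothesis that long-tail queries are dominated by unobserved runtime factors outside the current feature scope. A complexity-routed dual-model architecture is described as a practical refinement, and directions for closing the long-tail gap are identified as future work.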

Original Content

arXiv:2604.20145v1 Announce Type: cross

Abstract: Cloud data warehouses bill compute based on slot-time consumed. In shared multi-tenant environments, query cost is highly variable and hard to estimate before execution, causing budget overruns and degraded scheduling. Static query-planner heuristics fail to capture complex SQL structure, data skew, and workload contention. We present a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only pre-execution observable signals: a structured query complexity score derived from SQL operator costs, data volume features from planner estimates and workload metadata, and textual features from query text. We deliberately exclude runtime factors (slot-pool utilization, cache state, realized skew) unknowable at submission. The model uses a HistGradientBoostingRegressor trained on log-transformed slot-time, with a TF-IDF + TruncatedSVD-512 text pipeline fused with numeric and categorical features. Trained on 749 queries across seven deployment environments and evaluated out-of-distribution on 746 queries from two held-out environments, the model achieves MAE 1.17 slot-minutes, RMSE 4.71, and 74% explained variance on the full workload. On cost-significant queries (slot-time >= 0.01 min, N=282) the model achieves MAE 3.10 versus 4.95 for a predict-mean baseline and 4.54 for predict-median, a 30-37% reduction. On long-tail queries (>= 20 min, N=22) the model does not outperform trivial baselines, consistent with the hypothesis that long-tail queries are dominated by unobserved runtime factors outside the current feature scope. A complexity-routed dual-model architecture is described as a practical refinement, and directions for closing the long-tail gap are identified as future work.
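The baseline comparison quoted above can be reproduced arithmetically from the reported MAE values. The sketch below shows how predict-mean and predict-median baselines are typically scored and how the relative reduction follows from the three reported numbers. The helper names are hypothetical, and fitting the constant on the training targets is an assumption, since the abstract does not say which split the baselines use.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def constant_baseline_mae(y_train, y_test, statistic):
    """Score a baseline that always predicts one constant.

    Assumption: the constant (mean or median) is computed on the training
    targets and then evaluated on the held-out targets.
    """
    constant = statistic(np.asarray(y_train))
    return mae(y_test, np.full(len(y_test), constant))


def relative_reduction(model_mae, baseline_mae):
    """Fraction by which the model's MAE undercuts a baseline's MAE."""
    return (baseline_mae - model_mae) / baseline_mae


# MAE values reported in the abstract for cost-significant queries (N=282).
reduction_vs_mean = relative_reduction(3.10, 4.95)    # (4.95 - 3.10) / 4.95
reduction_vs_median = relative_reduction(3.10, 4.54)  # (4.54 - 3.10) / 4.54
```

From the reported figures, the reduction is about 37% against predict-mean and about 32% against predict-median, which lies within the quoted 30-37% range.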