arxiv_cs_lg, April 24, 2026

CEDAR: Context Engineering for Agentic Data Science

Tags: cedar, agentic-data-science, llm-engineering, context-engineering, data-science

Original Content

arXiv:2601.06606v2 Announce Type: replace

Abstract: We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure on the initial prompt with DS-specific input fields that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts and the corresponding Python code ensure that data stays local and that only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges.
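The abstract gives no implementation details, so the following is only a minimal sketch of how the ideas it names might fit together, not CEDAR's actual code. Every name here (`summarize_column`, `Step`, `render_history`, `build_prompt`, `run_with_retries`, and the task field keys) is hypothetical, and the LLM call is replaced by a stub function supplied by the caller. It illustrates four points: only aggregate statistics reach the prompt, the history is an enumerated sequence of interleaved plan and code blocks, older outputs are shortened when the history is rendered, and failed code is regenerated with the error fed back.

```python
from dataclasses import dataclass
from statistics import mean


def summarize_column(name, values):
    """Return aggregate statistics for one column; raw rows stay local."""
    present = [v for v in values if v is not None]
    summary = {"column": name, "count": len(values),
               "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present),
                       mean=round(mean(present), 3))
    else:
        summary["distinct"] = len(set(present))
    return summary


@dataclass
class Step:
    plan: str          # natural-language plan for this step
    code: str          # Python code generated for that plan
    output: str = ""   # captured execution output (may be long)


def render_history(steps, keep_full=1, max_output_chars=80):
    """Render enumerated plan/code blocks; shorten outputs of older steps."""
    lines = []
    for i, step in enumerate(steps, start=1):
        recent = i > len(steps) - keep_full
        output = step.output if recent else step.output[:max_output_chars]
        if not recent and len(step.output) > max_output_chars:
            output += " ...[truncated]"
        lines += [f"[{i}] PLAN: {step.plan}",
                  f"[{i}] CODE:\n{step.code}",
                  f"[{i}] OUTPUT: {output}"]
    return "\n".join(lines)


def build_prompt(task_fields, column_summaries, steps):
    """Assemble a structured prompt: task fields, statistics, then history."""
    fields = "\n".join(f"{key}: {value}" for key, value in task_fields.items())
    stats = "\n".join(str(s) for s in column_summaries)
    return (f"## Task\n{fields}\n\n## Data summary\n{stats}\n\n"
            f"## History\n{render_history(steps)}")


def run_with_retries(generate_code, max_attempts=3):
    """Request code, execute it, and feed the error text back on failure."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(error)  # hypothetical LLM call, stubbed by caller
        try:
            namespace = {}
            exec(code, namespace)  # a real system would sandbox this
            return code, namespace, attempt
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(
        f"no working code after {max_attempts} attempts; last error: {error}")


# Illustrative usage with made-up data and a stubbed code generator.
columns = {"age": [22, 38, None, 35],
           "sex": ["male", "female", "female", "male"]}
summaries = [summarize_column(n, v) for n, v in columns.items()]
steps = [Step("Inspect missing values", "df.isna().sum()", "x" * 200),
         Step("Impute age with the median", "df['age'].fillna(31.0)", "done")]
prompt = build_prompt({"objective": "predict survival", "metric": "accuracy"},
                      summaries, steps)
print(prompt)
```

In this sketch the prompt contains column-level summaries and shortened step outputs rather than rows of data, which is one plausible reading of "data stays local". The retry loop uses bare `exec` for brevity; isolating the execution of generated code is left out.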