arxiv_cs_ai April 20, 2026

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

Translated: 2026/4/20 11:18:31

agents, llm-evaluation, cybersecurity, ctf, code-transformations

Original Content

arXiv:2602.05523v2 (announce type: replace-cross)

Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used to generate a family of semantically-equivalent challenges via semantics-preserving program transformations, enabling controlled evaluation of robustness while keeping the underlying exploit strategy fixed. We present Evolve-CTF, a tool that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to renaming and code insertion, but that composed transformations and deeper obfuscation degrade performance by requiring more sophisticated tool use. Enabling explicit reasoning has little effect on success rates. Our work contributes a technique and tool for future LLM evaluations, and a large dataset characterising the capabilities of current state-of-the-art models in this domain.
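To make the notion of a semantics-preserving transformation concrete, here is a minimal sketch of two of the transformation kinds the abstract names, renaming and code insertion, applied to a toy Python function with the standard `ast` module. It is an illustration under stated assumptions, not Evolve-CTF's implementation: the `check` function, the rename mapping, and the helpers `RenameLocals`, `insert_dead_code`, `make_variant`, and `load_check` are all hypothetical names chosen for this example.

```python
import ast

# Toy stand-in for a challenge source file (hypothetical, not from Cybench or Intercode).
SOURCE = '''
def check(key):
    total = 0
    for ch in key:
        total = (total * 31 + ord(ch)) % 1000
    return total
'''


class RenameLocals(ast.NodeTransformer):
    """Rename selected identifiers consistently using a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes; names outside the mapping (e.g. ord) are kept.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def insert_dead_code(tree):
    """Prepend an unused assignment to every function body."""
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.body.insert(0, ast.parse("_unused = 0").body[0])
    return tree


def make_variant(source):
    """Compose both transformations and return the new source text."""
    tree = ast.parse(source)
    tree = RenameLocals({"key": "v0", "total": "v1", "ch": "v2"}).visit(tree)
    tree = insert_dead_code(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def load_check(source):
    """Execute a source string and return its check function."""
    namespace = {}
    exec(source, namespace)
    return namespace["check"]


variant = make_variant(SOURCE)
original_check = load_check(SOURCE)
variant_check = load_check(variant)
```

The variant looks different on the page but computes the same result for every input, so whatever reasoning recovers the answer from the original also works on the variant. That is the property a challenge family relies on. Comparing `original_check(s)` with `variant_check(s)` on sample inputs is only a spot check of equivalence, not a proof.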