arxiv_cs_ai April 24, 2026


LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software


Translated: 2026/4/24 20:35:20

logic-eval, program-repair, logical-vulnerabilities, large-language-models, software-security


Original Content

arXiv:2604.12994v2 Abstract: Logical vulnerabilities in software stem from flaws in program logic rather than from memory safety violations, and they can lead to critical security failures. Existing automated program repair techniques primarily target memory corruption vulnerabilities and struggle with logical vulnerabilities because of their limited semantic understanding of the vulnerable code and its expected behavior. Recent successes of large language models (LLMs) in understanding and repairing code are promising, but no framework currently exists to analyze the capabilities and limitations of such techniques for logical vulnerabilities. We aim to systematically evaluate both traditional and LLM-based repair approaches for addressing real-world logical vulnerabilities. To facilitate our assessment, we created the first-ever dataset, LogicDS, comprising 122 logical vulnerabilities that reflect tangible security impact. We also developed a systematic framework, LogicEval, to evaluate patches for logical vulnerabilities. Our evaluations suggest that compilation and testing failures are primarily driven by prompt sensitivity, loss of code context, and difficulty in patch localization.