arxiv_cs_ai February 10, 2026

SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

Translated: 2026/2/14 7:11:13

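Translated Abstract (from Japanese)

Abstract: As Large Language Models (LLMs) advance rapidly, their safety has become a critical concern. Existing benchmarks mainly evaluate single-turn dialogues or a single jailbreak attack method, and they do not examine in detail how well an LLM identifies and handles unsafe information. To address this, the authors propose SafeDialBench, a fine-grained benchmark that evaluates LLM safety under diverse jailbreak attacks in multi-turn dialogues. Specifically, they design a two-tier hierarchical safety taxonomy covering 6 safety dimensions and generate multi-turn dialogues from it. Seven jailbreak attack strategies are used to build a high-quality dialogue dataset, and the evaluation covers each LLM's ability to identify and handle unsafe information as well as its consistency under complex jailbreak attacks. The experiments show that Yi-34B-Chat and GLM4-9B-Chat deliver strong safety performance, while some models, such as Llama3.1-8B-Instruct and o3-mini, exhibit safety vulnerabilities.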

Original Content

arXiv:2502.11090v4 Announce Type: replace-cross

Abstract: With the rapid advancement of Large Language Models (LLMs), their safety has become a critical concern that requires precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method when assessing safety. Additionally, these benchmarks have not examined in detail the LLM's capability to identify and handle unsafe information. To address these issues, we propose SafeDialBench, a fine-grained benchmark for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions, and we generate more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework for LLMs that measures their capabilities in detecting and handling unsafe information and in maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
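To make the evaluation setup more concrete, here is a minimal Python sketch of how per-dialogue scores for the three capabilities named in the abstract (detecting unsafe information, handling it, and staying consistent) might be averaged per model and attack strategy. It is an illustration under stated assumptions, not the authors' implementation: the record layout, the field names, the `summarize` helper, the numeric scale, and every score value are hypothetical. The abstract does not name the 6 safety dimensions, so placeholder labels are used.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three capabilities named in the abstract; the key names are our own.
CAPABILITIES = ("identification", "handling", "consistency")


@dataclass
class DialogueScore:
    """One scored multi-turn dialogue. All field names are hypothetical."""
    model: str
    dimension: str  # placeholder label; the abstract does not name the 6 dimensions
    attack: str     # jailbreak strategy, e.g. "reference attack"
    language: str   # "zh" or "en"
    scores: dict    # capability name -> numeric score (the scale is an assumption)


def summarize(records):
    """Average each capability per (model, attack) pair."""
    buckets = defaultdict(lambda: defaultdict(list))
    for record in records:
        for capability in CAPABILITIES:
            buckets[(record.model, record.attack)][capability].append(
                record.scores[capability]
            )
    return {
        key: {capability: mean(values) for capability, values in per_cap.items()}
        for key, per_cap in buckets.items()
    }


# Made-up scores for a made-up model, purely to exercise the helper.
records = [
    DialogueScore("model-a", "dimension-1", "reference attack", "en",
                  {"identification": 8, "handling": 6, "consistency": 7}),
    DialogueScore("model-a", "dimension-2", "reference attack", "zh",
                  {"identification": 6, "handling": 8, "consistency": 9}),
    DialogueScore("model-a", "dimension-1", "purpose reverse", "en",
                  {"identification": 4, "handling": 5, "consistency": 3}),
]

summary = summarize(records)
for (model, attack), averages in sorted(summary.items()):
    print(model, attack, averages)
```

Grouping by attack strategy is only one possible view. The same helper could key on `dimension` or `language` instead, if a per-dimension or per-language breakdown is of interest.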