Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models
arXiv:2602.07013v1 Announce Type: new
Abstract: With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first examine the challenges above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusal, offering a scalable path toward user-adaptive safety alignment in VLMs.
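To illustrate the general idea behind components (1) and (2), here is a minimal NumPy sketch of gated activation steering. It uses a simple difference-of-means refusal vector and a dot-product gate; the paper's actual teacher-forced extraction, gating design, and vision module are not specified in the abstract, so every function and threshold below is an illustrative assumption, not CR-VLM's implementation.

```python
import numpy as np

def extract_refusal_vector(refusal_acts: np.ndarray, accept_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means steering vector (a common activation-steering recipe).

    refusal_acts, accept_acts: (n_samples, hidden_dim) activations collected
    from refused vs. accepted prompts. Returns a unit-norm direction.
    """
    v = refusal_acts.mean(axis=0) - accept_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def gated_steer(h: np.ndarray,
                refusal_vec: np.ndarray,
                accept_vec: np.ndarray,
                alpha: float = 4.0,
                tau: float = 0.0) -> np.ndarray:
    """Toy gating mechanism: steer toward refusal only when the activation
    does NOT already align with the acceptance direction, leaving in-scope
    queries untouched (mitigating over-refusal).

    alpha scales the steering strength; tau is the gate threshold.
    """
    gate = float(np.dot(h, accept_vec) < tau)  # 1.0 -> steer, 0.0 -> pass through
    return h + gate * alpha * refusal_vec
```

In a real VLM this intervention would typically be applied inside a forward hook on a chosen transformer layer; the standalone functions here only show the vector arithmetic.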