arxiv_cs_ai April 24, 2026

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

Tags: llm-bias, sociolinguistics, safety-alignment, dialect-jailbreak, demographic-parity

Original Content

arXiv:2604.21152v1 Announce Type: cross

Abstract: As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether disparities in model behavior arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, a user's identity is often conveyed implicitly through a combination of socio-linguistic factors. This study disentangles these signals using a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against prompts with implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a paradox in LLM safety: users achieve "better" performance by sounding like a member of a demographic than by stating that they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful "dialect jailbreak," reducing refusal probability to near zero while achieving greater semantic similarity to the reference texts than Standard American English prompts. However, this "dialect jailbreak" introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience in which "standard" users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially more hostile information landscape. This highlights a fundamental tension in alignment between equity and linguistic diversity, and underscores the need for safety mechanisms that generalize beyond explicit cues.
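
The factorial comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the factor levels, the domain names, the record layout, and the keyword-based refusal check are all assumptions made for illustration, since the abstract does not specify how conditions were enumerated or how refusals were detected. The sketch only shows the general shape of crossing signal type with group and domain, then comparing refusal rates per factor level.

```python
from collections import defaultdict
from itertools import product

# Hypothetical factor levels. The abstract mentions explicit profiles and
# implicit dialect signals (e.g., AAVE, Singlish); the exact levels and the
# domain names below are placeholders, not taken from the paper.
SIGNAL_TYPES = ["explicit_profile", "implicit_dialect"]
GROUPS = ["AAVE", "Singlish"]
DOMAINS = ["domain_a", "domain_b"]

# Assumed keyword heuristic for refusals; the paper's detector is not described.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def build_conditions():
    """Enumerate every cell of the signal x group x domain grid."""
    return [
        {"signal": s, "group": g, "domain": d}
        for s, g, d in product(SIGNAL_TYPES, GROUPS, DOMAINS)
    ]


def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any marker phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate_by(records, key):
    """Return {level: fraction of refused responses} for one factor."""
    counts = defaultdict(lambda: [0, 0])  # level -> [refusals, total]
    for rec in records:
        cell = counts[rec[key]]
        cell[0] += is_refusal(rec["response"])
        cell[1] += 1
    return {level: refused / total for level, (refused, total) in counts.items()}
```

Given a list of records such as `{"signal": "explicit_profile", "response": "..."}`, calling `refusal_rate_by(records, "signal")` returns one refusal rate per signal type. A Standard American English control condition and a semantic-similarity score against reference texts would be needed to mirror the comparison in the abstract; both are omitted here.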