arxiv_cs_ai February 10, 2026

Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

Translated: 2026/3/7 14:17:37

llm, academic-recognition, bias-awareness, benchmark, interference

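Translated Abstract

Large language models (LLMs) are increasingly used to recommend academic experts. Existing audits mostly evaluate model outputs in isolation and largely ignore the interventions that end users apply at inference time. As a result, it is unclear whether failures such as refusals, hallucinations, and uneven coverage stem from the choice of model or from deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. It measures both technical quality and social representation using nine metrics. We instantiate the benchmark for physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. The results show that end-user interventions do not improve outcomes uniformly; instead, they shift error to other dimensions. Higher temperature lowers validity, consistency, and factuality. Representation-constrained prompting improves diversity at the cost of factuality, while RAG mainly improves technical quality but reduces diversity and parity. Overall, end-user interventions reshape trade-offs rather than offering a general fix. We release code and data that can be adapted to other disciplines by replacing the domain-specific ground truth and metrics.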

Original Content

arXiv:2602.08873v1 (Announce Type: cross)

Abstract: Large language models (LLMs) are increasingly used for academic expert recommendation. Existing audits typically evaluate model outputs in isolation, largely ignoring end-user inference-time interventions. As a result, it remains unclear whether failures such as refusals, hallucinations, and uneven coverage stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures both technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that end-user interventions do not yield uniform improvements but instead redistribute error across dimensions. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, while RAG primarily improves technical quality while reducing diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing a general fix. We release code and data that can be adapted to other disciplines by replacing domain-specific ground truth and metrics.
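
To make the idea of scoring recommendations under different interventions more concrete, here is a minimal sketch in Python. It is not the LLMScholarBench code, and the abstract does not define its nine metrics: the `factuality` and `consistency` functions below are simplified stand-ins (share of names found in a ground-truth set, and mean pairwise Jaccard overlap across repeated runs), and the names `audit`, `outputs_by_condition`, and the "Scholar A" placeholders are all hypothetical.

```python
from itertools import combinations


def factuality(recommended, ground_truth):
    """Assumed metric: share of recommended names found in the ground-truth set."""
    if not recommended:
        return 0.0
    hits = sum(name in ground_truth for name in recommended)
    return hits / len(recommended)


def consistency(runs):
    """Assumed metric: mean pairwise Jaccard overlap between repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    scores = []
    for first, second in pairs:
        a, b = set(first), set(second)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def audit(outputs_by_condition, ground_truth):
    """Score each intervention condition (e.g. a temperature setting) separately."""
    report = {}
    for condition, runs in outputs_by_condition.items():
        if not runs:
            continue  # nothing to score for this condition
        mean_factuality = sum(factuality(r, ground_truth) for r in runs) / len(runs)
        report[condition] = {
            "factuality": mean_factuality,
            "consistency": consistency(runs),
        }
    return report


# Toy placeholder data, not results from the paper.
ground_truth = {"Scholar A", "Scholar B", "Scholar C"}
outputs_by_condition = {
    "temperature=0.0": [["Scholar A", "Scholar B"], ["Scholar A", "Scholar B"]],
    "temperature=1.0": [["Scholar A", "Scholar X"], ["Scholar C", "Scholar Y"]],
}
report = audit(outputs_by_condition, ground_truth)
for condition, scores in report.items():
    print(condition, scores)
```

Reporting each condition side by side, rather than collapsing everything into one score, is what lets a trade-off show up: an intervention can raise one metric while lowering another.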