arxiv_cs_ai February 10, 2026

Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models

Translated: 2026/2/14 7:10:22

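Translated Abstract

Situational judgment tests (SJTs) for personality assessment have unique advantages over conventional Likert-type self-report scales, but developing them is labor-intensive and time-consuming and requires subject matter expertise. Recent advances in large language models (LLMs) show promise for automatic item generation (AIG). Building on these advances, this study develops and evaluates a structured, generalizable framework for automatically generating personality SJTs, using GPT-4 and ChatGPT-5 as empirical examples. Three studies were conducted. Study 1 systematically compared how prompt design and temperature settings affect the content validity of LLM-generated items, in order to develop an effective and stable LLM-based AIG approach; on GPT-4, optimized prompts with a temperature of 1.0 gave the best balance of creativity and accuracy. Study 2 tested whether the automated SJT generation approach generalizes across models and reproduces over multiple rounds; it yielded items of stable, high quality on ChatGPT-5 as well. Study 3 evaluated the psychometric properties (reliability and validity) of LLM-generated SJTs covering five facets of the Big Five personality traits. Reliability and validity were satisfactory for most facets, although limitations appeared in the convergent validity of the compliance facet and in some aspects of criterion-related validity. These findings indicate that the proposed LLM-based AIG approach can produce culturally appropriate, psychometrically sound SJTs with efficiency comparable to or better than traditional methods.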

Original Content

arXiv:2412.12144v4
Announce Type: replace-cross

Abstract: Personality assessment through situational judgment tests (SJTs) offers unique advantages over traditional Likert-type self-report scales, yet their development remains labor-intensive, time-consuming, and heavily dependent on subject matter experts. Recent advances in large language models (LLMs) have shown promise for automatic item generation (AIG). Building on these developments, the present study focuses on developing and evaluating a structured and generalizable framework for automatically generating personality SJTs, using GPT-4 and ChatGPT-5 as empirical examples. Three studies were conducted. Study 1 systematically compared the effects of prompt design and temperature settings on the content validity of LLM-generated items to develop an effective and stable LLM-based AIG approach for personality SJT. Results showed that optimized prompts and a temperature of 1.0 achieved the best balance of creativity and accuracy on GPT-4. Study 2 examined the cross-model generalizability and reproducibility of this automated SJT generation approach through multiple rounds. The results showed that the approach consistently produced reproducible and high-quality items on ChatGPT-5. Study 3 evaluated the psychometric properties of LLM-generated SJTs covering five facets of the Big Five personality traits. Results demonstrated satisfactory reliability and validity across most facets, though limitations were observed in the convergent validity of the compliance facet and certain aspects of criterion-related validity. These findings provide robust evidence that the proposed LLM-based AIG approach can produce culturally appropriate and psychometrically sound SJTs with efficiency comparable to or exceeding traditional methods.
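
The abstract reports "satisfactory reliability" in Study 3 but does not say which coefficient was used. As a rough illustration only, the sketch below assumes internal consistency measured with Cronbach's alpha, a common choice for multi-item scales. The function name `cronbach_alpha` and the toy scores are hypothetical and are not taken from the paper.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a facet (an assumed reliability measure).

    scores: list of respondents, each a list of k numeric item scores.
    """
    if len(scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent needs the same number of items")

    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data: 4 hypothetical respondents answering 3 SJT items of one facet.
toy_scores = [
    [3, 4, 3],
    [2, 2, 1],
    [4, 4, 5],
    [1, 2, 2],
]
print(round(cronbach_alpha(toy_scores), 3))
```

Higher values indicate that the items of a facet vary together across respondents. What counts as "satisfactory" depends on the threshold the authors adopted, which the abstract does not state.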