arxiv_cs_lg April 20, 2026

Large Language Models for Market Research: A Data Augmentation Approach

Large Language Models for Market Research: A Data-augmentation Approach

Translated: 2026/4/20 11:06:21

large-language-models, data-augmentation, conjoint-analysis, statistical-methods, market-research

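Japanese Translation (rendered in English)

arXiv:2412.19363v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have transformed artificial intelligence by excelling at complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities in market research, including conjoint analysis, where understanding consumer preferences is essential yet resource-intensive. Traditional survey-based methods are constrained in scalability and cost, which makes LLM-generated data a promising alternative. However, although LLMs have the potential to simulate real consumer behavior, recent studies point to a significant gap between LLM-generated data and human data, and to biases introduced when one is substituted for the other. This paper addresses that gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. In contrast to naive approaches that simply replace human data with LLM-generated data and can thereby exacerbate bias, the approach yields statistically robust estimators that are consistent and asymptotically normal. We also present a finite-sample performance bound on the estimation error. To validate the framework empirically, we conduct a study of COVID-19 vaccine preferences, which demonstrates its superior ability to reduce estimation error and to save 24.9% to 79.8% of data and costs. The naive approaches, by contrast, fail to save data because of the inherent biases in LLM-generated data relative to human data. A second empirical study on sports car choices further supports the robustness of our results. Our findings suggest that LLM-generated data is not a direct substitute for human responses but can serve as a valuable complement when used within a robust statistical framework.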

Original Content

arXiv:2412.19363v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We further present a finite-sample performance bound on the estimation error. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.
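
The contrast the abstract draws between naive substitution and statistically grounded augmentation can be illustrated with a toy simulation. The sketch below is not the paper's estimator, which the abstract does not specify. It swaps in a generic difference-style (control-variate) correction on a much simpler problem: estimating the share of respondents who pick one option. All names and numbers (`TRUE_SHARE`, `LLM_SHIFT`, `draw_pair`, `estimates`, `rmse`, the sample sizes) are assumptions made up for illustration, as is the premise that a small set of profiles has both a human and an LLM answer.

```python
import random
from statistics import fmean

TRUE_SHARE = 0.60  # assumed true share of humans choosing option A
LLM_SHIFT = 0.20   # assumed systematic LLM bias toward option A


def draw_pair(rng):
    """Simulate (human_choice, llm_choice) for one respondent profile."""
    u = rng.random()                # latent taste shared by both answers
    noise = rng.uniform(-0.1, 0.1)  # LLM-specific noise
    human = 1 if u < TRUE_SHARE else 0
    llm = 1 if u + noise < TRUE_SHARE + LLM_SHIFT else 0
    return human, llm


def estimates(n_human=500, n_llm=5000, seed=0):
    """Return three estimates of the human share choosing option A."""
    rng = random.Random(seed)
    # Small paired sample: a human answer and an LLM answer per profile.
    paired = [draw_pair(rng) for _ in range(n_human)]
    # Large cheap sample: LLM answers only.
    llm_only = [draw_pair(rng)[1] for _ in range(n_llm)]

    human = [h for h, _ in paired]
    # Naive approach: pool LLM answers as if they were human answers.
    naive = fmean(human + llm_only)
    # Difference-style correction: use the large LLM sample for precision
    # and the paired sample to estimate and remove the LLM's bias.
    correction = fmean([h - l for h, l in paired])
    augmented = fmean(llm_only) + correction
    return {"human_only": fmean(human), "naive": naive, "augmented": augmented}


def rmse(name, reps=30):
    """Root mean squared error of one estimator over repeated simulations."""
    errors = [estimates(seed=s)[name] - TRUE_SHARE for s in range(reps)]
    return fmean([e * e for e in errors]) ** 0.5


if __name__ == "__main__":
    for name in ("human_only", "naive", "augmented"):
        print(f"{name:>10}: RMSE = {rmse(name):.4f}")
```

Under these assumptions the pooled estimate inherits most of the LLM's shift no matter how many LLM answers are added, while the corrected estimate stays centered on the human share because the bias term is estimated from the paired sample and subtracted. How much the correction improves on the human-only estimate depends on how strongly the LLM answers track the human ones, which is a property of this simulation rather than a result from the paper.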