arxiv_cs_ai 2026/4/24

Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models

Translated: 2026/4/24 20:22:47

synthetic-data, educational-technology, resampling, deep-learning, privacy-preservation

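Translation (rendered into English from the Japanese)

arXiv:2604.21031v1
Announce Type: cross

Abstract: Synthetic data generation holds promise for addressing data scarcity and privacy concerns in educational technology, but practitioners lack empirical guidance for choosing between traditional resampling methods and modern deep learning methods. This study presents the first systematic benchmark comparing these paradigms, using a student performance dataset of 10,000 records. Three resampling methods (SMOTE, Bootstrap, Random Oversampling) are compared against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) and evaluated along several dimensions: distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real (TSTR) score), and privacy preservation (Distance to Closest Record, DCR). Our findings reveal a fundamental trade-off: resampling methods achieve very high utility (TSTR: 0.997) but fail completely at privacy protection (DCR ~ 0.00), whereas deep learning models provide strong privacy guarantees (DCR ~ 1.00) at a significant cost in utility. Variational Autoencoders (VAEs) were found to strike the best compromise, retaining 83.3% of predictive performance while providing complete privacy protection. We also offer actionable recommendations: use traditional resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount. This work establishes a foundational benchmark and a practical decision framework for synthetic data generation in learning analytics.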

Original Content

arXiv:2604.21031v1
Announce Type: cross

Abstract: Synthetic data generation offers promise for addressing data scarcity and privacy concerns in educational technology, yet practitioners lack empirical guidance for selecting between traditional resampling techniques and modern deep learning approaches. This study presents the first systematic benchmark comparing these paradigms using a 10,000-record student performance dataset. We evaluate three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) across multiple dimensions: distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility such as Train-on-Synthetic-Test-on-Real scores (TSTR), and privacy preservation (Distance to Closest Record). Our findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but completely fail privacy protection (DCR ~ 0.00), while deep learning models provide strong privacy guarantees (DCR ~ 1.00) at significant utility cost. Variational Autoencoders emerge as the optimal compromise, maintaining 83.3% predictive performance while ensuring complete privacy protection. We also provide actionable recommendations: use traditional resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount. This work establishes a foundational benchmark and practical decision framework for synthetic data generation in learning analytics.
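To make two of the metrics named in the abstract concrete, the following is a minimal sketch of the Kolmogorov-Smirnov distance and the Distance to Closest Record (DCR), using only the Python standard library. Everything in it is an assumption for illustration and not the paper's implementation: the toy two-column data, the function names, and in particular `noisy_copy`, a hypothetical stand-in for a generative model that simply adds Gaussian noise to real rows. The DCR here is a raw Euclidean distance; the abstract's values appear to lie on a 0 to 1 scale, and the normalization used there is not stated, so the magnitudes are not comparable.

```python
import math
import random


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    for x in sorted(set(a) | set(b)):
        # Advance each pointer past all values <= x to get CDF counts.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def dcr(synthetic, real):
    """Distance to Closest Record: for each synthetic row, the Euclidean
    distance to its nearest real row (brute force; fine for small data)."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


def random_oversample(real, n, rng):
    """Random oversampling: draw existing rows with replacement."""
    return [rng.choice(real) for _ in range(n)]


def noisy_copy(real, n, rng, scale=0.5):
    """Hypothetical stand-in for a generative model (NOT a VAE):
    real rows perturbed with Gaussian noise."""
    return [
        tuple(v + rng.gauss(0.0, scale) for v in rng.choice(real))
        for _ in range(n)
    ]


rng = random.Random(0)
# Toy "student" rows: (exam score, weekly study hours). Made-up data.
real = [(rng.gauss(70, 10), rng.gauss(5, 2)) for _ in range(200)]

resampled = random_oversample(real, 200, rng)
generated = noisy_copy(real, 200, rng)

mean_dcr_resampled = sum(dcr(resampled, real)) / len(resampled)
mean_dcr_generated = sum(dcr(generated, real)) / len(generated)
ks_score_column = ks_distance([r[0] for r in real], [g[0] for g in generated])

print("mean DCR, resampled:", mean_dcr_resampled)
print("mean DCR, generated:", mean_dcr_generated)
print("KS distance, score column:", ks_score_column)
```

Because random oversampling only repeats rows that already exist, every resampled record has a real record at distance zero, so its mean DCR is exactly 0. That is the mechanism behind the privacy failure the abstract reports for resampling methods. Any generator that produces new rows moves the DCR above zero, at some cost in fidelity that the KS distance picks up.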