arxiv_cs_lg, April 20, 2026

Scalable Maximum Entropy Population Synthesis via Persistent Contrastive Divergence

maximum-entropy, population-synthesis, persistent-contrastive-divergence, stochastic-optimization, urban-simulation

Original Content

arXiv:2603.27312v2 Announce Type: replace

Abstract: Maximum entropy (MaxEnt) modelling provides a principled framework for generating synthetic populations from aggregate census data, without access to individual-level microdata. The bottleneck of exact-enumeration approaches is the computation of expectations by explicit summation over the full tuple space $\mathcal{X}$, which becomes infeasible beyond $K \approx 20$ categorical attributes. Sampling-based alternatives exist, but they rely on Metropolis-type schemes that require proposal tuning and rejection steps. We propose \emph{GibbsPCDSolver}, a stochastic replacement for this computation based on Persistent Contrastive Divergence (PCD): a persistent pool of $N$ synthetic individuals is updated by Gibbs sweeps at each gradient step, providing a stochastic approximation of the model expectations without ever materialising $\mathcal{X}$. We validate the approach on controlled benchmarks and on \emph{Syn-ISTAT}, a $K = 15$ Italian demographic benchmark with analytically exact marginal targets derived from ISTAT-inspired conditional probability tables. Scaling experiments across $K \in \{12, 20, 30, 40, 50\}$ confirm that GibbsPCDSolver maintains $\mathrm{MRE} \in [0.010, 0.018]$ while $|\mathcal{X}|$ grows by eighteen orders of magnitude, with runtime scaling as $O(K)$ rather than $O(|\mathcal{X}|)$. On Syn-ISTAT, GibbsPCDSolver reaches $\mathrm{MRE} = 0.03$ on training constraints and, crucially, produces populations with effective sample size $N_{\mathrm{eff}} = N$ versus $N_{\mathrm{eff}} \approx 0.012\,N$ for generalised raking, an $86.8\times$ diversity advantage that is essential for agent-based urban simulations.
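
The persistent-pool idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' GibbsPCDSolver; it is a minimal toy under stated assumptions: the constraints are one-way marginals for every attribute plus a few two-way tables, the model is log-linear in those constraints, and the conditional draws use the Gumbel-max trick. The names `gibbs_sweep` and `fit_pcd`, the learning rate, the pool size, and the toy targets are all hypothetical choices made for illustration.

```python
import numpy as np

def gibbs_sweep(pool, unary, pair, rng):
    """One Gibbs sweep: resample each attribute of every pooled individual in turn."""
    n, k = pool.shape
    for a in range(k):
        # Conditional log-weights of attribute a given all other attributes.
        logits = np.tile(unary[a], (n, 1))            # shape (n, C_a)
        for (i, j), w in pair.items():                # w has shape (C_i, C_j)
            if i == a:
                logits += w[:, pool[:, j]].T
            elif j == a:
                logits += w[pool[:, i], :]
        # Gumbel-max trick: argmax(logits + Gumbel noise) is a softmax sample.
        noise = rng.gumbel(size=logits.shape)
        pool[:, a] = np.argmax(logits + noise, axis=1)
    return pool

def fit_pcd(target_unary, target_pair, n_pool=2000, steps=300, lr=0.5, seed=0):
    """Toy PCD loop: the pool persists across gradient steps (assumed settings)."""
    rng = np.random.default_rng(seed)
    sizes = [len(t) for t in target_unary]
    unary = [np.zeros(c) for c in sizes]
    pair = {ij: np.zeros_like(t) for ij, t in target_pair.items()}
    pool = np.column_stack([rng.integers(0, c, n_pool) for c in sizes])
    for _ in range(steps):
        pool = gibbs_sweep(pool, unary, pair, rng)
        # Gradient of the MaxEnt dual: target expectation minus pool estimate.
        for a, c in enumerate(sizes):
            model = np.bincount(pool[:, a], minlength=c) / n_pool
            unary[a] += lr * (target_unary[a] - model)
        for (i, j), t in target_pair.items():
            model = np.zeros_like(t)
            np.add.at(model, (pool[:, i], pool[:, j]), 1.0 / n_pool)
            pair[(i, j)] += lr * (t - model)
    return pool, unary, pair

# Toy usage: three attributes with 2, 3 and 2 categories (made-up targets).
targets_1d = [np.array([0.5, 0.5]),
              np.array([0.35, 0.30, 0.35]),
              np.array([0.7, 0.3])]
targets_2d = {(0, 1): np.array([[0.30, 0.15, 0.05],
                                [0.05, 0.15, 0.30]])}
pool, unary, pair = fit_pcd(targets_1d, targets_2d)
worst = max(
    np.abs(np.bincount(pool[:, a], minlength=len(t)) / len(pool) - t).max()
    for a, t in enumerate(targets_1d)
)
print("largest absolute error on one-way marginals:", worst)
```

In this sketch each sweep visits every attribute once and only touches the constraint tables that involve it, so the per-step cost depends on the number of attributes and constraints rather than on $|\mathcal{X}|$. Every pooled individual carries equal weight, which is the sense in which a sampled population can have an effective sample size equal to the pool size.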