arxiv_cs_ai April 24, 2026

Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

Translated: 2026/4/24 20:31:29

unsupervised-learning, speech-processing, neural-networks, syntax-evolution, ciwgan

Original Content

arXiv:2305.01626v4 Announce Type: replace-cross Abstract: Computational models of syntax are predominantly text-based. Here we propose that the most basic first step in the evolution of syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and elementary suboperations of syntax: concatenation. We introduce spontaneous concatenation, a phenomenon in which ciwGAN/fiwGAN models (based on convolutional neural networks) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated, without ever having access to training data that contains multiple words. We replicate this finding in several independently trained models with different hyperparameters and training data. Additionally, networks trained on two words learn to embed words into novel, unobserved word combinations. We also show that the concatenated outputs contain precursors to compositionality. To our knowledge, this is a previously unreported property of CNNs trained in the ciwGAN/fiwGAN setting on raw speech. It has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution in the brain from raw acoustic inputs. We also propose and formalize a neural mechanism called disinhibition that outlines a possible artificial and biological neural pathway towards concatenation and compositionality. This suggests that our modeling is useful for generating testable predictions for biological and artificial neural processing of spoken language.
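
To make the idea of "outputs with two or even three words concatenated" concrete, the following is a minimal sketch of one way word-like segments in a generated waveform could be counted, using short-time energy and a silence gap. This is not the authors' analysis method, which the abstract does not describe. The function names `count_segments` and `make_burst_signal`, the 16 kHz sample rate, the frame length, the energy threshold, and the minimum gap are all assumptions chosen for illustration, and the test signal is synthetic tone bursts rather than speech.

```python
import numpy as np


def count_segments(wave, sr=16000, frame_ms=20, threshold=0.1, min_gap_ms=60):
    """Count high-energy regions separated by at least min_gap_ms of low energy.

    Hypothetical helper: all parameter defaults are illustrative assumptions.
    """
    frame = int(sr * frame_ms / 1000)          # samples per analysis frame
    n_frames = len(wave) // frame
    if n_frames == 0:
        return 0
    # Split the signal into non-overlapping frames and compute RMS energy.
    frames = np.asarray(wave[: n_frames * frame], dtype=float).reshape(n_frames, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    active = rms > threshold
    min_gap = max(1, int(round(min_gap_ms / frame_ms)))
    count = 0
    gap = min_gap                              # so the first active frame starts a segment
    for is_active in active:
        if is_active:
            if gap >= min_gap:                 # enough silence since the last active frame
                count += 1
            gap = 0
        else:
            gap += 1
    return count


def make_burst_signal(n_words, sr=16000, word_s=0.3, pause_s=0.2):
    """Synthetic stand-in for speech: n_words tone bursts separated by silence."""
    t = np.arange(int(sr * word_s)) / sr
    burst = 0.5 * np.sin(2 * np.pi * 220 * t)
    pause = np.zeros(int(sr * pause_s))
    parts = [pause]
    for _ in range(n_words):
        parts.extend([burst, pause])
    return np.concatenate(parts)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, count_segments(make_burst_signal(n)))
```

On real generated speech, a fixed energy threshold would likely be too crude, since words can be concatenated with little or no intervening silence. The sketch only illustrates the kind of output property the abstract refers to.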