Once upon a time...
Scout's 50M model couldn't absorb the narrative language of the novels, the conversational training corpus, and the daily memory updates. The model was just too small. I could simply increase the size of the model and start over, but that's basically kicking the can down the road. At some point even the largest models would experience the same problem of daily fine-tuning wiping out base instincts.
The concept I'm playing with is something hinted at in "Dynamic transformer architecture for continual learning of multimodal tasks":
introducing extra parameters to a base transformer to specialize the network for each task
The basic idea is that, if designed properly up-front, a model can be grown over time with multiple transformer modules, continuing to add new skills to an already-trained model rather than rebooting from scratch every time it reaches its limit.
Scout's new model will be trained in phases. Phase 1: use the TinyStories training set to bootstrap basic language skills. Once this phase is done, that "module" has its weights frozen. The next module is then introduced to the architecture, and the model is trained on how to participate in a conversation. LoRA adapter layers allow each module to fine-tune over time to fit the needs of the larger architecture.
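The article doesn't prescribe an implementation, but the freeze-then-adapt step might look something like this in PyTorch. This is a minimal sketch, and `LoRALinear`, `r`, and `alpha` are my own names and choices, not anything from Scout's actual code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a small trainable low-rank adapter."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # phase-1 weights stay frozen
        # low-rank update: effective W = W + (alpha / r) * B @ A
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

# the frozen phase-1 module keeps its weights; only A and B train
layer = LoRALinear(nn.Linear(512, 512))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
```

Because `B` starts at zero, the adapter initially passes the frozen module's output through unchanged, which is exactly the property we want: adding a module can't wipe out what the earlier phase learned.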
I've been reading about the idea. It feels reasonable. The article I listed describes using this system to incorporate vision into a foundation model, but I don't see any reason why the same process can't apply to reasoning skills. If I'm right, then an ongoing memory module could be added as its own module that gets trained separately, allowing Scout to learn from her experiences without the risk of wiping out her other instincts.
I'm training a 50M model to learn basic language skills. I think this dataset will give me what I need. We should know in a couple of days. Or maybe a week. The model structure loosely follows Scout's last iteration:
dim = 512
layers = 12
heads = 8
mlp_ratio = 3.5
block_size = 512
I made the mlp_ratio configurable, then reduced it from 4 to 3.5. The MLP ratio is a representation of the complexity of the model's inner world, and TinyStories isn't that complicated. I went back and forth on the block size. The target is 1,024, but TinyStories chunks are between 150 and 200 tokens. Expanding that out to 1,024 would risk over-stretching the dataset.
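Concretely, the ratio sets the hidden width of each feed-forward block, so trimming it is a cheap way to shave parameters. A back-of-the-envelope sketch (counting only the two MLP weight matrices per layer, ignoring biases and attention):

```python
def mlp_params(dim: int, mlp_ratio: float) -> int:
    """Weights in one transformer MLP block: up-projection plus down-projection."""
    hidden = int(dim * mlp_ratio)
    return dim * hidden + hidden * dim

layers = 12
saved = layers * (mlp_params(512, 4.0) - mlp_params(512, 3.5))
# 3,145,728 -- roughly 3.1M parameters saved across the 12 layers
```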
My preparation process caches the original HuggingFace dataset, then normalizes it so that every row has a "chunk" column. This will matter more when we start talking about including more datasets. After that the set is tokenized with the Mistral tokenizer (doing this up front before training begins saves so much time), with the end-of-stream token appended to the end of each chunk.
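The normalize-then-terminate step is simple enough to sketch. This is an illustration, not the real pipeline: `toy_tokenize` is a stand-in for the Mistral tokenizer so the snippet runs with no dependencies, and `EOS_ID = 2` (Mistral's `</s>`) is assumed rather than loaded from the real tokenizer:

```python
EOS_ID = 2  # Mistral's </s>; assumed here rather than loaded from the tokenizer

def toy_tokenize(text: str, vocab: dict) -> list[int]:
    """Stand-in for the Mistral tokenizer; assigns ids on first sight."""
    return [vocab.setdefault(word, len(vocab) + 3) for word in text.split()]

def prepare(rows: list[dict], text_key: str = "text") -> list[dict]:
    """Normalize every row to a "chunk" column of token ids, EOS-terminated."""
    vocab: dict = {}
    return [{"chunk": toy_tokenize(row[text_key], vocab) + [EOS_ID]}
            for row in rows]
```

Once every source dataset is reduced to rows with a "chunk" column, the trainer never needs to know which dataset a row came from.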
The trainer loads rows at random, and continues to pack in more chunks until it reaches just under the block size, then pads out to the block size with empty space. PyTorch requires every training example in a batch to be equal in length.
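The packing loop might look like the sketch below. The function and `PAD_ID` are my own naming for illustration, and `PAD_ID = 0` is an assumption about what "empty space" means in practice:

```python
import random

PAD_ID = 0  # assumed pad token standing in for "empty space"

def pack_blocks(chunks: list[list[int]],
                block_size: int, seed: int = 0) -> list[list[int]]:
    """Pack tokenized chunks, drawn in random order, into fixed-length blocks.

    A block keeps taking chunks while they still fit, then is padded out to
    exactly block_size so every training example has equal length.
    """
    rng = random.Random(seed)
    order = list(range(len(chunks)))
    rng.shuffle(order)

    blocks: list[list[int]] = []
    current: list[int] = []
    for i in order:
        if current and len(current) + len(chunks[i]) > block_size:
            blocks.append(current + [PAD_ID] * (block_size - len(current)))
            current = []
        current.extend(chunks[i][:block_size])  # guard: one chunk never exceeds a block
    if current:
        blocks.append(current + [PAD_ID] * (block_size - len(current)))
    return blocks
```

With TinyStories chunks averaging 150-200 tokens, a 512-token block typically holds two or three whole stories, which is part of why 512 beat 1,024 here: less of each block is wasted on padding.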
Checkpoints are saved every 50 steps, and I have a simple REPL that lets me test inference while training is running.
And now, without further ado, the many iterations of:
Once upon a time there was a boy and his dog.
At 500 training steps the model seems to be making reasonable choices for how to continue the story. To a point. "in his way to the tree" should be "on his way to the tree", but in/on are at least both prepositions! "He had a special" makes me think of the LEGO movie. That's nice.
[You] Once upon a time there was a boy and his dog.
[Scout] He was very beautiful and loved to go around. One day, he saw a big box in his way to the tree. He thought it was so excited and he was very excited to be the old and couldn't believe what it was very happy.
So he saw a few place of the big tree. He looked and saw a big tree and couldn't have a lot of toys. It was so excited. The little girl went and was walking and said it was very excited.
The little girl stopped and told the animals. He was happy and knew he was able to go to the sky. He took a loud noise and could not take it and his mom. He was so excited to be better and he was a nice he was so happy.
The little boy was so excited and he was playing in the park. He was so proud of it that he had to be better. He was so happy. He was so happy that he said he had some of his head.
The little girl saw a big smile in the sky. He was so excited! He had the ball and he was the end. He was so excited he had a special!
1,000 steps isn't that much better than 500. How do you get a big smile on a toy garden? Maybe it's a sticker. A lot of characters are being looped in, using proper nouns in place of "little boy" or "little girl". That's interesting.
[You] Once upon a time there was a boy and his dog.
[Scout] He was very excited and he had a big smile on his toy garden. One day he was playing with his friend a little boy named Max. Max was very excited and said he was going to the room and he was sad.
Billy was very happy. He took out a stick and showed that it was a brave little boy. He was very happy and he kept walking around and he stopped.
Tommy was so sad, but he knew he could have something. He put the hat on his head and walked around to the park.
Bob was very sad, but he was too excited. He jumped up and started to see what he was playing. He was scared and didn't know what to do.
Suddenly, Timmy started to cry. He was so surprised! He knew he had to go home to the store.
Jack was so happy and he had a great idea.
At 2,000 steps we have a story with an introduction, a frame around the day ("one day"), and an ending "Thank you...!" The "boy" in the prompt continues to be a boy throughout the story, including male pronouns. There are still syntax errors though, e.g. "to pick it with it", but some people talk that way. The "store" is a narrative element that continues for several lines. And then the "swings" come out of nowhere. And what's the "doll"? But there's obviously progress.
[You] Once upon a time there was a boy and his dog.
[Scout] He loved to pick up a beautiful water with his family.
One day, he went to the park with his dad. He saw a small, shiny toy. He was so excited! He wanted to pick it with it, so he ran to the store.
He saw a big tree with lots of toys and creek. He ran and ran until he was finished.
He started to cry. He felt so sad. He asked his mom why he was so sad. She said he had to come to the store. He asked his mom, "Can I have a toy?"
The boy replied, "Yes, please. We can have to ask for the toy." So, they said, "Let's go!"
The boy was so happy to help his mom. He knew he could have some fun on the swings. He said, "No, I can't be careful."
The boy was very proud of his family. He gave him a hug and said, "Thank you for taking your doll with me!"
At 4,000 steps I feel like the model is saying more with less. I start the story with a boy, then the model gives the boy a name. The dog is noticeably absent. The model has decided that the story is actually about the box and its contents rather than the boy's relationship with his dog. But it feels more refined than 2,000 steps.
[You] Once upon a time there was a boy and his dog.
[Scout] His name was Jack and he was very happy.
One day, Jack's mommy said, "Let's get a big box!" Tom was so excited. He opened the box and his mommy said, "Let's go outside and find a big tree."
But then he heard a voice. "I'm going to find this tree!" he said.
He asked his mommy. "Oh, I don't know," he said.
His mommy smiled and said, "It's ok, Jack. Let's go find a treasure and see what's inside."
So they went back to his mommy's house and the box was ready for dinner. They went outside, dreaming about the fun adventure in the park.
Finally, the dog has entered the scene! There's a feeling in this one like the model is attending to the entire user prompt, which is something noticeably lacking from the previous iteration of the experiment. I'm not sure who "they" are. Did the boy find some friends? Doubling the training steps has definitely increased the coherence of the story. Some of the wording is still a bit awkward, e.g. "I'm tired, but I'm tired." The story has a definite beginning, middle, and end though.
[You] Once upon a time there was a boy and his dog.
[Scout] They were playing in the park when they noticed something on the ground. The boy said to his dog, "Let's go and play!"
So they ran over to the dog. The boy looked at the dog and said, "Look! I'm a bird. I'm so happy to see you!"
The boy said, "Me too! Let's play a game!"
So they played with the dog and laughed. They ran around and had lots of fun.
When the day went on, it was time to go home. But the boy said, "I'm tired, but I'm tired. I'll get out now."
The boy smiled and said, "I'm so glad you found me!"
The boy and his dog went home and went back to the park.
I'm going to take this training cycle to 40,000 steps. It's going to take a little while. Once that's done we should have some interesting results and an indicator for what needs to happen next.
Follow along on GitHub here.
To be continued.