dev_to — April 25, 2026


A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score

snake-ai, reinforcement-learning, deep-learning, convolutional-neural-networks, game-state-encoding


A traditional Snake game grid has only 4 states each grid point can be in: empty, head, body, or apple. And for some reason, every published Snake AI paper either throws away spatial information by condensing the game state into a handful of hand-picked numbers, or buries entity identity under layers of raw pixel data that the network has to untangle. Incredibly wasteful.

The solution? Binary Plane Encoding. Using it, a CNN-based model reached a record score of 125 on a 20×20 grid in 2.5 hours on a single RTX 2070, doubling the best published result of 62 (even the average is consistently above this record). This post explains the encoding, why it works, and explores why nobody in the Snake DRL space has tried it before.

The published literature on deep reinforcement learning for Snake spans 2018 to 2025 and splits into two approaches to state representation.

Camp one: hand-crafted feature vectors. Sebastianelli et al. (2021) and Kommalapati et al. (2025) both use 11 binary features fed to a fully-connected network: three danger flags (is there a wall or body segment directly ahead, to the left, to the right), four direction flags (which way is the snake currently heading), and four food-relative flags (is the apple above, below, left, or right of the head). The network receives a pre-digested summary of the game state. It never sees the grid. It never learns spatial relationships. A human decided what matters and encoded that decision directly into the input.

This works well. Sebastianelli achieved a best score of 62 on a 20×20 grid with vanilla DQN and this 11-feature representation, using very few resources... at least initially, but a hard ceiling is quickly reached. The network cannot discover and learn spatial patterns because it never sees the spatial layout. And the features themselves are Snake-specific: those 11 binary values encode what a Snake expert thinks matters. They would be meaningless for any other game.
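To make the Snake-specificity concrete, here is a sketch of what an 11-feature state vector of this kind looks like in code. This is illustrative, not the papers' implementation: the feature ordering, helper names, and coordinate convention (row 0 at the top, so "up" decreases the row) are all my assumptions.

```python
import numpy as np

def encode_features(head, direction, body, apple, grid_size):
    """Sketch of an 11-feature state vector: 3 danger flags,
    4 heading flags, 4 apple-relative flags (order illustrative)."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    turn_left = {"up": "left", "left": "down", "down": "right", "right": "up"}
    turn_right = {v: k for k, v in turn_left.items()}

    def step(pos, heading):
        dr, dc = moves[heading]
        return (pos[0] + dr, pos[1] + dc)

    def dangerous(cell):
        r, c = cell
        out_of_bounds = r < 0 or r >= grid_size or c < 0 or c >= grid_size
        return out_of_bounds or cell in body

    # Danger straight ahead, to the snake's left, to the snake's right
    danger = [dangerous(step(head, h))
              for h in (direction, turn_left[direction], turn_right[direction])]
    heading_flags = [direction == h for h in ("up", "down", "left", "right")]
    # Apple above / below / left / right of the head
    food = [apple[0] < head[0], apple[0] > head[0],
            apple[1] < head[1], apple[1] > head[1]]
    return np.array(danger + heading_flags + food, dtype=np.uint8)
```

Every line of this encoder bakes in a human decision about Snake; none of it transfers to any other game, which is exactly the generalisation problem described above.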
If you want an agent that can generalise beyond a single environment, this is a dead end.

Camp two: raw pixels. Wei et al. (2018) and Tushar & Siddique (2022) both train from screenshots. Wei uses 64×64 RGB frames stacked four deep, giving 64×64×12 input. Tushar converts to binary (any non-zero pixel becomes 1) at 84×84, also with four frames stacked, giving 84×84×4.

The pixel approach is game-agnostic, which is its strength. But the cost is significant. Tushar's binary encoding collapses head, body, and apple into a single value. In any individual frame, every occupied cell looks identical. The agent can only figure out what's what by watching how things move across four stacked frames: food stays still, the snake moves. A single frame on its own contains zero identity information. Wei's RGB encoding preserves colour and therefore identity, but at the cost of massive input dimensionality and redundant spatial resolution (64×64 pixels to represent a 20×20 logical grid).

Both pixel approaches were tested on 12×12 grids, reaching best scores of 17 (Wei) and 20 (Tushar). Neither has been applied to 20×20.

Beyond the peer-reviewed literature, informal projects show similar patterns. A supervised learning approach on GitHub (Huynh, 2020) uses 7 hand-crafted features with a Keras network and reaches a best of 46, average 22, on 20×20. A Medium article (Schoberg, 2020) compares deterministic algorithms rather than learned policies, reaching 67 on 20×20 with a collision-avoiding shortest-path algorithm (no neural network involved at all). Across all of it, every neural network approach uses either compressed feature vectors or raw pixel grids.

Here is the part that surprised me. Multi-channel grid encoding is not a new idea. It is the standard state representation in board game AI. AlphaZero (Silver et al., 2018) represents chess, Go, and Shogi as multi-channel binary planes. Each piece type, colour, and game-state feature gets its own channel.
The network receives a spatial tensor where every channel encodes a different semantic category of information about the board. MuZero extends this. The representation is well-established, well-understood, and has been proven at the highest levels of game AI.

Snake fundamentally runs on a grid with a fixed set of positions entities can occupy. It mirrors the exact class of problem where channel-per-entity encoding has proven effective, yet no published Snake DRL paper, and no self-published project I have found, attempts this representation. (Its absence from published papers isn't surprising to me, though. As someone who this month had to go through over 2,100 papers, I can say most papers just follow pre-existing trends.) All of the pre-existing Snake DRL literature either pre-computes features and discards spatial representation, or captures raw pixels and forces the network to spend capacity on visual processing before it can even begin to learn the game. This is the gap. Not a novel encoding technique, but an established one applied to a domain that has ignored it.

The state representation is a 20×20×3 binary tensor. Three channels, each covering the full grid:

- Channel 0 (head): 1 at the head position, 0 everywhere else.
- Channel 1 (body): 1 at each body segment position, 0 elsewhere.
- Channel 2 (apple): 1 at the apple position, 0 everywhere else.

Every value is exactly 0 or 1. A single frame provides complete, unambiguous game state: where is the head, where is the body, where is the food. No temporal stacking required. No entity disambiguation through motion inference. No feature engineering.
The construction from game state is straightforward:

```python
import numpy as np

def encode_state(grid_size, head_pos, body_positions, apple_pos):
    state = np.zeros((3, grid_size, grid_size), dtype=np.uint8)
    # Channel 0: head
    state[0, head_pos[0], head_pos[1]] = 1
    # Channel 1: body
    for segment in body_positions:
        state[1, segment[0], segment[1]] = 1
    # Channel 2: apple
    state[2, apple_pos[0], apple_pos[1]] = 1
    return state
```

That produces 20×20×3 = 1,200 values per state. Compare that to the pixel approaches: Tushar's binary encoding produces 84×84×4 = 28,224 values (23× larger), and Wei's RGB produces 64×64×12 = 49,152 values (41× larger). The grid encoding captures strictly more semantic information in a fraction of the space.

The information hierarchy makes this concrete:

| Approach | Entity identity per frame | Full spatial layout | Game-agnostic |
|---|---|---|---|
| Binary Plane Encoding (this model) | Yes, perfect | Yes | Partial (any grid game) |
| RGB pixels (Wei et al.) | Yes, via colour | Approximate | Yes |
| Binary pixels (Tushar) | No (needs 4 frames) | Approximate | Yes |
| Feature vectors (Sebastianelli) | Yes, pre-computed | No | No (Snake-specific) |

This is the only representation in the reviewed literature that provides perfect entity identity, full spatial layout, and game-agnostic structure without additional processing.

The model processing this encoding is deliberately compact: two convolutional layers with 32 and 64 channels respectively, 3×3 kernels with same padding, followed by a single MaxPool2d that halves the spatial dimensions from 20×20 to 10×10, then two dense layers of 512 and 256 units, with Mish activation throughout. The network also uses a dueling architecture (separate value and advantage streams) and NoisyLinear layers replacing standard linear layers in the fully-connected head, providing learned exploration noise instead of epsilon-greedy.

This is not a large network. It doesn't need to be. The compact input representation means the convolutional backbone doesn't need depth.
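As a quick sanity check on those layer sizes, the shape arithmetic through the backbone can be traced without any deep-learning framework. This is only a sketch of the described layer list; the dueling streams and noisy layers are left out:

```python
def backbone_shapes(grid=20, in_channels=3):
    """Trace (channels, height, width) through the described backbone:
    two 3x3 same-padded convs (32 then 64 channels), one 2x2 max-pool."""
    shapes = [(in_channels, grid, grid)]
    shapes.append((32, grid, grid))            # conv1: same padding keeps 20x20
    shapes.append((64, grid, grid))            # conv2: still 20x20
    shapes.append((64, grid // 2, grid // 2))  # max-pool halves to 10x10
    c, h, w = shapes[-1]
    # Flattened size feeding the first (512-unit) dense layer
    return shapes, c * h * w

shapes, flat = backbone_shapes()
# flat == 64 * 10 * 10 == 6400
```

Two convolutions and one pooling stage turn the 3×20×20 input into a 6,400-value feature vector for the dense head, which is why no deeper stack is needed.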
Two 3×3 layers with a single pooling stage are sufficient to capture the spatial relationships that matter in a 20×20 grid: proximity to walls, body segment density in nearby regions, and relative food position. The encoding has already done the hard work of structuring the information. The CNN just needs to read it.

The meaningful comparisons are grouped by grid size, since raw scores are not directly comparable across different board dimensions. The only published peer-reviewed result on a 20×20 Snake grid is Sebastianelli et al. (2021). They used an MLP with 11 hand-crafted binary features and vanilla DQN, testing 13 hyperparameter configurations across evaluation runs. Their best single score was 62. This work, using Binary Plane Encoding with a CNN and Rainbow DQN (incorporating C51 distributional output, dueling architecture, noisy exploration, prioritised replay, and 3-step returns), achieved a record of 125 on the same grid. Over double.

This isn't a cherry-picked peak. Across 55,000 episodes of sustained training, the rolling average holds between 60 and 70, and the median between 64 and 74. Sebastianelli's best single game of 62 sits below this model's average. The p10 floor (the score that 90% of episodes exceed) holds around 30, meaning even the worst games routinely outperform most published baselines. The p90 reaches into the high 90s, with individual episodes regularly breaking 100. Training to this point took approximately 2.5 hours on a single RTX 2070.

An important caveat: this is not an encoding-only comparison. The improvement comes from changes across multiple axes simultaneously: state representation (grid encoding vs feature vector), architecture (CNN vs MLP), algorithm (Rainbow DQN vs vanilla DQN), and training scale (2048 parallel environments vs a smaller setup). The encoding is the enabling change that made the architecture and training scale feasible on consumer hardware, but the doubling should not be attributed to the encoding alone.
Direct score comparison across grid sizes doesn't work because a 12×12 grid has a maximum possible score of approximately 141 food items versus approximately 399 for 20×20. Board coverage (score divided by maximum possible) provides a normalised metric:

| Work | Grid | Best Score | Board Coverage |
|---|---|---|---|
| Wei et al. (2018) | 12×12 | 17 | ~12% |
| Tushar & Siddique (2022) | 12×12 | 20 | ~14% |
| Sebastianelli et al. (2021) | 20×20 | 62 | ~16% |
| This model | 20×20 | 125 | ~31% |

The gap persists across normalisation. At 31% board coverage, this approach covers roughly double the grid fraction of the nearest published result and more than double the pixel-based CNN approaches. For completeness: a supervised learning project (Huynh, 2020) on 20×20 achieved a best of 46, and a deterministic shortest-path algorithm (Schoberg, 2020) reached 67 on 20×20. The latter is not a learned policy. Neither is peer-reviewed.

The encoding's advantage operates on two levels.

Information quality. The network receives exactly the information it needs to play Snake, in a spatial format that CNNs are designed to process, with zero noise or redundancy. Each channel answers one question: where is the head, where is the body, where is the food. There is no ambiguity to resolve, no motion to infer, no irrelevant visual detail to filter out. Pixel inputs force the network to first learn to segment the image (determining what is the snake's body and what is the background), and only then learn to interpret the spatial relationships between the segments. With Binary Plane Encoding, this segmentation is pre-constructed, leaving the network to devote its entire capacity to learning the actual game instead of learning how to see in the first place.

Information density. At 1,200 values per state stored as uint8, a replay buffer holding 1,000,000 transitions fits comfortably in approximately 1.6GB of VRAM.
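The VRAM arithmetic behind these figures is simple to reproduce. The sketch below counts raw state arrays only; the gap between the 1.2GB raw figure and the ~1.6GB quoted presumably comes from buffer overhead such as next-states, actions, and priorities, which is my assumption rather than a stated detail:

```python
def raw_buffer_gb(h, w, c, transitions=1_000_000):
    """Raw uint8 storage for one state array per transition, in GB."""
    return h * w * c * transitions / 1e9

plane = raw_buffer_gb(20, 20, 3)      # Binary Plane Encoding: 1,200 values/state
binary_px = raw_buffer_gb(84, 84, 4)  # Tushar's binary pixels: 28,224 values/state
rgb_px = raw_buffer_gb(64, 64, 12)    # Wei's stacked RGB: 49,152 values/state
# plane -> 1.2 GB raw; binary pixels -> 28.224 GB; RGB -> 49.152 GB
```

At roughly 23× and 41× the per-state footprint, the pixel encodings turn the same million-transition buffer from a comfortable single-GPU allocation into something that no consumer card can hold.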
This made a GPU-resident replay buffer and 2048 parallel environments possible on a single RTX 2070 with 8GB of VRAM. For comparison, storing Tushar's 84×84×4 binary inputs at the same buffer capacity would need approximately 28GB, and Wei's 64×64×12 RGB inputs approximately 49GB. Neither fits on consumer hardware: you would need multiple high-end GPUs or cloud infrastructure to achieve the same training scale with pixel-based inputs. The compact encoding didn't just improve information quality. It made the training infrastructure possible. 2048 parallel environments with a GPU-resident buffer meant the replay buffer reached useful diversity faster, the distributional RL gradient signal had richer data to work with, and the agent surpassed all previous records before reaching 100,000 training episodes.

This encoding is a privileged state representation. The agent receives information extracted directly from the game's internal data structures: exact head position, exact body segment positions, exact apple position. A human player has access to the same logical information through visual perception, but this agent receives it pre-structured without any perceptual processing.

The model plateaued at 125 (over 50,000 simulations without it budging), but a subsequent run using a variant algorithm has already broken that record, so we know this isn't the ceiling for the encoding. The more interesting question is whether pixel-based approaches could ever reach these scores given enough compute. Theoretically yes, but whether it's achievable in practice is unknown. Imperfections in the visual pipeline may compound through training, but that hypothesis hasn't been tested, and the performance cost of segmentation quality on Snake hasn't been quantified. Whether the gap is recoverable or structural is an open question, and one worth testing properly. If you take this on, I'd love to see what you find.

Cross-paper comparisons to Sebastianelli et al.
and the pixel-based approaches should be read with the privileged state in mind. The improvement reflects the combined effect of encoding quality, architecture, algorithm, and training scale. Isolating each factor's individual contribution is the purpose of the ablation study this encoding supports.

Binary Plane Encoding is the foundation for a systematic ablation study on Rainbow DQN applied to Snake. The study adds one component at a time (Double DQN, noisy exploration, dueling architecture, prioritised experience replay, C51 distributional output), measuring each component's individual contribution in a dense-reward, vectorised-environment setting. Early results have already produced some surprises about which Rainbow components help and which ones hurt on a task like Snake. That is the next post.

If you have experience with alternative state representations for grid-based game AI, or if you have seen Binary Plane Encoding applied to Snake in work I haven't found, I'd genuinely like to hear about it in the comments. This work is part of ongoing research, and the findings are planned to be submitted as a peer-reviewed paper.

References:

- Sebastianelli et al. (2021) - "A Deep Q-Learning based approach applied to the Snake game" - 29th Mediterranean Conference on Control and Automation (MED). DOI: 10.1109/MED51440.2021.9480232
- Kommalapati et al. (2025) - "Building an AI Snake Powered by Deep Reinforcement Learning and Deep Q-Learning" - IEEE 7th International Symposium on Advanced Electrical and Communication Technologies (ISAECT). DOI: 10.1109/ISAECT68904.2025.11318716
- Wei et al. (2018) - "Autonomous Agents in Snake Game via Deep Reinforcement Learning" - IEEE International Conference on Agents (ICA), Singapore. DOI: 10.1109/AGENTS.2018.8460004
- Tushar & Siddique (2022) - "A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents" - IEEE 16th International Conference on Application of Information and Communication Technologies (AICT). DOI: 10.1109/AICT55583.2022.10013603
- Silver et al. (2018) - "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play" - Science 362, 1140-1144. DOI: 10.1126/science.aar6404
- Huynh (2020) - Supervised learning Snake AI. GitHub repository.
- Schoberg (2020) - Deterministic algorithms for Snake. Medium article.