dev_to, April 24, 2026

Scaling WebSockets to 100k Connections: Lessons from a Real-Time Cricket App


Tags: websockets, scalability, nodejs, performance-optimization, real-time


When Virat Kohli walks to the crease, traffic on a cricket scoring app doesn't climb gradually; it spikes vertically. One moment you have 5,000 connected users, three minutes later you have 120,000, and every single one wants a push notification on the next ball. That graph broke our first attempt at real-time at Xenotix Labs. Here's what we learned rebuilding it.

Our first iteration: one Node.js process running socket.io, with every connected client subscribed to every live match. It worked beautifully at 2,000 concurrent connections. At 15,000 it started dropping heartbeats. At 40,000 the event loop lag crossed 3 seconds and reconnection storms made everything worse.

Lessons from the ashes: a single Node process caps out somewhere between 20k and 40k sockets, depending on what else the event loop is doing. Broadcasting to all clients from a single process is O(N) per event, so one hot match drives the whole loop. Reconnection storms are real: when you restart a gateway, every disconnected client reconnects within ~2 seconds, a self-inflicted DDoS.

We rebuilt around three principles. First, WebSocket gateway nodes are dumb and stateless: they only hold connections and forward messages, with no business logic. Second, Redis pub/sub is the bus: every gateway subscribes to Redis channels keyed by match_id, score updates are published once, and every gateway fans out to its own connections. Third, sticky sessions on the ALB: a client reconnects to the same gateway via cookie, so we don't thrash connection state.

The flow: score provider → ingest worker → Redis PUB match:123 → N gateways SUB match:123 → WS push to clients. Scaling is now horizontal: add gateway nodes, and Redis fans out. A single Redis cluster handles hundreds of thousands of pub/sub messages per second.

Every WebSocket message is a delta, not a full state refresh. When a ball is bowled we push {over: 14.3, runs: 4, batsman: "Kohli"}, not the whole scorecard. Why: at 120k connections and one update per second, a 200-byte delta vs. a 4KB snapshot is the difference between roughly 24 MB/sec and 480 MB/sec of total outbound bandwidth (120,000 × 200 B vs. 120,000 × 4 KB), and the same 20x gap applies to each gateway's share. That changes what instance sizes you need.

A real production killer: a mobile client on 2G takes 8 seconds to ACK each message. If you don't handle this, the server buffers pending messages in memory, and eventually that buffer OOMs your Node process. Our rule: if a client hasn't ACKed in 5 seconds, drop the oldest queued messages and send a "resync" event. The client re-fetches the full scorecard from a REST endpoint and resumes the WebSocket. This trades a small UX hiccup for server stability.

When a gateway restarts, add random 0–5 second jitter to the client's reconnect delay. Without it, all N clients reconnect simultaneously and crush the ALB. With it, the load spreads smoothly. On the server side, drain gateways gracefully: the ALB stops sending new connections, existing connections finish their current messages, then the process exits. Rolling deploys become a non-event.

Forget fancy dashboards. Three numbers tell you if real-time is healthy: event loop lag on each gateway (p99 under 50 ms, always), connection count per gateway (under 25k each), and Redis pub/sub fan-out latency (time from PUB to the last gateway receive, under 100 ms). If any of those drift, rebalance or scale before users notice.

What we'd do differently: use uWebSockets.js from the start. It's ~5x more efficient than socket.io for raw WebSocket throughput. We migrated mid-project and regretted not doing it on day one. Build a load-shedding mechanism earlier: when the system is overloaded, drop low-priority events ("commentary") before high-priority ones ("wicket"), and don't treat all messages equally. Test with airplane mode and 2G emulation: most WebSocket bugs appear during bad-network transitions, not at steady state.
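The publish-once, fan-out-per-gateway flow described above can be sketched without any infrastructure. This is a minimal illustration, not the authors' gateway code: the `Bus` class is an in-memory stand-in for Redis pub/sub, and `Gateway`, `join`, `fanOut`, and the fake sockets are hypothetical names introduced here. It only shows the shape of the idea: a score update is published once per channel, and each gateway delivers it to its own local connections.

```javascript
// In-memory stand-in for Redis pub/sub (assumption: production would use a Redis client).
class Bus {
  constructor() {
    this.subscribers = new Map(); // channel name -> Set of handler functions
  }
  subscribe(channel, handler) {
    if (!this.subscribers.has(channel)) this.subscribers.set(channel, new Set());
    this.subscribers.get(channel).add(handler);
  }
  // Deliver one message to every subscribed gateway; returns the subscriber count.
  publish(channel, message) {
    const handlers = this.subscribers.get(channel);
    if (!handlers) return 0;
    for (const handler of handlers) handler(message);
    return handlers.size;
  }
}

// A "dumb" gateway: it holds sockets and forwards messages, nothing else.
class Gateway {
  constructor(bus) {
    this.bus = bus;
    this.socketsByMatch = new Map(); // matchId -> Set of local sockets
  }
  // Register a local client; subscribe to the bus the first time a match is requested.
  join(matchId, socket) {
    if (!this.socketsByMatch.has(matchId)) {
      this.socketsByMatch.set(matchId, new Set());
      this.bus.subscribe(`match:${matchId}`, (msg) => this.fanOut(matchId, msg));
    }
    this.socketsByMatch.get(matchId).add(socket);
  }
  // Push one already-serialized message to every local socket for the match.
  fanOut(matchId, msg) {
    for (const socket of this.socketsByMatch.get(matchId) ?? []) socket.send(msg);
  }
}

// Usage: two gateways, three clients on match 123, one client on match 456.
const bus = new Bus();
const gatewayA = new Gateway(bus);
const gatewayB = new Gateway(bus);
const received = [];
const fakeSocket = (name) => ({ send: (msg) => received.push(`${name}:${msg}`) });
gatewayA.join(123, fakeSocket("a"));
gatewayA.join(123, fakeSocket("b"));
gatewayB.join(123, fakeSocket("c"));
gatewayB.join(456, fakeSocket("d"));
bus.publish("match:123", JSON.stringify({ over: 14.3, runs: 4, batsman: "Kohli" }));
console.log(received.length); // 3: clients a, b and c; d is on another match
```

Serializing the delta once before publishing means each gateway forwards the same string to all of its sockets instead of re-encoding it per client.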
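The delta-versus-snapshot bandwidth figures are plain multiplication. As a quick check, assuming one update per second delivered to every connection and 1 KB = 1,000 bytes (both assumptions made here for the arithmetic), 120,000 connections work out to 24 MB/sec for deltas and 480 MB/sec for snapshots in total, and a single gateway at the 25k connection ceiling carries about 5 MB/sec versus 100 MB/sec.

```javascript
// Outbound bandwidth in MB/sec when every connection receives every message.
// Assumption: 1 MB = 1,000,000 bytes.
function outboundMBPerSec(connections, bytesPerMessage, messagesPerSec = 1) {
  return (connections * bytesPerMessage * messagesPerSec) / 1e6;
}

const totalConnections = 120_000;
const perGatewayConnections = 25_000; // per-gateway ceiling mentioned in the text

console.log({
  deltaTotal: outboundMBPerSec(totalConnections, 200),              // 24
  snapshotTotal: outboundMBPerSec(totalConnections, 4_000),         // 480
  deltaPerGateway: outboundMBPerSec(perGatewayConnections, 200),    // 5
  snapshotPerGateway: outboundMBPerSec(perGatewayConnections, 4_000), // 100
});
```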
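The slow-client rule (no ACK within 5 seconds, drop the oldest queued messages, send a "resync") might look roughly like the following per-connection guard. This is a sketch under assumptions, not the production code: the `SlowClientGuard` name, the sequence numbers, the cumulative ACK, and the periodic `sweep` call are all hypothetical, and the sketch only tracks unacknowledged messages rather than touching real socket buffers.

```javascript
// Tracks unacknowledged messages for one client and sheds stale ones.
class SlowClientGuard {
  constructor(sendFn, ackTimeoutMs = 5000) {
    this.sendFn = sendFn;            // function that writes a message to the socket
    this.ackTimeoutMs = ackTimeoutMs;
    this.pending = [];               // [{ seq, sentAt, payload }], oldest first
    this.nextSeq = 1;
  }

  // Send a delta and remember it until the client acknowledges it.
  send(payload, now = Date.now()) {
    const seq = this.nextSeq++;
    this.pending.push({ seq, sentAt: now, payload });
    this.sendFn({ type: "delta", seq, payload });
    return seq;
  }

  // Cumulative ACK: everything up to and including `seq` is confirmed.
  ack(seq) {
    while (this.pending.length > 0 && this.pending[0].seq <= seq) {
      this.pending.shift();
    }
  }

  // Call periodically. Drops messages that have waited too long for an ACK
  // and tells the client to re-fetch full state. Returns how many were dropped.
  sweep(now = Date.now()) {
    let dropped = 0;
    while (this.pending.length > 0 && now - this.pending[0].sentAt >= this.ackTimeoutMs) {
      this.pending.shift();
      dropped++;
    }
    if (dropped > 0) this.sendFn({ type: "resync" });
    return dropped;
  }
}
```

On receiving `resync`, the client would fetch the full scorecard over REST and then continue applying deltas. Because stale entries are removed on every sweep, the per-client queue cannot grow without bound even if a client never ACKs.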
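The reconnect jitter is a one-line change on the client. The sketch below assumes a uniform random delay between 0 and 5 seconds added to whatever base delay the client already uses; `reconnectDelayMs`, `reconnectsPerSecond`, and the injectable `random` parameter are hypothetical names for illustration. The second function simulates how a gateway's worth of clients spreads across one-second buckets.

```javascript
// Client-side reconnect delay: base delay plus uniform random jitter in [0, maxJitterMs).
function reconnectDelayMs(baseDelayMs = 0, maxJitterMs = 5000, random = Math.random) {
  return baseDelayMs + Math.floor(random() * maxJitterMs);
}

// Count how many of `clients` would reconnect in each one-second bucket.
function reconnectsPerSecond(clients, maxJitterMs = 5000, random = Math.random) {
  const buckets = new Array(Math.ceil(maxJitterMs / 1000)).fill(0);
  for (let i = 0; i < clients; i++) {
    const delay = reconnectDelayMs(0, maxJitterMs, random);
    buckets[Math.floor(delay / 1000)]++;
  }
  return buckets;
}

// 25,000 clients end up at roughly 5,000 reconnects per second
// instead of 25,000 arriving in the same instant.
console.log(reconnectsPerSecond(25_000));
```

A client would wrap its reconnect call in something like `setTimeout(connect, reconnectDelayMs())`, where `connect` is whatever function reopens the socket.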
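Priority-based load shedding can start very small. The sketch below assumes each event carries a `type` and that the gateway has some per-tick send budget; the `PRIORITY` table (including the `score` tier), `priorityOf`, and `shed` are hypothetical names, and only the "commentary before wicket" ordering comes from the article. Under overload it keeps the highest-priority events and preserves their original order.

```javascript
// Lower number = more important. Unknown types are shed first.
const PRIORITY = { wicket: 0, score: 1, commentary: 2 };

function priorityOf(type) {
  return PRIORITY[type] ?? 99;
}

// Keep at most `capacity` events, dropping the lowest-priority ones first
// while preserving arrival order among the events that survive.
function shed(events, capacity) {
  if (events.length <= capacity) return events;
  return events
    .map((event, index) => ({ event, index }))
    .sort((a, b) => priorityOf(a.event.type) - priorityOf(b.event.type) || a.index - b.index)
    .slice(0, capacity)
    .sort((a, b) => a.index - b.index)
    .map((entry) => entry.event);
}

const kept = shed(
  [{ type: "commentary" }, { type: "score" }, { type: "wicket" }, { type: "commentary" }],
  2
);
console.log(kept.map((e) => e.type)); // [ 'score', 'wicket' ]
```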
The resulting stack:

Gateway: Node.js + uWebSockets.js, containerized on ECS
Bus: Redis pub/sub on ElastiCache
Ingestion: Node.js worker, consuming from the score provider
Client: Flutter + Next.js with delta-merge logic
Load balancer: AWS ALB with sticky sessions

Whether it's live sports, collaborative editing, trading platforms, or real-time dashboards, scaling WebSockets is a discipline with sharp edges. If you're building in this space, Xenotix Labs has shipped real-time stacks that survive match-day India traffic. Reach out at https://xenotixlabs.com.