dev_to, April 25, 2026

A supervisor-tree library for building predictable and resilient programs

supervisor-tree erlang-otp python fault-tolerance system-design


🙋‍♂️ Claim: battle-tested idea, not AI-slop. I carefully designed the architecture and wrote most of the implementation myself (>80%). AI assistance is present, but mainly for creating unit tests and standalone utils.

I just released Runsmith, an Erlang/OTP style supervisor-tree framework for when your Python service/system is made of multiple long-running programs. Think of an ETL service with a data poller, a transformer, and a result notifier, each with its own lifecycle, failure modes, and recovery needs. Wiring this by hand with retry loops, watchdog threads, and scattered state flags produces brittle glue code that is hard to reason about.

⚙️ Runsmith brings structure to this problem. Each unit becomes a worker with an explicit FSM lifecycle. A supervisor tree monitors every worker continuously (detecting stalls and timeouts, not just crashes) and confines restarts to the failed unit so the rest of the system keeps running.

The real origin story

I was building the backend for a safety-protection camera system at work. The camera is used in manufacturing plants, so downtime is unacceptable. The system has multiple processes all running together:

- Web app: serving the HTTP API and SSE streams
- Algorithm worker: running CV inference on incoming frames
- Camera controller: interacting with the camera device library and polling frames
- Background task: runner for scheduled jobs such as periodic data vacuuming
- ONVIF service

Each one needed to run indefinitely and recover from failures without dragging the others down. The algorithm worker once stalled mid-inference due to third-party driver failures. The FastAPI web app's event loop was once starved because someone wrote bad sync code...

My first version was a messy soup: lots of state flags, retry logic, watchdogs, and probes desperately trying to hold things together. It worked, but it was hard to maintain and reason about.
What I actually wanted was a framework where supervision is a first-class concept and fault isolation is structural rather than bolted on. More importantly, I really wanted a unified structure for modelling long-running stateful function units. So I built it. Runsmith is essentially what I wished had existed when I started that project 🤗

supervisord? Nope 🙂‍↔️

Runsmith and supervisord solve different problems. supervisord is an OS-level process control daemon that manages external programs by PID and static config. Runsmith is an in-process, programmable Python library where the supervised unit is a typed worker with an explicit lifecycle. That gives a few advantages not present in supervisord:

- Rich concurrency models: beyond process-only orchestration, workers can run in threads, coroutines, or even custom execution backends.
- Fine-grained health probes: failure is not just an abnormal process exit, but a constraint violation that can be detected and recovered from.
- Supervisor-tree: Erlang/OTP style supervisor-tree for nested fault domains.
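
To make the core idea concrete (workers with an explicit FSM lifecycle, and a supervisor that treats a stall as a failure and restarts only the affected unit), here is a minimal sketch. It is not Runsmith's actual API: `State`, `TRANSITIONS`, `Worker`, `Supervisor`, `beat`, and `check` are hypothetical names chosen for illustration. It models a single flat supervisor rather than a tree, and it passes timestamps in explicitly instead of running real threads so the behaviour is easy to follow.

```python
import enum
from dataclasses import dataclass
from typing import Dict, List


class State(enum.Enum):
    CREATED = "created"
    RUNNING = "running"
    FAILED = "failed"
    STOPPED = "stopped"


# Allowed lifecycle transitions (hypothetical; not Runsmith's real state set).
TRANSITIONS = {
    State.CREATED: {State.RUNNING},
    State.RUNNING: {State.FAILED, State.STOPPED},
    State.FAILED: {State.RUNNING, State.STOPPED},
    State.STOPPED: set(),
}


@dataclass
class Worker:
    name: str
    stall_timeout: float  # seconds without a heartbeat before it counts as stalled
    state: State = State.CREATED
    last_beat: float = 0.0
    restarts: int = 0

    def transition(self, new_state: State) -> None:
        # Reject any move that the lifecycle table does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.name} -> {new_state.name} not allowed"
            )
        self.state = new_state

    def start(self, now: float) -> None:
        self.transition(State.RUNNING)
        self.last_beat = now

    def beat(self, now: float) -> None:
        # Called from the worker's own loop to prove it is making progress.
        self.last_beat = now

    def crash(self) -> None:
        self.transition(State.FAILED)


class Supervisor:
    def __init__(self, workers: List[Worker]) -> None:
        self.workers: Dict[str, Worker] = {w.name: w for w in workers}

    def start_all(self, now: float) -> None:
        for worker in self.workers.values():
            worker.start(now)

    def check(self, now: float) -> List[str]:
        """Run one supervision pass and return the names of restarted workers."""
        restarted = []
        for worker in self.workers.values():
            stalled = (
                worker.state is State.RUNNING
                and now - worker.last_beat > worker.stall_timeout
            )
            if stalled:
                worker.transition(State.FAILED)  # a stall is a failure too
            if worker.state is State.FAILED:
                worker.restarts += 1
                worker.start(now)  # restart only this worker
                restarted.append(worker.name)
        return restarted


poller = Worker("poller", stall_timeout=5.0)
transformer = Worker("transformer", stall_timeout=5.0)
supervisor = Supervisor([poller, transformer])
supervisor.start_all(now=0.0)
poller.beat(now=4.0)                     # poller reports progress
restarted = supervisor.check(now=6.0)    # transformer has been silent for 6 s
print(restarted)
```

In this run only `transformer` is restarted, because its last heartbeat is older than its timeout while `poller` is still within its own. A real implementation would also need an execution backend (thread, process, or coroutine), a way to actually tear down a stuck unit, and a restart policy with limits and backoff, all of which this sketch leaves out.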