Why Apache SeaTunnel Zeta Can Be Both “Fast and Stable”
If SeaTunnel Zeta is simply understood as “a faster execution engine,” its true value will be underestimated.
For data integration systems, the real challenge has never been “whether the pipeline can run,” but whether the following can be achieved at the same time: sufficiently high throughput, recoverability after failure, no data duplication or loss, and controlled resource consumption.
What makes Zeta worth serious attention lies exactly here: it does not win through a single performance optimization, but instead turns consistency, recovery, convergence under concurrency, and resource control into a closed-loop system capability.
Note: This article is based on SeaTunnel commit c5ceb6490; all source code interpretations refer to this version. Runtime observations are based on the official apache/seatunnel:2.3.13 image and are intended to help understand the mechanisms, not as a strict benchmark for this commit.
Conclusion First
From an architect’s perspective, SeaTunnel Zeta does not achieve both high throughput and stability through a single “performance optimization point,” but instead forms a closed loop of four capabilities:
Control plane: how checkpoints are triggered, timed out, and completed
State plane: how task state is snapshotted, persisted, restored, and remapped
Data plane: how Barrier, Record, and Close signals converge in order under high concurrency
Resource plane: how resources are modeled, allocated, and throttled to prevent the system from overwhelming itself
None of these four layers can be missing. If the contract of any layer is broken, it will eventually manifest as duplicate writes, stalled recovery, checkpoint timeouts, or resource instability.
1. Looking at the Big Picture: Zeta Solves Not Just “Fast,” but “Fast and Stable”
The most typical contradiction in data integration systems has never been “whether they can run,” but whether the following three conditions can be satisfied simultaneously:
Throughput is high enough to avoid becoming a bottleneck
Recoverable after failure, without data loss or duplication upon restart
Resource consumption is controllable, without exhausting the cluster in pursuit of stability
This is why I prefer to understand Zeta as a stability engine for data integration scenarios, rather than a generalized computing engine.
From the source code design, it decomposes the problem into four clearly defined planes:
Control plane: CheckpointCoordinator is responsible for triggering, progressing, completing, timing out, and terminating checkpoints
State plane: CheckpointStorage, CompletedCheckpoint, and ActionSubtaskState handle snapshotting and recovery
Data plane: SourceSplitEnumeratorTask, Writers, Aggregated Committer, and intermediate queues embed control signals into the data processing flow
Resource plane: ResourceProfile, DefaultSlotService, and read_limit handle resource profiling, dynamic allocation, and throttling
1.1 Architecture Overview
Architectural judgment: The highlight of Zeta is not the complexity of individual modules, but that it places “consistency, recovery, concurrency, and resources” into a unified protocol.
2. Exactly-Once Is Not a Single Capability, but a Cross-Layer Contract
Many articles describe Exactly-Once as “the engine supports checkpoints, therefore Exactly-Once is guaranteed.” This is not rigorous from an architectural perspective.
In Zeta, Exactly-Once is at least divided into two layers:
Engine-level guarantees: Barrier alignment, state snapshotting, completion ordering, and failure rollback
Connector-level guarantees: prepareCommit must produce transferable and replayable CommitInfo, and commit must be idempotent and retryable
In other words, Zeta provides an execution framework for Exactly-Once, rather than automatically guaranteeing it for all connectors.
In addition, the Sink side does not have only one commit path:
If the connector implements SinkAggregatedCommitter, it follows the path: Writer prepareCommit → Aggregated Committer aggregation → unified commit after notifyCheckpointComplete
If the connector only implements SinkCommitter, the commit happens directly inside notifyCheckpointComplete(...) of the Writer task
The following analysis focuses on the first path, as it better reflects Zeta’s coordination of consistency and commit timing at the engine level.
2.1 What It Actually Guarantees
Taking the SinkAggregatedCommitter path as an example, the Exactly-Once main flow in Zeta is:
CheckpointCoordinator triggers a checkpoint and injects barriers into tasks
Each participant snapshots state at the barrier boundary and sends ACK
Sink Writer calls prepareCommit(checkpointId) without committing externally
SinkAggregatedCommitterTask aggregates CommitInfo and includes the result in checkpoint state
Only when the Coordinator determines the checkpoint is complete does it trigger the actual commit(...)
The architectural meaning of this chain is very clear: first solidify the consistency boundary, then perform external side effects.
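The steps above can be sketched as a small, self-contained simulation. This is illustrative Python, not the real SeaTunnel API: the class names (`Writer`, `AggregatedCommitter`) and the dict-based CommitInfo are simplified stand-ins for the interfaces described in the article, but the ordering constraint they demonstrate is the one the chain relies on.

```python
# Illustrative simulation of the SinkAggregatedCommitter path:
# phase one (prepareCommit) solidifies state, phase two (commit) runs
# only after the checkpoint is known to be complete.

class ExternalSystem:
    def __init__(self):
        self.committed = []

class Writer:
    def __init__(self):
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)

    def prepare_commit(self, checkpoint_id):
        # Phase one: capture what WOULD be committed, without touching
        # the external system yet.
        commit_info = {"checkpoint_id": checkpoint_id, "records": list(self.buffer)}
        self.buffer.clear()
        return commit_info

class AggregatedCommitter:
    def __init__(self, external):
        self.external = external
        self.pending = {}  # checkpoint_id -> list of CommitInfo

    def aggregate(self, commit_info):
        cid = commit_info["checkpoint_id"]
        self.pending.setdefault(cid, []).append(commit_info)

    def notify_checkpoint_complete(self, checkpoint_id):
        # Phase two: external side effects happen only after the
        # consistency boundary is sealed.
        for info in self.pending.pop(checkpoint_id, []):
            self.external.committed.extend(info["records"])

external = ExternalSystem()
writer = Writer()
committer = AggregatedCommitter(external)

writer.write("r1")
writer.write("r2")
committer.aggregate(writer.prepare_commit(checkpoint_id=1))

# Between prepareCommit and checkpoint completion, nothing is externally visible.
assert external.committed == []

committer.notify_checkpoint_complete(1)
assert external.committed == ["r1", "r2"]
```

If the checkpoint failed before `notify_checkpoint_complete`, the pending CommitInfo would simply be discarded and replayed later; the external system never sees a half-finished boundary.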
2.2 Why This Design Matters
If the Writer commits to the external system immediately after local processing, once the checkpoint fails to complete, the system will face two classic problems after recovery:
State not saved but external commit already happened → irreversible duplication
Upstream replay writes again → logically at-least-once, but claimed as Exactly-Once
Zeta delays the commit action until after notifyCheckpointComplete, essentially doing one thing: binding external visible side effects to the completion of consistency.
2.3 Architectural Boundaries Must Be Clear
If this is not clearly stated, it is easy to misinterpret:
SinkWriter.prepareCommit(checkpointId) is not a normal flush, but a phase-one protocol action
SinkCommitter.commit(...) must be idempotent, otherwise duplicates may still occur after recovery
If the external system does not support idempotency or transactional semantics, engine-level Exactly-Once will degrade
Architectural judgment: Exactly-Once is not a “switch,” but a responsibility chain across engine, connectors, and external systems.
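To make the idempotency requirement concrete, here is a minimal sketch, assuming the external system can atomically record a transaction identifier alongside the data. The class and scheme are hypothetical, not a SeaTunnel connector API:

```python
# Sketch of an idempotent commit: replaying the same CommitInfo after
# recovery must be a no-op, so retries cannot produce duplicates.
class IdempotentSink:
    def __init__(self):
        self.data = []
        self.applied_txns = set()

    def commit(self, txn_id, records):
        if txn_id in self.applied_txns:
            return  # already applied: retry is harmless
        self.data.extend(records)
        self.applied_txns.add(txn_id)

sink = IdempotentSink()
sink.commit("ckpt-7", ["a", "b"])
sink.commit("ckpt-7", ["a", "b"])   # retry after recovery
assert sink.data == ["a", "b"]      # no duplication
```

If the target system cannot provide this kind of deduplication or transactional semantics, the engine-level guarantee degrades, exactly as stated above.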
2.4 What Is the Cost
Every architectural benefit comes with a cost, and Exactly-Once is no exception:
The more frequent the checkpoints, the higher the cost of Barrier handling and state serialization
External commits are delayed, introducing additional commit paths and state buffering
If Sink idempotency is not well designed, complexity shifts to connector implementers
3. The Key to Resume Is Not Just Restoring State, but Restoring Protocol Progress
Many systems stop at “restoring state objects.” But in distributed data integration, this is not enough, because the protocol itself has progress.
Three points in Zeta’s recovery path are particularly worth attention.
3.1 Recovery Is Not a Direct Restore, but a Remapping Based on Current Parallelism
CheckpointCoordinator.restoreTaskState(...) does not simply assign old state back to the original subtask. Instead, it determines the correct execution unit based on current parallelism and mapping.
This means it considers not “who ran last time,” but “who should take over this time.”
This is crucial because real-world recovery often involves:
Worker relocation
Parallelism changes
Slot reallocation
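The remapping idea can be illustrated with a toy redistribution function. The modulo scheme below is an assumption chosen for clarity; the actual mapping logic inside `CheckpointCoordinator.restoreTaskState(...)` may differ, but the invariant is the same: every piece of old state gets exactly one new owner under the current parallelism.

```python
# Hedged sketch: old subtask states are redistributed over the CURRENT
# parallelism rather than handed back to the original subtasks.
def remap_states(old_states, new_parallelism):
    assigned = [[] for _ in range(new_parallelism)]
    for old_index, state in enumerate(old_states):
        assigned[old_index % new_parallelism].append(state)
    return assigned

# 4 old subtasks recovered onto 2 new subtasks: no state is lost,
# and no state has two owners.
result = remap_states(["s0", "s1", "s2", "s3"], 2)
assert result == [["s0", "s2"], ["s1", "s3"]]
```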
3.2 The Core of Source Recovery Lies in the Enumerator
On the Source side, what truly determines whether reading can continue correctly is not just the reader itself, but the allocation state of splits.
Therefore, Zeta places the recovery focus on SourceSplitEnumerator:
During checkpoint: execute snapshotState(checkpointId)
During recovery: SourceSplitEnumeratorTask.restoreState(...) decides whether to call restoreEnumerator(...) or createEnumerator(...)
Then open() is invoked and subsequent coordination resumes
This shows that its recovery approach is not about “restoring threads,” but about “restoring the scheduler.”
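The restore-or-create decision can be sketched as follows. All names here (`FakeSource`, `Enumerator`, the snapshot dict) are illustrative stand-ins for the flow described around `SourceSplitEnumeratorTask.restoreState(...)`:

```python
# Illustrative decision logic: restore the enumerator (the "scheduler")
# from a snapshot if one exists, otherwise create a fresh one, then open it.
class Enumerator:
    def __init__(self, pending):
        self.pending = pending   # splits still to be assigned
        self.opened = False

    def open(self):
        self.opened = True

class FakeSource:
    def restore_enumerator(self, snapshot):
        return Enumerator(pending=snapshot["pending"])

    def create_enumerator(self):
        return Enumerator(pending=["split-0", "split-1"])

def restore_state(source, snapshot):
    enumerator = (source.restore_enumerator(snapshot)
                  if snapshot is not None
                  else source.create_enumerator())
    enumerator.open()
    return enumerator

# After recovery, the scheduler resumes with only the remaining splits.
e = restore_state(FakeSource(), {"pending": ["split-7"]})
assert e.opened and e.pending == ["split-7"]
```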
3.3 What Truly Reflects Stability Engineering Is “Protocol Signal Compensation”
One of the most telling details in the codebase is the re-signaling logic of NoMoreSplits after reader re-registration.
In SourceSplitEnumeratorTask.receivedReader(...), if a reader has previously been marked as having no more splits, then when it re-registers after recovery, the system will again call signalNoMoreSplits.
This detail is highly significant:
What is restored is not just data state
Nor just split allocation results
But also the fact that “this reader has already reached the end of the protocol”
Without this step, the system may appear to have “successfully restored state,” but the reader could remain stuck waiting for more splits indefinitely.
Architectural judgment: A truly mature recovery mechanism restores “state + protocol position + control signals,” not just a serialized object.
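The compensation logic can be modeled in a few lines. This is a simplified simulation of the behavior described for `SourceSplitEnumeratorTask.receivedReader(...)`, not the real implementation:

```python
# Sketch of "protocol signal compensation": if a reader had already been
# told there are no more splits, re-registering after recovery must
# re-deliver that signal, or the reader would wait forever.
class SplitEnumerator:
    def __init__(self):
        self.no_more_splits = set()   # reader ids that reached end of protocol

    def signal_no_more_splits(self, reader):
        self.no_more_splits.add(reader.id)
        reader.no_more_splits = True

    def received_reader(self, reader):
        # Recovery path: restore the PROTOCOL POSITION, not just the state.
        if reader.id in self.no_more_splits:
            reader.no_more_splits = True

class Reader:
    def __init__(self, rid):
        self.id = rid
        self.no_more_splits = False

enum = SplitEnumerator()
enum.signal_no_more_splits(Reader("reader-0"))

# The reader restarts after a failure and re-registers with a clean flag.
r2 = Reader("reader-0")
enum.received_reader(r2)
assert r2.no_more_splits   # without compensation, r2 would block forever
```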
4. In High-Concurrency Systems, the Real Risk Is Not Slowness, but Lack of Convergence
When people think of high concurrency, they often think of parallelism, threads, and queue length. But for data integration engines, the more dangerous issue is actually: whether control messages are drowned out, and whether the shutdown process loses control.
Zeta’s design here reflects a clear engineering mindset.
4.1 The Parallel Model Is Not the Highlight, the Convergence Model Is
From the task model perspective, Zeta’s high concurrency is not mysterious:
Source/Sink improve throughput via multiple Readers and Writers
Pipelines scale throughput via task parallelism
Aggregated Committer waits until all necessary writers are registered and aligned before advancing lifecycle
These are standard practices in distributed execution engines.
What stands out is that it does not treat “parallelism” as simply increasing processing threads, but treats how to terminate in an orderly way under concurrency as a first-class concern.
4.2 Barrier Priority Is Essentially Protecting the Control Plane
In the implementations of RecordEventProducer and IntermediateBlockingQueue, when a Barrier arrives, it is acknowledged with priority. If that Barrier triggers prepareClose for the current task, the system enters the prepareClose state, and ordinary records are no longer accepted into the queue.
This design addresses two common pitfalls in high-concurrency systems:
Control signals being drowned by data traffic: Barriers cannot reach boundaries, and consistency cannot converge
Data still flowing during shutdown: Records continue after checkpoint boundaries, breaking semantics
In other words, this is not “queue optimization,” but an architectural decision where control takes priority over throughput.
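A toy model makes the decision concrete. This simplified queue only demonstrates the prepareClose gating (the priority acknowledgment of barriers is abstracted away), and it is a stand-in rather than the real `IntermediateBlockingQueue`:

```python
# Toy model: once a barrier triggers prepareClose, ordinary records are
# rejected rather than queued, so data cannot flow past the checkpoint
# boundary during shutdown.
class IntermediateQueue:
    def __init__(self):
        self.items = []
        self.prepare_close = False

    def offer_record(self, record):
        if self.prepare_close:
            return False              # data no longer crosses the boundary
        self.items.append(("record", record))
        return True

    def offer_barrier(self, barrier_id, triggers_close=False):
        self.items.append(("barrier", barrier_id))
        if triggers_close:
            self.prepare_close = True
        return True

q = IntermediateQueue()
q.offer_record("r1")
q.offer_barrier(1, triggers_close=True)
assert q.offer_record("r2") is False          # rejected after prepareClose
assert q.items == [("record", "r1"), ("barrier", 1)]
```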
4.3 Why This Is Especially Important for Data Integration Systems
In data integration pipelines, downstream systems are often slower than upstream, and network/storage jitter is common.
If the system simply increases concurrency mechanically, three consequences arise:
Queue buildup worsens
Checkpoint cost increases
Shutdown and recovery become harder to converge
So what Zeta demonstrates here is not just “high concurrency capability,” but:
It knows when to continue throughput, and when to first enforce consistency and lifecycle convergence.
5. Low Resource Usage Is Not About Using Fewer Machines, but About Restraining Resource Decisions
“Low resource usage” is often misunderstood as “this engine consumes fewer machines.” Architecturally, a more accurate statement is:
The system avoids wasting resources on ineffective competition through a simpler resource model and explicit throttling mechanisms.
5.1 The Value of a Minimal Resource Model Lies in Low Scheduling Cost
ResourceProfile uses CPU and Memory as core resource descriptors, and provides merge, subtract, and enoughThan.
This is not a highly detailed model, but it has two practical advantages:
Simplicity → low scheduling computation cost
Generality → suitable for volatile and heterogeneous data integration workloads
The trade-off is also clear: it has limited expressiveness for network, disk, and downstream service bottlenecks.
Architectural judgment: This is a “good enough” resource model, not a “precise simulation” model.
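To show why this model keeps scheduling cheap, here is a minimal re-implementation of the two-dimensional profile with the three operations named above. Field names and units are illustrative, not copied from the Java class:

```python
# Minimal sketch of a CPU+memory resource profile with merge / subtract /
# enough_than: every scheduling decision reduces to a couple of
# constant-time comparisons.
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceProfile:
    cpu: float
    memory_mb: int

    def merge(self, other):
        return ResourceProfile(self.cpu + other.cpu,
                               self.memory_mb + other.memory_mb)

    def subtract(self, other):
        return ResourceProfile(self.cpu - other.cpu,
                               self.memory_mb - other.memory_mb)

    def enough_than(self, other):
        # "Can this profile satisfy that request?" is just two comparisons.
        return self.cpu >= other.cpu and self.memory_mb >= other.memory_mb

worker = ResourceProfile(cpu=8.0, memory_mb=15000)
request = ResourceProfile(cpu=2.0, memory_mb=4000)
assert worker.enough_than(request)
assert worker.subtract(request) == ResourceProfile(6.0, 11000)
```

The same simplicity is what makes it blind to network, disk, and downstream bottlenecks; those have to be handled by throttling instead.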
5.2 Dynamic Slots Are Essentially Elastic Partitioning Based on Remaining Capacity
In DefaultSlotService.requestSlot(...), if dynamic slots are enabled and remaining resources can satisfy the requested profile, a new SlotProfile is created on demand.
This means slots are not statically partitioned, but dynamically sliced based on available capacity.
Benefits:
Higher resource utilization
More flexible scheduling
Suitable for mixed workloads with fluctuating load
But this does not mean the system is immune to overload. If upstream jobs expand parallelism uncontrollably, dynamic slots will only expose the problem faster.
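The carve-from-remaining-capacity idea, and the way it surfaces overload instead of hiding it, can be sketched like this (a simplified stand-in for `DefaultSlotService.requestSlot(...)`):

```python
# Sketch of dynamic slots: a slot is sliced from remaining capacity on
# demand instead of being statically partitioned up front.
class SlotService:
    def __init__(self, total_cpu, total_mem):
        self.free_cpu, self.free_mem = total_cpu, total_mem
        self.slots = []

    def request_slot(self, cpu, mem):
        if self.free_cpu >= cpu and self.free_mem >= mem:
            self.free_cpu -= cpu
            self.free_mem -= mem
            slot = {"cpu": cpu, "mem": mem}
            self.slots.append(slot)
            return slot
        return None   # not enough remaining capacity: the request is refused

svc = SlotService(total_cpu=8, total_mem=15000)
assert svc.request_slot(cpu=6, mem=8000) is not None
assert svc.request_slot(cpu=6, mem=8000) is None   # overload surfaces immediately
```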
5.3 What Actually Suppresses Resource Instability Is Checkpoint Throttling
checkpointInterval, checkpointMinPause, and checkpointTimeout are not just configurations, but stability valves:
interval: how frequently snapshots occur
minPause: enforced gap between checkpoints
timeout: maximum duration before abort
Improper configuration leads to a vicious cycle:
Frequent checkpoints → higher state cost → slower barriers → more timeouts → more recovery → increased resource instability
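The interaction between interval and min-pause can be simulated with idealized timing (each checkpoint is assumed to complete instantly, which is an assumption, not how the engine behaves under load). The counts it produces line up with the local-mode observations reported later in the article:

```python
# Idealized cadence model: the next trigger must respect both the interval
# and the enforced pause after the previous checkpoint.
def checkpoint_times(job_duration_ms, interval_ms, min_pause_ms=0):
    times, t = [], interval_ms
    while t <= job_duration_ms:
        times.append(t)
        t += max(interval_ms, min_pause_ms)
    return times

# 12s job, interval = 2000: triggers at 2,4,6,8,10,12s -> 6 checkpoints
# (cf. the observed 5 regular + 1 final).
assert len(checkpoint_times(12000, 2000)) == 6

# Adding min-pause = 5000 stretches the cadence to 2s, 7s, 12s -> 3
# (cf. the observed 2 regular + 1 final).
assert checkpoint_times(12000, 2000, 5000) == [2000, 7000, 12000]
```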
5.4 Throttling Is Often More Effective Than Scaling
Configurations like read_limit.rows_per_second and read_limit.bytes_per_second have high architectural value.
Because often the system is not “computationally insufficient,” but:
Downstream cannot keep up
Excessive concurrency only creates retries and backlog
Resources are wasted on ineffective contention
Therefore, for slow or rate-limited downstream systems, the recommended approach is:
Throttle first, observe, then scale.
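As a concrete illustration, a throttled `env` block in the style of the scenarios below might look like this. The `read_limit` keys follow speed-limit.md (linked in the appendix); the specific values are placeholders to be tuned by observation:

```hocon
env {
  job.mode = "STREAMING"
  parallelism = 2
  checkpoint.interval = 5000
  # Cap source output so a slow or rate-limited downstream is not
  # buried in retries and backlog; observe first, scale later.
  read_limit.rows_per_second = 400
  read_limit.bytes_per_second = 7000000
}
```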
5.5 Closed Loop of Resource Scheduling and Throttling
6. From an Architectural Perspective, What Scenarios Is Zeta Suitable For
From the current design, Zeta’s strengths are clear:
Clear data integration pipelines from Source to Sink
Need for recoverable and traceable consistency guarantees
Production environments where manual intervention after recovery is unacceptable
Desire to maintain stable operation under limited resources via dynamic allocation and throttling
Correspondingly, its focus is not on maximizing every operator capability, but on:
Clearly defining consistency boundaries
Completing recovery loops
Ensuring convergence under concurrency
Turning resource control into a system-level capability
7. If You Want to Apply It in Practice, Focus on These Four Things
7.1 For Connector Developers
Do not treat prepareCommit(checkpointId) as a normal flush
commit(...) must be idempotent and retryable
External side effects must align with checkpoint completion
7.2 For Source Developers
snapshotState(...) and run(...) may run concurrently; ensure thread safety
Fully implement addSplitsBack(...) and reader failover
Do not only restore split state while ignoring protocol termination signals
7.3 For Operators
Do not assume higher parallelism is always better
Tune checkpoint.interval, checkpoint.timeout, and min-pause first
Use read_limit for fragile downstream systems
Prefer cluster mode for savepoint / restore demonstrations
7.4 For Architecture Reviewers
Evaluate Exactly-Once together with external system idempotency
Evaluate recovery beyond state snapshots, including protocol compensation
Evaluate performance not just by throughput, but by convergence during shutdown and recovery
In architecture articles, it is not valid to conclude that an "architecture is advanced" based only on a set of Total Read/Write and Total Time figures.
The sample statistics in the quick-start documentation can only demonstrate three things at most:
The pipeline is runnable.
Read/write forms a closed loop.
No failures occur in the minimal environment.
It alone cannot prove upper limits of high concurrency, recovery efficiency, or cost-performance ratio under different resource specifications.
I performed three additional minimal run validations on a single Ubuntu host with 8 vCPU / 15 GiB RAM, running the official apache/seatunnel:2.3.13 image in local mode.
Official batch template: 32 / 32 / 0, total time 3s
Custom batch job, parallelism=1, row.num=1000: 1000 / 1000 / 0, total time 3s
Custom batch job, parallelism=4, row.num=1000: 4000 / 4000 / 0, total time 3s
These three sets of data clearly show: the same total time may correspond to completely different data volumes and parallelism settings.
Therefore, drawing conclusions about "performance" without parallelism, data scale, resource specifications, and job type easily leads to distortion.
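The arithmetic behind that distortion is worth making explicit. Identical wall time hides a roughly 125x spread in effective throughput across the three runs:

```python
# Effective throughput for the three runs above: (rows, seconds).
runs = {
    "official template": (32, 3),
    "parallelism=1, row.num=1000": (1000, 3),
    "parallelism=4, row.num=1000": (4000, 3),
}
throughput = {name: rows / secs for name, (rows, secs) in runs.items()}

assert throughput["official template"] < 11          # ~10.7 rows/s
assert throughput["parallelism=4, row.num=1000"] == 4000 / 3   # ~1333 rows/s
# Same "total time 3s", but a ~125x spread in rows/second.
```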
In a batch job lasting approximately 12s, I added two sets of local-mode control-plane validations:
When checkpoint.interval = 2000, 5 regular checkpoints completed plus 1 final checkpoint were observed.
After adding min-pause = 5000, only 2 regular checkpoints plus 1 final checkpoint were observed within similar job duration.
After adding read_limit.rows_per_second = 5, for the same 100 rows, job duration increased from ~12s to ~21s.
This shows that min-pause and read_limit are not "decorative configurations" — they actually change control rhythm and runtime.
I also performed a validation in single-machine cluster mode specifically for savepoint / restore:
After running for 8s in a ~50s batch job, job status remained RUNNING, and checkpoint overview recorded 6 completed checkpoints.
After executing -s, job status became SAVEPOINT_DONE, and SAVEPOINT_TYPE appeared in checkpoint history.
Using the same jobId to execute -r for restoration, foreground restoration completed in ~37s, final statistics 500 / 500 / 0.
From only the final line 500 / 500 / 0, you cannot tell whether it "resumed from a breakpoint." But combined with the prior ~16s runtime and savepoint records, a more reasonable engineering judgment is:
the restoration processed remaining splits, not a full re-run.
I also tested adding read_limit.bytes_per_second = 10000 to a large-field example; total duration remained ~12s. This is because FakeSource split reading became the bottleneck first, not because "byte rate limiting does not work."
Once again, discussing performance numbers without load context easily leads to misjudgment.
Of course, these are only runtime observations, not strict benchmarks based on the c5ceb6490 build.
Instead of only looking at throughput, I suggest observing four types of metrics simultaneously:
Consistency metrics: duplication, loss, unfinished commits
Recovery metrics: time to recover after failure, need for manual intervention
Resource metrics: CPU, Heap, thread count, checkpoint duration
Convergence metrics: data inflow during shutdown, barrier delays
Two recommended comparison scenarios:
Scenario A (high parallelism, frequent checkpoints):

env {
  job.mode = "STREAMING"
  parallelism = 128
  checkpoint.interval = 1000
}
source {
  FakeSource {
    row.num = 100000000
    split.num = 128
    split.read-interval = 1
  }
}
sink {
  Console {
  }
}
Scenario B (moderate parallelism, relaxed checkpoints):

env {
  job.mode = "STREAMING"
  parallelism = 32
  checkpoint.interval = 5000
}
source {
  FakeSource {
    row.num = 100000000
    split.num = 32
    split.read-interval = 100
  }
}
sink {
  Console {
  }
}
The above two configurations are more suitable for observing control links and recovery behavior, not for serious throughput benchmarking.
FakeSource in c5ceb6490 supports split.read-interval, not rate.
In addition, row.num in FakeSource means total generated rows per parallelism.
What these two scenarios truly compare is not just "who is faster," but:
Whether higher parallelism actually delivers effective throughput
Whether shorter checkpoint intervals stabilize recovery boundaries or cause timeouts
Whether the system throttles gracefully when sinks slow down, or amplifies congestion
A practical observation: in my minimal tests, min-pause did reduce checkpoint count within the same time window, and read_limit did increase total runtime. Both configurations are observable and verifiable.
If we regard Zeta as a stability engine, its most promising future direction may not be stacking more "performance parameters," but developing adaptive capabilities.
For example:
When Checkpoint slows down, can the system automatically identify whether the bottleneck is Source, Queue, Sink, or insufficient Slot resources?
When downstream writing slows, can the system automatically adjust read_limit based on real-time metrics, instead of requiring manual throttling after backlog occurs?
When a job recovers, can the system inform the user in advance: which checkpoint recovery starts from, how many splits remain, expected impact scope?
Furthermore, Exactly-Once capabilities on the connector side can become more explicit.
This does not mean the current version fully supports these capabilities. But once the control plane, state plane, data plane, and resource plane form a closed loop, the system can evolve from "recover after failure" to "predict before failure, adapt during runtime."
11. Final Thoughts: What Makes Zeta Valuable Is Turning Stability into a System Capability
Looking at individual code points, many implementations in Zeta are not particularly flashy.
But architecturally, it gets several critical things right:
CheckpointCoordinator as a unified consistency control entry
Aggregated Committer binding external commits to checkpoint completion
restoreTaskState(...) and Enumerator-based recovery forming a complete resume loop
Barrier priority and prepareClose ensuring convergence under concurrency
ResourceProfile, dynamic slots, and read_limit making resource control a system-level strategy
What deserves recognition is not a single powerful module, but that it places the most failure-prone aspects of data integration systems into a unified, explainable engineering mechanism.
If you are an architect, what matters is not just whether it is fast, but whether it remains explainable, convergent, and operable under failure, recovery, commit, and resource fluctuation.
From this perspective, Zeta’s real value is not extreme optimization in one area, but placing these concerns into a system that can be traced, verified, and reasoned about.
SeaTunnel Zeta’s competitiveness lies not in pushing a single capability to the extreme, but in closing the loop across consistency, recovery, concurrency, and resource management.
Appendix: Source Code Reference Anchors
If you want to further explore the source code, it is recommended to start with the following entry points.
CheckpointCoordinator.tryTriggerPendingCheckpoint
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/CheckpointCoordinator.java#L500-L582
CheckpointCoordinator.restoreTaskState
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/CheckpointCoordinator.java#L306-L344
SeaTunnelSink
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-api/src/main/java/org/apache/seatunnel/api/sink/SeaTunnelSink.java#L40-L127
SinkFlowLifeCycle.received / notifyCheckpointComplete
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkFlowLifeCycle.java#L191-L244
SinkAggregatedCommitterTask.notifyCheckpointComplete
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SinkAggregatedCommitterTask.java#L303-L332
SourceSplitEnumeratorTask.restoreState
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SourceSplitEnumeratorTask.java#L187-L207
SourceSplitEnumeratorTask.receivedReader
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SourceSplitEnumeratorTask.java#L221-L246
DefaultSlotService.requestSlot
https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/service/slot/DefaultSlotService.java#L168-L189
speed-limit.md
https://github.com/apache/seatunnel/blob/c5ceb6490/docs/zh/introduction/configuration/speed-limit.md