当前位置:网站首页>How to use zebee to build highly scalable distributed workflow middleware?

How to use zebee to build highly scalable distributed workflow middleware?

2021-05-04 15:46:22 Jiedao jdon

Zeebe It's a new kind of workflow / Choreography Engine , Suitable for cloud native and cloud scale applications . This article describes how to use Zebee Enter a new era of cloud scale workflow automation .Zeebe It's a real distributed system , There are no central components , According to the first class distributed computing concept design , accord with Declaration of responsiveness , Applying high performance computing technology .

Event source

Zeebe be based on Event procurement is idea . This means that all changes to the workflow state are captured as events , And these events will be stored with the command in the event log . Both are considered to be recorded in the Journal .DDD Quick tips for fans : These events are Zeebe Inside , Related to workflow state . If you run your own event source system in a domain , It is usually Domain events Set up and run your own event store .

Events are immutable , So this event log only attaches . Once it's done, there's no change - It's like an accounting magazine . Just attaching logs is relatively easy to handle , because :

  • Because there is no update , You cannot have multiple conflicting updates in parallel . Conflicting changes to state are always captured in clear order as two immutable Events , So that the event source application can determine how to resolve the conflict with certainty . stay RDMS Implementation of counters in : If multiple nodes update the same data in parallel , Then the updates will overlap each other . This has to be recognized and avoided . The typical strategy is optimistic or pessimistic locking with the database ACID A combination of guarantees . Just attaching logs doesn't need to be done .
  • The known policy is to copy only attached logs .
  • Keeping these logs is very effective , Because you always write ahead of time . If you perform a sequential write instead of a random write , The performance of the hard disk will be better .

The current state of the workflow can always be derived from these events . This is called projection .Zeebe The projection in is saved internally for use RocksDB Snapshot ,RocksDB It's a very fast key value store . It's also allowed Zeebe Only through key get data . Pure logging doesn't even allow simple queries , for example “ Give me an example of workflow 2 Current state ”.

Record compression

As logs grow , You have to consider removing old data from it , This is called Log compression . for example , In the ideal world , We can delete the events of all workflow instances that have ended . Unfortunately , It's very complicated , Because events from a single workflow instance can be all over the place - Especially if you remember that workflow instances can run for days or even months . Our experiments clearly show that , Log compression is not only inefficient , And as a result, the logs become very fragmented .

We decided to do things in different ways . Once we have fully processed an event and applied it to the snapshot , We delete it immediately . I'll be back later “ Deal with it completely ”. This allows us to keep our logs clean and tidy all the time , Without losing the benefits of just attaching logs and stream processing .


Zeebe  Write log to disk ,RocksDB Also refresh its status to disk . at present , This is the only supported option . We often talk about making storage logic pluggable - For example, support Cassandra - But so far we've focused on file systems , It may even be the best choice for most use cases , Because it's just the fastest and most reliable choice .

The principle of single writing

When you have multiple clients accessing a workflow instance at the same time , You need to do some kind of conflict detection and resolution . When you use RDMS when , Usually by Optimistic locking Or some database magic . Use Zeebe, We use Single Writer Principle That solved the problem . just as Martin Thompson written

Contention for mutable state access requires mutual exclusion or conditional update protection . Any one of these protection mechanisms results in queues when applications compete for updates . To avoid this contention and the associated queuing effect , All States should be owned by a single author , In order to mutate , To follow a single author principle .

therefore , The number of threads on our machine is the same as Zeebe Regardless of the overall size of the cluster , There is always only one thread that can write to a log . This is good : The order is clear , No need to lock , There will be no deadlock . You don't waste time managing contention , But you can do the actual work at any time .

If you want to know if that means Zeebe Only one thread is used to complete the workflow logic , So you're right so far ! I'll talk about scaling later Zeebe The problem of .

Event processing loop

To better understand what a single thread is doing , Let's see what happens if the client wants to complete the task in the workflow :


  1. The client sends the command to Zeebe, This is a non blocking call , But if you like , You can get one later Future To receive a response .
  2. Zeebe Attach the command to its log .
  3. Logs are stored on disk ( And copy - I'll work it out later ).
  4. Zeebe Some invariants are checked (“ Can I really handle this command now ?”), Change the snapshot and create a new event to write to the log .
  5. After checking the invariants , Even if new events have not been written to the log , It also immediately sends a response to the client . It's safe , Because even if the system crashes now , We can always replay the command and get exactly the same result again .
  6. The resulting event will be attached to the event log .
  7. Logs are stored on disk and copied .

If you have an in-depth understanding of business thinking , You may ask a question :“ very good - But if we change RocksDB state ( step 4) And before we log events, the system crashes ( step 6 and 7) What will happen? ?” Good question !Zeebe Validate the snapshot only after all events have been processed . In any other case , Use older snapshots and reprocess Events .

Stream processors and exporters

I was talking about event procurement / Tracing to the source . actually , It's important to have a related concept : Stream processing . By events ( Or accurate records ) The only additional log composed of is a fixed flow of events .Zeebe Internal processor based concepts , Each processor is a thread ( As mentioned above ). The most important processor is actually the implementation BPMN Workflow engine part , So it understands commands and events semantically , And know what to do next . It's also responsible for rejecting invalid orders .

But there are more stream processors , most important of all Exporter, These exporters also handle each event of the stream . One The out of the box exporter is writing all the data to Elasticsearch, It can be kept in the future and queried . for example ,Zeebe Operating tools Operate Using this data to visualize running workflow instances , The state of events, etc .

Each exporter knows the log location where it reads the data . As long as all stream processors have successfully processed the data , It will delete the data , As described in log compression above . The trade-off here is , You cannot add a new stream processor later , And let it replay all events from history , As in the Apache Kafka In the same .

Point to point clustering

To provide fault tolerance and resilience , You can run multiple Zeebe agent , These agents form a peer-to-peer cluster . We design it in a way that doesn't require any central component or coordinator , So there is no single point of failure .

To form a cluster , You need to configure at least one other agent as a known point of contact in the agent . During agent startup , It communicates with this other agent and gets the current cluster topology . after , Use Gossip agreement Keep the cluster view up-to-date and synchronized .

Use Raft Consensus Algorithm to copy

Now you have to copy the event log to other nodes in the network .Zeebe Using distributed consensus - More specifically Raft Consensus algorithm  - To copy the event logs between brokers .Atomix As its implementation .

The basic idea is to have a leader and a group of fans . When Zeeber Startup time , They will choose a leader . As the cluster continues to send messages back and forth , Brokers will recognize whether the leader has collapsed or disconnected and try to choose a new leader . Only leaders are allowed write access to data . The data written by the leader is copied to all fans . Only after successful replication , Will be in Zeebe Handling events in the agent ( Or order ). If you are familiar with CAP Theorem , It means that we decide on consistency rather than usability , therefore Zeebe It's a CP System .( I asked Martin Kleppmann apologize , He wrote , Please stop calling the database CP or AP, But I think it helps to understand Zeebe The architecture of ).

We tolerate partitioning the network , Because we have to tolerate partitions in every distributed system , You don't have an impact on that at all ( see also http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with- partition-tolerance / and https://aphyr.com/posts/293-jepsen-kafka). We decided on consistency, not usability , Because consistency is one of the promises of workflow automation use cases .

An important configuration option is replication group size . In order to elect leaders or successfully replicate data , You need a so-called quorum , It means to other Raft Confirmation of a certain number of members . Because we want consistency , therefore Zeebe A quorum is required ≥( Copy group size / 2)+ 1. Let's take a simple example :

  • Zeebe node :5
  • Copy group size :5
  • A quorum :3

therefore , If there is 3 Nodes can access , We can still work . A network segment must reach a legal number to continue to work , If only two nodes are alive, nothing can be done , If there are clients accessing two nodes in this network segment , It cannot continue to access , because CP The system can't guarantee availability .

This avoids the so-called split brain phenomenon , Because you can't end up with two network segments that do conflict work in parallel .


When leaders write log entries , They will be copied to the followers first , And then we can execute .

This means that each processed log entry can be copied correctly . And replication ensures that no committed log entries are lost . A larger replication group size allows for greater fault tolerance , But it will increase the traffic on the network . Because replication of multiple nodes is done in parallel , So it may not actually have a big impact on the delay . Besides , The agent itself is not blocked by replication , Because it can be handled effectively

Replication is also a strategy to overcome the challenges of writing to disk in virtualized and containerized environments . Because in these environments , When data is physically written to disk , You can't control . Even if you call fsync And it tells you that the data is secure , Maybe not . But we prefer to store data in the memory of several servers , Instead of putting it on the disk of one of the servers .

Although replication may increase Zeebe Delay in command processing in , But it won't have much impact on throughput .Zeebe The stream processor in does not jam waiting for a reply from a follower . therefore Zeebe Can continue to process quickly - But the client waiting for his response may have to wait for a while .


To start a new workflow instance or complete a task , You need to work with Zeebe conversation . The easiest way is to take advantage of one of the ready-made language clients , for example Java,NodeJs,C#,Go,Rust or Ruby. And because of gRPC, You can use almost any programming language .

The client and Zeebe Gateway communication , The latter knows Zeebe Proxy the cluster topology and route the request to the correct leader of the request . This design makes Zeebe In the cloud or Kubernetes It's very easy to run in , Because you only need to access the gateway from the outside .

With partition expansion

up to now , We talked about having only one thread to handle all the work . If you want to take advantage of multiple threads , You must create a partition . Each partition represents a separate physical append log .

Each partition has its own single writer , This means that you can extend with partitions . You can assign partitions

  • Different threads or
  • Different proxy nodes .

Each partition forms its own Raft Group , So each division has its own leader . If you run Zeebe colony , Then a node can be the leader of a partition , It can also be a follower of other partitions . This can be a very effective way to run a cluster .

All events related to a workflow instance must enter the same partition , Otherwise, we will violate the principle of single writing , It is also not possible to recreate the current state in the proxy node locally .

One challenge is how to determine which workflow instance enters which partition . at present , It's a simple circular mechanism . When starting a workflow instance , The gateway will put it in a partition . Partition ID You can even get workflow instances ID Part of , This makes it very easy for each part of the system to understand the partition where each workflow instance is located .

An interesting use case is message correlation . Workflow instances may wait for messages ( Or event ) arrive . Usually , The message does not know the workflow instance ID, But related to other information , Assuming that order-id. therefore ,Zeebe You need to find out if any workflow instances are waiting to have this order-id The news of . How to scale effectively and horizontally ?

Zeebe Just create a message subscription , It is located on a partition that may be different from the workflow instance . The partition is determined by the hash function on the correlation identifier , Therefore, it can be easily found by the client submitting the message or by the workflow instance reaching the point where it needs to wait for the message . Where does this happen ( See Message buffering ) Not important , Because single writers don't conflict . Message subscription always links back to the waiting workflow instance - Maybe living in another zone .

Please note that , At present Zeebe The number of partitions in the version is static . Once the agent cluster is in production , You can't change it . Although this is Zeebe There may be changes in future versions of , But it's absolutely important to plan a reasonable number of partitions for your use cases from the beginning . There is one Production guidelines can Help you make core decisions .

Multi data center replication

Users often require multiple data center replication . There is no special support at the moment ( not yet ).Zeebe Clusters can technically span multiple data centers , But you have to be prepared for increased delays . If you set up the cluster in one way , Only nodes from two data centers can reach Arbitration , Even epic disasters , And at the cost of delay .

Why not use Kafka or Zookeeper Well ?

A lot of people are asking why we have to write all of the above on our own , Instead of simply using Apache Zookeeper Cluster managers like that , Not even fully mature Apache Kafka. Here are the main reasons for this decision :

  • Easy to use and easy to get started . We want to avoid using Zeebe Third party dependencies that need to be installed and operated before .Apache Zookeeper or Apache Kafka Not easy to operate .
  • efficiency . Cluster management in the core agent allows us to target specific use cases ( Workflow automation ) Optimize it . If you build around an existing generic cluster manager , So some of the features will be harder .
  • Support and control . In our long experience as an open source provider , We learned that it's hard to support third-party dependencies at this core level . Of course , We can start hiring the core Zookeeper contributor , But because there are many participants , So it's still hard , So the direction of these projects is not under our own control . adopt Zeebe, We invest in controlling the entire stack , So that we can move at full speed in the direction we want .

performance design

In addition to scalability ,Zeebe It can also be in the single ​​ High performance on nodes .

therefore , For example, we always try to reduce rubbish .Zeebe Yes, it is Java Compiling .Java There are so-called garbage collection , Unable to close . The garbage collector starts periodically and checks for objects that can be removed from memory . During garbage collection , The system will pause - The duration depends on the number of objects checked or deleted . This pause may add significant delay to your processing , Especially if you process millions of messages per second . therefore Zeebe The best way to program is to reduce garbage .

Another strategy is to use Ring buffer And use batch statements as much as possible . This also allows you to use multiple threads without violating the single writer principle described above . therefore , Whenever you ask Zeebe When sending an event , The receiver adds data to the buffer . From there, , Another thread will actually take over and process the data . Another buffer is used for bytes that need to be written to disk .

This method can realize batch operation .Zeebe You can write a bunch of events to disk at one time ;  Or send some events to followers through a network round trip .

Using binary protocol ( Such as gRPC) Simple binary protocol to client and internal , Remote communication can be done very efficiently .


Zeebe It's a new kind of workflow / Choreography Engine , Suitable for cloud native and cloud scale applications .Zeebe With all the other choreography / The difference between a workflow engine is its performance and its design as a truly scalable and resilient system , Without any central component , Or you need a database .

Zeebe Not following the traditional idea of transactional workflow engine , The state is stored in the shared database , And update as you move from one step in the workflow to the next . contrary ,Zeebe Working as an event source system on top of the replicated only attached log . therefore Zeebe And Apache Kafka And other systems have a lot in common .Zeebe Customers can publish / execution work , So the complete reaction equation Reactive.

​​​​​​​ Contrary to other microservice choreography engines on the market ,Zeebe Focus on visual workflow , Because we think the visualization workflow is When the design , During runtime and operation The key to providing asynchronous interaction visibility .

本文为[Jiedao jdon]所创,转载请带上原文链接,感谢