How to use Zeebe to build highly scalable, distributed workflow middleware
2021-05-04 15:46:22 — Jiedao jdon
Zeebe is a new kind of workflow/orchestration engine suited to cloud-native, cloud-scale applications. This article describes how Zeebe enables a new era of cloud-scale workflow automation. Zeebe is a truly distributed system with no central component: it is designed around first-class distributed computing concepts, follows the Reactive Manifesto, and applies techniques from high-performance computing.
Zeebe is built on the idea of event sourcing. This means that all changes to workflow state are captured as events, and these events are stored, together with the commands, in an event log. Both are treated as records in the journal. A quick note for DDD fans: these events are internal to Zeebe and relate to workflow state. If you run your own event-sourced system in your domain, it is typically built around domain events, and you set up and run your own event store.
Events are immutable, so the event log is append-only. Once something is written, it never changes, much like an accountant's journal. Append-only logs are relatively easy to handle, because:

- Since there are no updates, you cannot have multiple conflicting updates happening in parallel. Conflicting changes to state are always captured in a clear order as two immutable events, so an event-sourced application can decide deterministically how to resolve the conflict. Compare this with implementing a counter in an RDBMS: if multiple nodes update the same row in parallel, the updates overwrite each other. This has to be recognized and avoided, typically with optimistic or pessimistic locking combined with the database's ACID guarantees. An append-only log needs none of this.
- There are well-known strategies for replicating append-only logs.
- Persisting such a log is very efficient, because you only ever write to the end of it. Hard disks perform much better for sequential writes than for random writes.
The current state of a workflow can always be derived from these events. This is called a projection. In Zeebe, the projection is kept internally as a snapshot in RocksDB, a very fast key-value store. RocksDB also allows Zeebe to fetch data by key, whereas a pure log does not even support simple queries such as "give me the current state of workflow instance 2".
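To make the idea of a projection concrete, here is a minimal sketch (not Zeebe's actual code; the event type and field names are invented for illustration) of deriving current state by replaying an append-only event log:

```python
# Minimal sketch of a projection: replay immutable events in order
# to rebuild a key-value snapshot of the current state.

def project(event_log):
    """Derive the current state (instance id -> active element) from events."""
    state = {}
    for event in event_log:
        if event["type"] == "INSTANCE_CREATED":
            state[event["instance"]] = "start"
        elif event["type"] == "ELEMENT_ACTIVATED":
            state[event["instance"]] = event["element"]
        elif event["type"] == "INSTANCE_COMPLETED":
            del state[event["instance"]]  # finished instances leave the snapshot

    return state

log = [
    {"type": "INSTANCE_CREATED", "instance": 2},
    {"type": "ELEMENT_ACTIVATED", "instance": 2, "element": "collect-money"},
    {"type": "INSTANCE_CREATED", "instance": 3},
    {"type": "INSTANCE_COMPLETED", "instance": 3},
]
print(project(log))  # {2: 'collect-money'}
```

This is exactly the kind of lookup the snapshot enables: answering "what is the current state of instance 2?" without scanning the whole log on every query.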
As the log grows, you have to think about removing old data from it; this is called log compaction. In an ideal world, for example, we could delete the events of all workflow instances that have ended. Unfortunately this is very complicated, because the events of a single workflow instance can be scattered all over the log, especially when you remember that workflow instances can run for days or even months. Our experiments showed clearly that log compaction was not only inefficient but also left the log badly fragmented.

So we decided to do things differently: as soon as an event has been fully processed and applied to the snapshot, we delete it right away. I will come back to what "fully processed" means later. This keeps the log clean and tidy at all times, without losing the benefits of append-only logs and stream processing.
Zeebe writes its log to disk, and RocksDB also flushes its state to disk. Currently this is the only supported option. We regularly discuss making the storage logic pluggable, for example to support Cassandra, but so far we have focused on the file system, and it may well remain the best choice for most use cases, simply because it is the fastest and most reliable option.
The single writer principle
When multiple clients access the same workflow instance at the same time, you need some form of conflict detection and resolution. With an RDBMS this is usually done via optimistic locking or some database magic. With Zeebe, we solve the problem by applying the Single Writer Principle. As Martin Thompson wrote:

Contended access to mutable state requires mutual exclusion or conditional update protection. Either of these protection mechanisms causes queues to form as contended updates are applied. To avoid this contention and the associated queueing effects, all state should be owned by a single writer for mutation purposes, thus following the Single Writer Principle.

So regardless of how many threads our machines have, or how big the Zeebe cluster is overall, there is always exactly one thread that writes to a given log. This is a good thing: the ordering is unambiguous, no locks are needed, and there can be no deadlocks. You do not waste time managing contention; all the time goes into real work.
If you are wondering whether this means Zeebe uses only a single thread to run all workflow logic: so far, you are right! I will get to scaling Zeebe later.
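The single writer principle can be sketched in a few lines. This is an illustration, not Zeebe's implementation: many producer threads hand commands to a queue, but only one thread ever mutates the log, so the log itself needs no locking and the resulting order is unambiguous.

```python
# Illustrative sketch of the Single Writer Principle: producers enqueue
# commands, but exactly ONE thread appends to the log, so the log needs
# no lock and every entry has one unambiguous position.
import queue
import threading

commands = queue.Queue()
log = []  # append-only; mutated only by the writer thread

def writer():
    while True:
        cmd = commands.get()
        if cmd is None:      # shutdown marker
            break
        log.append(cmd)      # the single point of mutation

def producer(name, n):
    for i in range(n):
        commands.put(f"{name}-{i}")

w = threading.Thread(target=writer)
w.start()
producers = [threading.Thread(target=producer, args=(p, 100)) for p in "AB"]
for p in producers:
    p.start()
for p in producers:
    p.join()
commands.put(None)
w.join()
print(len(log))  # 200 — every command applied exactly once, in one total order
```

The producers contend only on the queue hand-off, never on the log itself; that is the point Thompson makes about avoiding contention on mutable state.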
Event processing loop
To better understand what this single thread does, let's look at what happens when a client wants to complete a task in a workflow:

1. The client sends a command to Zeebe. This is a non-blocking call, but if you like, you can get a Future to receive the response later.
2. Zeebe appends the command to its log.
3. The log is stored on disk (and replicated; I will come back to that later).
4. Zeebe checks some invariants ("Can I actually process this command right now?"), changes the snapshot, and creates new events to be written to the log.
5. As soon as the invariants have been checked, a response is sent back to the client, even before the new events have been written to the log. This is safe, because even if the system crashes now, we can always replay the command and arrive at exactly the same result.
6. The resulting events are appended to the event log.
7. The log is stored on disk and replicated.
If you have thought this through carefully, you may ask: "Very nice, but what happens if the system crashes after we change the RocksDB state (step 4) but before the events are written to the log (steps 6 and 7)?" Good question! Zeebe marks a snapshot as valid only after all events have been processed. In any other case, it falls back to an older snapshot and reprocesses the events.
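The steps above can be condensed into a sketch. The command and event names here are made up for illustration and do not match Zeebe's actual record types:

```python
# Condensed sketch of the processing loop: persist the command, check
# invariants against the snapshot, respond early, then append the event.

journal = []    # append-only log of commands and events
snapshot = {}   # projection, in the role of Zeebe's RocksDB state
responses = []

def process(command):
    journal.append(("COMMAND", command))            # steps 2-3: persist command
    job_id = command["job"]
    if snapshot.get(job_id) != "ACTIVATED":         # step 4: check invariants
        responses.append(("REJECTED", job_id))
        return
    snapshot[job_id] = "COMPLETED"                  # step 4: mutate projection
    responses.append(("COMPLETED", job_id))         # step 5: respond early
    journal.append(("EVENT", {"job": job_id, "state": "COMPLETED"}))  # steps 6-7

snapshot[42] = "ACTIVATED"
process({"type": "COMPLETE_JOB", "job": 42})
process({"type": "COMPLETE_JOB", "job": 99})  # unknown job -> rejected
print(responses)  # [('COMPLETED', 42), ('REJECTED', 99)]
```

Responding in the middle of the loop (step 5) is safe precisely because the command was already persisted in step 2: a crash before steps 6–7 can always be repaired by replaying the command deterministically.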
Stream processors and exporters
I have been talking about event sourcing. There is actually a related concept that matters just as much: stream processing. The append-only log of events (or records, to be precise) forms a constant stream of events. Internally, Zeebe is built around the concept of processors, each of which is a single thread (as described above). The most important processor is the one that actually implements the BPMN workflow engine part, so it understands commands and events semantically and knows what to do next. It is also responsible for rejecting invalid commands.

But there are more stream processors, most importantly the exporters. These exporters also process every event in the stream. One out-of-the-box exporter writes all data to Elasticsearch, where it can be retained and queried later. For example, the Zeebe operations tool Operate uses this data to visualize the state of running workflow instances, incidents, and so on.

Each exporter tracks the log position of the data it has read. As soon as all stream processors have successfully processed a piece of data, it can be deleted, as described under log compaction above. The trade-off is that you cannot add a new stream processor later and have it replay all events from history, as you could with Apache Kafka.
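This position-based cleanup can be sketched as follows. The processor names and positions are invented; the point is that only the minimum of all consumer positions bounds what may be deleted:

```python
# Sketch of position-based log cleanup: each stream processor tracks the
# last log position it has handled; entries before the minimum of all
# positions have been processed by everyone and can be deleted safely.

def compact(log_start, positions):
    """Return the new start of the log given each consumer's position."""
    safe = min(positions.values())  # everything before this is fully processed
    return max(log_start, safe)

positions = {"workflow-engine": 120, "elasticsearch-exporter": 97}
print(compact(0, positions))  # 97 -> entries before position 97 can go
```

This also makes the trade-off visible: a newly added consumer would start with no position, so there is no retained history for it to replay, unlike in Kafka where retention is independent of consumers.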
Peer-to-peer clustering
To provide fault tolerance and resilience, you can run multiple Zeebe brokers, which form a peer-to-peer cluster. We designed this so that no central component or coordinator is required, so there is no single point of failure.

To form a cluster, you configure at least one other broker as a known contact point in each broker. During startup, a broker talks to this contact broker and fetches the current cluster topology. Afterwards, a gossip protocol keeps the cluster view up to date and in sync.
Replication with the Raft consensus algorithm
The event log now has to be replicated to other nodes in the network. Zeebe uses distributed consensus, more specifically the Raft consensus algorithm, to replicate the event log between brokers. Atomix is used as the implementation.

The basic idea is that there is one leader and a set of followers. When the Zeebe brokers start up, they elect a leader. Since the cluster constantly sends messages back and forth, the brokers recognize when the leader has crashed or become unreachable and try to elect a new leader. Only the leader is allowed write access to the data. The data written by the leader is replicated to all followers, and only after successful replication are the events (or commands) processed in the Zeebe broker. If you are familiar with the CAP theorem, this means we decided for consistency over availability, so Zeebe is a CP system. (I apologize to Martin Kleppmann, who wrote "Please stop calling databases CP or AP", but I think it helps in understanding Zeebe's architecture.)

We have to tolerate network partitions, because every distributed system has to tolerate partitions; you have no influence on that at all (see http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/ and https://aphyr.com/posts/293-jepsen-kafka). We decided for consistency rather than availability, because consistency is one of the core promises of workflow automation use cases.
An important configuration option is the replication group size. In order to elect a leader or to successfully replicate data, you need a so-called quorum, meaning confirmation from a certain number of the other Raft members. Because we want consistency, Zeebe requires a quorum of (replication group size / 2) + 1, rounded down. A simple example:

- Zeebe nodes: 5
- Replication group size: 5
- Quorum: 3

So we can still make progress as long as 3 nodes are reachable. A network segment needs a quorum to keep working; if only two nodes are alive in a segment, nothing can be done there. Clients connected to those two nodes cannot continue working, because a CP system cannot guarantee availability in this situation.

This avoids the so-called split-brain phenomenon, because you can never end up with two network segments doing conflicting work in parallel.
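The quorum arithmetic from the example above can be written down directly:

```python
# Quorum arithmetic for the example above: a partition of the network can
# make progress only while it can reach a majority of the replication group.

def quorum(replication_group_size):
    return replication_group_size // 2 + 1

def can_make_progress(reachable_nodes, replication_group_size):
    return reachable_nodes >= quorum(replication_group_size)

print(quorum(5))                 # 3
print(can_make_progress(3, 5))   # True  — the 3-node segment keeps working
print(can_make_progress(2, 5))   # False — the 2-node segment must stop
```

Since at most one segment can hold a majority, split brain is ruled out by construction: the two sides of a partition can never both satisfy the predicate.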
When the leader writes log entries, they are first replicated to the followers and only then processed.

This means that every processed log entry is guaranteed to be correctly replicated, and replication guarantees that no committed log entry is lost. A larger replication group size allows greater fault tolerance but increases network traffic. Since replication to the different nodes happens in parallel, it may not actually have much impact on latency. Also, the broker itself is not blocked by replication, because it can keep processing other work in the meantime.

Replication is also the strategy for overcoming the challenges of writing to disk in virtualized and containerized environments, where you cannot control when data is physically written to disk. Even when you call fsync and it tells you the data is safe, it might not be. But we prefer having the data in the memory of several servers over having it on the disk of just one of them.

Although replication may add latency to the processing of a command in Zeebe, it does not impact throughput much. Zeebe's stream processors do not block while waiting for responses from followers, so Zeebe can keep processing at a high rate; it is only the client waiting for its response that may have to wait a bit longer.
To start a new workflow instance or to complete a task, you need to talk to Zeebe. The easiest way is to use one of the ready-made language clients, for example Java, NodeJS, C#, Go, Rust, or Ruby. And thanks to gRPC, almost any programming language can be used.

Clients talk to the Zeebe gateway, which knows the cluster topology of the Zeebe brokers and routes each request to the correct leader for that request. This design makes it very easy to run Zeebe in the cloud or on Kubernetes, because only the gateway needs to be reachable from outside.
Scaling with partitions
So far we have talked about a single thread handling all the work. If you want to leverage more than one thread, you have to create partitions. Each partition represents a separate physical append-only log.

Each partition has its own single writer, which means you can use partitions to scale. Partitions can be assigned to

- different threads, or
- different broker nodes.

Each partition forms its own Raft group, so each partition has its own leader. If you run a Zeebe cluster, one node can be the leader for one partition and a follower for others. This can be a very efficient way to run the cluster.
All events related to one workflow instance must go to the same partition; otherwise we would violate the single writer principle, and it would also be impossible to recreate the current state of the instance locally on a broker node.

One challenge is deciding which workflow instance goes to which partition. Currently, this is a simple round-robin mechanism: when a workflow instance is started, the gateway puts it on one of the partitions. The partition ID even becomes part of the workflow instance ID, which makes it very easy for every part of the system to know which partition a given workflow instance lives on.
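A sketch of both ideas — round-robin assignment and a partition-carrying instance key — might look like this. The bit layout is purely illustrative and is not Zeebe's actual key encoding:

```python
# Sketch: the gateway assigns new instances round-robin, and the partition
# id is embedded in the high bits of the instance key, so any component can
# recover the partition from the key alone. Bit layout is illustrative.
import itertools

PARTITIONS = 3
round_robin = itertools.cycle(range(PARTITIONS))

def new_instance_key(counter):
    partition = next(round_robin)        # gateway picks the next partition
    return (partition << 51) | counter   # key carries its partition id

def partition_of(key):
    return key >> 51                     # recover the partition from the key

keys = [new_instance_key(i) for i in range(5)]
print([partition_of(k) for k in keys])  # [0, 1, 2, 0, 1]
```

With this scheme, routing a request about an existing instance never requires a lookup: the partition falls out of the instance ID itself.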
An interesting use case is message correlation. A workflow instance may wait for a message (or event) to arrive. Typically, that message does not know the workflow instance ID but carries other information, say an order-id. So Zeebe has to find out whether any workflow instance is waiting for a message with this order-id. How can that be done efficiently and horizontally scalably?

Zeebe simply creates a message subscription, which lives on a partition that may be different from the partition of the workflow instance. That partition is determined by a hash function over the correlation identifier, so it can be found easily both by a client submitting a message and by a workflow instance arriving at the point where it needs to wait for a message. Which of the two happens first does not matter (see message buffering), because single writers do not conflict. The message subscription always links back to the waiting workflow instance, which may live on another partition.
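The key property is that the subscription partition is a pure function of the correlation key, so both sides compute it independently with no coordination. A sketch (the hash choice and partition count are illustrative, not Zeebe's):

```python
# Sketch of hash-based message correlation: both the publishing client and
# the waiting workflow instance derive the subscription partition from the
# correlation key alone, so they meet on the same partition by construction.
import hashlib

PARTITIONS = 3

def subscription_partition(correlation_key: str) -> int:
    digest = hashlib.md5(correlation_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % PARTITIONS

# Both sides land on the same partition without talking to each other:
publisher_side = subscription_partition("order-4711")
waiting_side = subscription_partition("order-4711")
print(publisher_side == waiting_side)  # True
```

Because that partition has its own single writer, "message arrives first" and "instance starts waiting first" are just two orderings of writes on one log, which is why buffering either side is conflict-free.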
Note that the number of partitions is static in the current Zeebe version: it cannot be changed once the broker cluster is in production. Although this may change in future Zeebe versions, it is absolutely important to plan a sensible number of partitions for your use case from the beginning. There is a production guide that helps you with these core decisions.
Multi-datacenter replication
Users regularly ask for replication across multiple data centers. There is no special support for this at the moment (yet). A Zeebe cluster can technically span multiple data centers, but you have to be prepared for increased latency. If you set up the cluster such that a quorum can only be reached with nodes from two data centers, it can survive even epic disasters, at the cost of that latency.
Why not use Kafka or ZooKeeper?
Many people ask why we wrote all of the above ourselves instead of simply using a cluster manager like Apache ZooKeeper, or even the fully mature Apache Kafka. Here are the main reasons for this decision:

- Ease of use and ease of getting started. We wanted to avoid third-party dependencies that have to be installed and operated before Zeebe can be used. And Apache ZooKeeper and Apache Kafka are not easy to operate.
- Efficiency. Having cluster management in the core broker lets us optimize it for our specific use case, workflow automation. Several features would be harder to build around an existing generic cluster manager.
- Support and control. In our long experience as an open-source vendor, we have learned that it is hard to support third-party dependencies at this core level. Of course we could start hiring core ZooKeeper contributors, but with so many parties involved it would still be hard, and the direction of those projects would not be under our own control. With Zeebe, we invested in controlling the whole stack, so that we can move at full speed in the direction we envision.
Beyond scalability, Zeebe is also highly performant on a single node.

For example, we always try to reduce garbage. Zeebe is written in Java. Java has garbage collection, which cannot be turned off. The garbage collector kicks in regularly and checks for objects that can be removed from memory. During a garbage collection run, the system is paused, and the duration of the pause depends on the number of objects checked or removed. This pause can add noticeable latency to your processing, especially if you process millions of messages per second. So Zeebe is programmed in a way that reduces garbage.
Another strategy is the use of ring buffers and batching wherever possible. This also allows multiple threads to be used without violating the single writer principle described above: whenever you send an event to Zeebe, the receiver adds the data to a buffer, and from there another thread takes over and actually processes the data. Another buffer is used for bytes that need to be written to disk.

This approach enables batch operations: Zeebe can write a whole pile of events to disk at once, or send several events to a follower in a single network round trip.
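A minimal sketch of this batching idea (not Zeebe's ring buffer implementation, just the flush-a-batch pattern it enables):

```python
# Sketch of write batching: producers append to an in-memory buffer and a
# separate flush step hands over whole batches, so one disk write (or one
# network round trip to a follower) can carry many events.

class BatchingWriter:
    def __init__(self):
        self.buffer = []
        self.flushes = []  # each entry stands for one physical write

    def append(self, event):
        self.buffer.append(event)  # cheap in-memory hand-off

    def flush(self):
        if self.buffer:
            self.flushes.append(list(self.buffer))  # one write, many events
            self.buffer.clear()

w = BatchingWriter()
for i in range(5):
    w.append(f"event-{i}")
w.flush()
print(len(w.flushes), len(w.flushes[0]))  # 1 5
```

Five events cost one flush instead of five, which is exactly why sequential, batched writes keep both the disk and the network efficient.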
Thanks to binary protocols — gRPC towards clients and a simple binary protocol internally — remote communication is very efficient.
Zeebe is a new kind of workflow/orchestration engine for cloud-native and cloud-scale applications. What sets Zeebe apart from all other orchestration/workflow engines is its performance and the fact that it is designed as a truly scalable and resilient system without any central component and without the need for a database.

Zeebe does not follow the traditional idea of the transactional workflow engine, where state is stored in a shared database and updated as the workflow moves from one step to the next. Instead, Zeebe works as an event-sourced system on top of replicated, append-only logs. So Zeebe has a lot in common with systems like Apache Kafka. Zeebe clients can publish and subscribe to work in a fully reactive fashion.

In contrast to other microservice orchestration engines on the market, Zeebe focuses on visual workflows, because we believe that visual workflows are the key to providing visibility into asynchronous interactions at design time, at runtime, and during operations.