How can complex systems stay stable during zero-downtime upgrades? Here is what you need to consider

2020-12-06 14:57:29 The wind and waves are as calm as a yard

Background

In the internet industry, upgrading online services is routine. According to our statistics, Idle Fish (Xianyu) engineers shipped more than a thousand releases over the past quarter, changing more than a million lines of code in total.

Some of these releases changed only a few lines of code, while others migrated and upgraded an entire cluster. However large the impact of a change, we must keep the online service available and make sure users notice nothing. This article uses the migration and upgrade of the Idle Fish search service as an example to introduce the technical approach behind it.

The underlying search service of Idle Fish consists of the query planning service Search Planner, the query understanding service Query Planner, the ranking service Rank Service, and the search engine Heaven Ask 3. The call relationships among them are shown in the figure below:

As you can see, the whole search service is composed of several independent microservices. The microservices are isolated from one another and provide their services through predefined interfaces. All of them are ultimately aggregated by Search Planner, which exposes a unified, complete search capability to callers.

On top of the underlying search service sit the business logic layer and the access gateway layer, whose structure is not covered here. A user's search request is first forwarded by the gateway layer to the logic layer, which then sends a search request to the underlying search service. This request chain spans dozens of clusters, the call depth reaches double digits, and hundreds of servers may be involved in serving a single request.

For a system this complex, the upgrade obviously cannot be completed in one step. The good news is that the reasonable decoupling of the microservices makes the upgrade much easier: it keeps a change in one place from rippling through the whole system, so we can handle each category of service separately.

  • Note 1: Search Planner is the search service gateway layer, built on a development framework that is function-based, service-oriented, visual, and parallelized.

  • Note 2: Query Planner's main role is to understand the user's input and then optimize the search terms, so that the search recalls better results.

  • Note 3: Rank Service is a real-time ranking service. It scores the candidate results recalled by the search engine according to multi-dimensional features. The higher the score, the more likely the item is to appear near the top of the search results.

  • Note 4: Heaven Ask 3 (HA3) is a stable, efficient, and powerful search engine developed by Alibaba. It provides search support for core Alibaba Group businesses, including Taobao and Tmall.

Maintaining compatibility

Before starting the upgrade, we first need to confirm that the upgraded service remains both forward and backward compatible. Maintaining compatibility not only reduces the workload, it also lowers the risk of failures caused by the upgrade.

To avoid incompatibilities introduced by an upgrade, we can summarize a few development principles:

  • Remote procedure calls (RPC) must be able to ignore unknown parameters and must tolerate missing parameters.

  • Before deleting an existing parameter, confirm the change with every party that depends on it. You can mark the parameter as Deprecated instead of removing it outright.

  • When reading a parameter, distinguish between a default value and a missing value.

  • If an interface change cannot be made compatible, create a new interface to replace the old one. Do not break compatibility on the old interface.
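The first and third principles can be sketched in a few lines. This is only a minimal illustration, not the Idle Fish implementation: the `SearchRequest` type, its fields, and `DEFAULT_PAGE_SIZE` are hypothetical names chosen for the example, and the payload is assumed to be an already-decoded dictionary.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

DEFAULT_PAGE_SIZE = 20                     # hypothetical server-side default
KNOWN_FIELDS = {"keyword", "page_size"}    # fields this version understands


@dataclass
class SearchRequest:
    keyword: str
    # None means "the caller did not send this field", which is different
    # from the caller explicitly sending 0.
    page_size: Optional[int] = None


def parse_request(payload: Mapping[str, Any]) -> SearchRequest:
    """Build a request from a decoded RPC payload, tolerating version skew."""
    # Fields sent by a newer caller are dropped instead of rejected.
    known = {k: v for k, v in payload.items() if k in KNOWN_FIELDS}
    # Optional fields may be absent; only the truly required one is checked.
    if "keyword" not in known:
        raise ValueError("keyword is required")
    return SearchRequest(**known)


def effective_page_size(req: SearchRequest) -> int:
    """Apply the default only when the field was missing."""
    return DEFAULT_PAGE_SIZE if req.page_size is None else req.page_size
```

With this shape, an old server accepts a payload such as `{"keyword": "bike", "sort": "price"}` from a newer client, and an explicit `page_size` of 0 is not silently replaced by the default.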

When upgrading, start with the services that have no external dependencies. Once a dependency has been upgraded, upgrade the services that rely on it. After the upgrade order is settled, we choose an upgrade plan for each service according to its actual situation.
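One way to derive such an order is a topological sort of the dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The edges in `EXAMPLE_DEPS` are illustrative assumptions only; the real call graph is the one in the article's figure.

```python
from graphlib import TopologicalSorter

# service -> services it depends on (illustrative edges, not the real graph)
EXAMPLE_DEPS = {
    "Search Planner": {"Query Planner", "Rank Service", "Heaven Ask 3"},
    "Query Planner": set(),
    "Rank Service": set(),
    "Heaven Ask 3": set(),
}


def upgrade_waves(deps):
    """Group services into waves; a wave can start once all earlier waves are done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Every service whose dependencies have all been upgraded already.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Services inside one wave do not depend on each other, so in principle they can be upgraded in parallel.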

Stateless service upgrade

Entering the upgrade process proper, we first look at the parts of the search chain that are designed as stateless services, such as the Java microservices that handle business logic and Search Planner, which handles query logic. Their common feature is that once a request has been processed, the resources associated with it are released. Different requests have no dependencies on one another and no ordering requirements, and all machine nodes within the same stateless service are fully equivalent.

These characteristics make stateless services easy to scale dynamically by adding or removing nodes. As long as compatibility is preserved, their upgrade process is therefore generic and simple:

  1. Determine the number of batches from the minimum availability the service must maintain.

  2. Select a batch of containers to update and take them out of service.

  3. Upgrade the containers in the batch and update their image.

  4. Wait for this batch of containers to resume serving, then move on to the next batch.
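The batching in step 1 can be sketched as follows. This is a simplified illustration under stated assumptions: availability is expressed as a whole-number percentage of containers that must stay in service, and `upgrade_one` is a hypothetical callback that takes one container offline, updates its image, and returns once it is healthy again.

```python
def plan_batches(containers, min_available_percent):
    """Split containers into batches so enough of them always stay in service."""
    if not 0 <= min_available_percent < 100:
        raise ValueError("min_available_percent must be in [0, 100)")
    total = len(containers)
    # Ceiling division in integers: containers that must keep serving.
    must_stay = -(-total * min_available_percent // 100)
    batch_size = total - must_stay
    if batch_size < 1:
        raise ValueError("not enough containers to take any offline")
    return [containers[i:i + batch_size] for i in range(0, total, batch_size)]


def rolling_upgrade(containers, min_available_percent, upgrade_one):
    """Upgrade batch by batch; the next batch starts only after this one is back."""
    for batch in plan_batches(containers, min_available_percent):
        for container in batch:
            upgrade_one(container)
```

For ten containers and a 70% availability floor, this yields batches of at most three containers.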

In general, a service can be made stateless by moving its state into a message queue, a cache, a database, or other external middleware. The benefits of a stateless design are clear: no extra machine resources are needed during an upgrade, upgrades are fast, and the cost of a change is small, so frequent iteration is practical. However, this design adds overhead to every state read and update, so it may not be suitable in some performance-sensitive scenarios.

Stateful service upgrade

Next we turn to the stateful part. The difficulty with upgrading stateful services is that state storage, recovery, and migration are usually designed separately by each service for its own situation (or not designed at all), which makes upgrades harder. Here are a few relatively common options for upgrading stateful services.

  • Have the access-layer gateway support hot updates (for example, Nginx), so that state maintenance is isolated in the access layer. This suits scenarios where state must be kept for a long time.

  • Progressive update: new requests are gradually switched to the new service, and the old service is destroyed once it has finished processing its in-flight requests. This suits services that keep state only briefly (for example, game services and real-time audio and video communication services).

  • Create a new copy of the service, keep the old and new service state consistent through dual writes, and gradually replace the old service with the new one.

In the Idle Fish search architecture, the search engine presents itself as a stateless service, but internally it stores state such as index partitions and incremental progress. The upgrade plan we finally adopted is as follows:

  1. Create a completely independent new engine from the new version of the image.

  2. Synchronize the full data set from the old engine to the new engine.

  3. Send incremental data to both the old and the new engine.

  4. Bring the new engine online and gradually increase the share of traffic it serves.

  5. Once the old engine no longer serves any traffic, take it offline.
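Steps 2 and 3 can be pictured with a toy model. The sketch below is a single-threaded simplification that follows the order of the steps above. `InMemoryEngine` is a hypothetical stand-in for an engine cluster, not the Heaven Ask 3 API, and a real system would also have to handle writes that arrive while the full sync is running.

```python
class InMemoryEngine:
    """Hypothetical stand-in for a search engine cluster."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def apply(self, doc_id, doc):
        # doc=None models a delete; anything else is an upsert.
        if doc is None:
            self.docs.pop(doc_id, None)
        else:
            self.docs[doc_id] = doc


def full_sync(old, new):
    """Step 2: copy every existing document into the new engine."""
    for doc_id, doc in old.docs.items():
        new.apply(doc_id, doc)


class DualWriter:
    """Step 3: send each incremental update to every engine."""

    def __init__(self, *engines):
        self.engines = engines

    def write(self, doc_id, doc):
        for engine in self.engines:
            engine.apply(doc_id, doc)
```

Once both engines hold the same data and receive the same updates, traffic can be moved between them in either direction, which is what makes steps 4 and 5 reversible until the old engine is removed.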

Compared with upgrading a stateless service, this approach not only roughly doubles the machine resources in use during the upgrade, it also requires complex and tedious service configuration every time. If the service itself is not stateless, we additionally need to implement traffic-switching logic that ensures requests from the same user always land on the same cluster. The overall cost of the upgrade is high, so this approach only suits services that are updated very rarely. If a service is updated more often, a cheaper upgrade scheme should be designed around its actual situation.
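A common way to keep a user on one cluster while the traffic share changes is to hash the user ID into a fixed number of buckets. The sketch below is one possible approach, not necessarily the one used by Idle Fish; the function names and the 100-bucket granularity are assumptions for illustration.

```python
import hashlib


def bucket_of(user_id: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets)."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would not be stable across servers.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def pick_engine(user_id: str, new_engine_percent: int) -> str:
    """Route the user to the new engine if their bucket is below the cutoff."""
    return "new" if bucket_of(user_id) < new_engine_percent else "old"
```

Because a user's bucket never changes, raising `new_engine_percent` only moves users from the old engine to the new one and never back, and every server makes the same decision for the same user.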

Service discovery

The service discovery mechanism plays an important role during an upgrade. It provides the following capabilities:

  • Ensuring distributed consistency

  • Gracefully bringing services online and taking them offline

  • Load balancing

  • Traffic control and request degradation

  • Priority scheduling within the same data center

  • Cross-data-center disaster recovery scheduling

Service discovery is the master valve for traffic control. A mature and stable service discovery mechanism not only avoids the jitter in request success rate that releases can cause, it also makes it possible to roll back quickly and stop the bleeding when something goes wrong.
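As a rough illustration of how weights in service discovery act as that valve, the sketch below models a registry that picks a node in proportion to its weight. The `Registry` class is a hypothetical toy, not the API of any real service discovery system.

```python
import random


class Registry:
    """Toy service registry with weighted node selection."""

    def __init__(self):
        self._weights = {}

    def register(self, address, weight=100):
        self._weights[address] = weight

    def set_weight(self, address, weight):
        # Weight 0 drains a node: it stays registered but gets no new requests.
        if address not in self._weights:
            raise KeyError(address)
        self._weights[address] = weight

    def pick(self):
        """Choose a node at random, proportionally to its weight."""
        live = [(a, w) for a, w in self._weights.items() if w > 0]
        if not live:
            raise LookupError("no node is accepting traffic")
        addresses = [a for a, _ in live]
        weights = [w for _, w in live]
        return random.choices(addresses, weights=weights, k=1)[0]
```

Setting a node's weight to 0 before upgrading it, and restoring the weight afterwards, is the graceful offline and online pattern in miniature.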

Risk prevention and control

Upgrading, mounting, and switching traffic for every cluster on the search chain in dependency order is undoubtedly a high-risk operation, and a small mistake can cause an online failure. We therefore organized the upgrade process around Alibaba's safe production principles:

  • Monitorable: make sure in advance that the key metrics of the key links are covered by monitoring, such as total request count, request success rate, and request response time, so that significant problems are detected promptly.

  • Gray-releasable: no change may go online without a gray release. For stateless services, we usually run the gray release by adjusting service discovery weights or the proportion of upgraded machines. For cases that cannot be gray-released at random, we designed a mechanism that ramps up traffic in batches of users.

  • Rollback-capable: the change system provides a generic one-click rollback, but it is not the fastest way to recover. In many cases, before executing a change we prepare to remove or remount the affected machines or clusters in service discovery, so that the time from detecting a problem to recovery is on the order of seconds.
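The three principles fit together as a simple loop: raise the traffic share one step, check the monitored metric, and shift traffic away at the first sign of trouble. The sketch below is a hedged illustration; `set_new_weight` and `success_rate` are hypothetical callbacks standing in for the service discovery and monitoring systems, and the threshold is an arbitrary example value.

```python
def gray_release(steps, set_new_weight, success_rate, min_success_rate=0.999):
    """Ramp traffic to the new version step by step, rolling back on failure.

    steps: increasing traffic percentages for the new version, e.g. [1, 10, 50, 100].
    Returns True if every step passed, False if a rollback was triggered.
    """
    for percent in steps:
        set_new_weight(percent)
        # A real rollout would wait for a full monitoring window here
        # before trusting the metric.
        if success_rate() < min_success_rate:
            # Fast rollback: move all traffic off the new version.
            set_new_weight(0)
            return False
    return True
```

Rolling back by changing a weight rather than redeploying is what keeps recovery time short in this model.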

Summary

In summary, the principles and process for upgrading a complex system without downtime are as follows:

  1. Decouple and isolate services so that the scope and impact of any single upgrade stay under control.

  2. Determine the service upgrade order from compatibility and dependencies.

  3. Choose the upgrade method according to whether the service is stateless.

  4. Prepare monitoring and rollback plans in advance, and upgrade through gray releases.

The upgrade of the Idle Fish search service took two months from start to finish. Throughout, users noticed nothing and the online services ran steadily, and the normal development work of the algorithm team and the other engineering teams working alongside us was not affected.

We also ran into many problems of detail during execution. For example, we could not reasonably estimate the resource budget in advance when creating new services, so budget had to be borrowed repeatedly during the upgrade, robbing Peter to pay Paul. Another example is the latency introduced by the geo-distributed active-active deployment, which forces services to remain unitized and made traffic control during the upgrade considerably harder. The problems exposed this way also give us direction for continuing to improve the architecture and our solutions.

Author: Idle Fish Technology
Reprinted from the official account
Link: https://mp.weixin.qq.com/s/Vc_x-08dGBqM0Ka8vrSy6g

Copyright notice
This article was created by [The wind and waves are as calm as a yard]. Please include a link to the original when reprinting. Thank you.
https://chowdera.com/2020/12/20201206145706148j.html