
Sharing experience in stability assurance for scenario-based systems

2020-12-07 19:19:19 Aliyun yunqi

Every Double 11, making sure the system can carry the traffic peak and stay stable over the long term is a problem every major promotion participant must face. Before this year's Double 11, Alibaba Cloud held an offline exchange in Shanghai, where Alibaba's stability assurance leads for big promotions, middleware experts, solution experts, and others shared the promotion experience they have accumulated over the years. We have selected some of the highlights below.


I. Observations and reflections on stability building in the Internet industry

The first speaker was Jiang Yi, a senior solution architect on Alibaba Cloud's East China Internet team. He has more than ten years of software development experience and in recent years has worked on cloud computing development and architecture, leading the development and construction of several cloud platforms and PaaS platforms. He has an in-depth understanding of, and practical experience with, cloud and Internet architecture, and currently focuses on containers, middleware, Serverless, and related areas.


Jiang noted in his talk that this year's news has carried many reports of major outages, and the causes are very typical: someone deleting the database and walking away, attacks, a lack of capacity planning or elasticity, system changes, and so on. The consequences of these outages are serious. One SaaS provider suffered direct economic losses of more than 20 million, and its market value fell by 10 billion that day; the market value of a new energy vehicle manufacturer dropped by nearly tens of billions of dollars on the day of its network outage. The stock price may recover, but the damage to consumer confidence and to the brand is hard to undo in a short time.

As for the current state of stability building in the industry, many enterprises have accumulated a large debt. Some smaller or more traditional companies have probably made no high-availability preparations at all, and even medium and large companies still have gaps in their stability work.


Stability work is hard to see, hard to get recognized, and hard to judge objectively. Having no incidents may simply be luck, and even when an incident does occur, the stability work may still have been done well and may have prevented ten other major incidents. We therefore need methods for evaluating stability building qualitatively, and even quantitatively, so that the work has goals, the process can be tracked, and the results can be verified. We have made some explorations and attempts in this direction.

Here we propose a tentative maturity model for stability building. Starting from 11 dimensions, it offers two ways to evaluate maturity. One is the radar mode, which combines the scores of the 11 indicators into an overall score. The other is the tiered mode, in which each dimension is scored from 0 to 4 according to how complete the work is. We hope that every company reaches at least the basic level, that medium and large companies reach the developing level, and that industry leaders reach the mature level.

The maturity model itself is of course not perfect. We are putting it forward now for reference and discussion and will keep refining it. Beyond giving everyone a reasonable evaluation method, we hope to analyze the overall level of the industry so that each company knows where its own stability work stands and can set reasonable goals.


Next, a quick introduction to some ideas behind stability building. At its core, stability work is the process of discovering and eliminating risks. Those risks come from latent defects left in the system and its products, from the decay caused by long-term operation, from new feature releases and system upgrades, and from events such as big promotions. The job of stability work is to keep these risks under control.


There is also a powerful tool for assurance: a stability system built on Alibaba Cloud. Alibaba Cloud provides full-link stability products and solutions, from resources through to methodology. Some of our customers who lead their industries, with only a small number of SRE engineers, rely on Alibaba Cloud's high-availability capabilities to deliver efficient, stable, and complete system support.


II. The evolution of e-commerce high-availability architecture and experience in big-promotion assurance

The second speaker was Zhongting, a senior technical expert on Alibaba's high-availability architecture team, where he leads the disaster recovery and fault drill team. He joined Alibaba in 2011, served as a Double 11 lead in 2015, and is currently responsible for high-availability assurance across the Alibaba economy and for exporting these capabilities as commercial products.


According to Zhongting, high-availability technology is currently exported through two cloud services: PTS (Performance Testing Service) and AHAS (Application High Availability Service). Inside Alibaba, preparing for a Double 11 is a highly complex mega-project; when the business is particularly complex, it can involve dozens or even hundreds of horizontal and vertical sub-projects. From a technical point of view, however, the problems a big promotion must solve come down to capacity, architecture, and organization. Around these three questions, he reviewed how high-availability technology has evolved and presented a cloud-based high-availability solution:

1. A faithful replica of Alibaba's full-link stress testing
(1) Retrofitting the base stress-testing environment gives the online production environment the ability to run both read and write stress tests;

(2) Accumulate baseline stress-test data and experience with business traffic models; afterwards, purchasing PTS resource packages lets you keep running full-link stress tests on a routine basis;

(3) Major events can be rehearsed easily in advance, so that preparation and response happen ahead of time.


2. Traffic protection
Provide comprehensive availability protection for business systems at two levels, gateway protection and application protection, and across multiple dimensions: entry point, application, inter-application calls, and single-machine load. This improves system availability and includes low-cost onboarding, full-scenario protection, multi-language support, and protection that takes effect within seconds.


3. Multi-site active-active
A flexible solution = customized technical products + consulting services + ecosystem partners.


4. Fault drills
Professional chaos engineering technology and solutions: following the experimental principles of chaos engineering and incorporating Alibaba's internal practice, the service provides a rich set of fault scenarios that help distributed systems improve fault tolerance and recoverability. It includes a rich drill library (basic resources, applications, cloud products), scenario-based drills (strong and weak dependencies, messaging, databases, and so on), and enterprise-level practice (red-blue attack and defense, financial-loss drills, and so on).


III. Best practices and solutions for seckill (flash sales)

The third speaker was Lu Xuan, an Alibaba Cloud Intelligence solution architect. He has developed and maintained large distributed systems, has many years of experience in cloud computing and cloud native, and has rich experience in architecture selection, troubleshooting, and performance tuning. He is committed to helping Alibaba Cloud customers across industries realize business value by moving to cloud-native architecture.


First, consider the seckill business flow. It is relatively simple: in general, the user places an order and the inventory is deducted.


The design principles of a seckill system include the following:
1. Hotspot identification
Collect information in advance through marketing campaigns, separate seller sign-ups, and similar channels, so that hot items are known before the event.

2. Isolation
Isolate seckill traffic at the front-end page, the application layer, and the data layer.

3. Intercept requests as far upstream as possible

Traditional seckill systems fail because every request goes all the way down to the back-end data layer, where read-write lock contention is severe. Under high concurrency responses slow down and almost all requests time out. Although traffic is heavy, the effective traffic that results in a successful order is very small. For example, if an item has only 1,000 units in stock and 1 million people try to buy it, the vast majority of requests accomplish nothing.

4. Use caching for read-heavy, write-light scenarios

Seckill is a typical read-heavy, write-light scenario. For example, if an item has only 1,000 units in stock and 1 million people try to buy it, at most 1,000 of them place an order successfully; everyone else is only querying the inventory. Writes account for just 0.1% of requests and reads for 99.9%, which makes this workload very well suited to caching.


In a seckill scenario, the following points need to be considered at the architecture level:

1. Inventory cache

Redis is the main component handling inventory deduction during the promotion. The item ID is used as the Redis key, and the available inventory (total inventory minus reserved inventory) is stored as the value. Because Redis executes a Lua script atomically, a script can implement the "read the remaining stock, then deduct" logic inside Redis without other clients interleaving.
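The "read, then deduct" logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the speaker's implementation: the Lua script, its return codes (1, 0, -1), and the `InMemoryInventory` class are hypothetical. The class is an in-process stand-in that mirrors the script so the logic can be exercised without a Redis server; against a real Redis, the script would be sent with the `EVAL` command.

```python
import threading

# Hypothetical Lua script that could be sent to Redis with EVAL.
# Redis runs a script atomically, so the read and the deduction
# cannot interleave with commands from other clients.
DEDUCT_STOCK_LUA = """
local stock = tonumber(redis.call('GET', KEYS[1]))
if not stock then return -1 end          -- item was never loaded into the cache
local qty = tonumber(ARGV[1])
if stock < qty then return 0 end         -- sold out or not enough stock
redis.call('DECRBY', KEYS[1], qty)
return 1                                 -- deduction succeeded
"""


class InMemoryInventory:
    """In-process stand-in that mirrors the script's logic for local testing."""

    def __init__(self):
        self._stock = {}
        # The lock plays the role of Redis's atomic script execution.
        self._lock = threading.Lock()

    def load(self, item_id, total, reserved=0):
        # Available stock = total inventory - reserved inventory.
        with self._lock:
            self._stock[item_id] = total - reserved

    def deduct(self, item_id, qty=1):
        # Returns 1 on success, 0 if stock is insufficient, -1 if the item is unknown.
        with self._lock:
            stock = self._stock.get(item_id)
            if stock is None:
                return -1
            if stock < qty:
                return 0
            self._stock[item_id] = stock - qty
            return 1

    def remaining(self, item_id):
        with self._lock:
            return self._stock.get(item_id)


def simulate_rush(inventory, item_id, buyers):
    """Fire `buyers` concurrent single-unit purchases; return how many succeeded."""
    results = []

    def buy():
        results.append(inventory.deduct(item_id))

    threads = [threading.Thread(target=buy) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(1 for r in results if r == 1)
```

Because the check and the decrement happen under one atomic step, the stock can never go negative however many buyers arrive at once, which is the property the Lua script is meant to provide.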

2. Capacity planning

Use Alibaba Cloud's performance testing tool PTS to simulate real user requests from across the country, verifying server performance, capacity, and system stability under realistic business operations and ensuring smooth support for major events.

3. Performance tuning

Use the multi-layer monitoring that ARMS provides to watch application and host metrics in real time during stress testing, helping developers locate and troubleshoot problems quickly and improve system performance.


4. Rate limiting and anti-abuse

Use Alibaba Cloud Application High Availability Service (AHAS) for rate limiting and degradation, so that unexpected traffic cannot take the system down. Hotspot rules can also be configured: once a threshold is exceeded, the system makes traffic for the hot item wait in a queue. For example, when calls to buy the same item exceed 100 requests within 1 second, the remaining requests wait.
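To make the hotspot rule concrete, here is a minimal sketch of a per-item fixed-window counter using the 100-requests-per-second example from the text. It does not use or reproduce the AHAS API; the `HotspotLimiter` class, its method names, and the `"pass"`/`"wait"` return values are assumptions made for illustration, and the clock is passed in explicitly so the behavior is deterministic.

```python
class HotspotLimiter:
    """Per-item fixed-window counter (hypothetical sketch, not the AHAS API)."""

    def __init__(self, threshold=100, window_seconds=1.0):
        self.threshold = threshold
        self.window = window_seconds
        self._windows = {}  # item_id -> (window_start, request_count)

    def check(self, item_id, now):
        """Return "pass" if the request may proceed, "wait" if it should queue."""
        start, count = self._windows.get(item_id, (now, 0))
        if now - start >= self.window:
            # The previous window has expired; start a fresh one.
            start, count = now, 0
        count += 1
        self._windows[item_id] = (start, count)
        return "pass" if count <= self.threshold else "wait"
```

Each item ID has its own window, so a burst on one hot item does not slow down requests for other items. A production limiter would more likely use a sliding window or token bucket and would need to be thread-safe; the fixed window is chosen here only because it is the shortest correct illustration.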


5. Asynchronous decoupling, peak shaving and valley filling

Message Queue for Apache RocketMQ is a distributed messaging middleware that Alibaba Cloud builds on Apache RocketMQ, offering low latency, high concurrency, high availability, and high reliability. It provides asynchronous decoupling and peak-shaving capabilities for distributed application systems, and also has the features Internet applications need, such as massive message accumulation, high throughput, and reliable retries.
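The peak-shaving idea can be illustrated with a small simulation: a burst of orders is buffered in a queue and the downstream service drains it at its own fixed capacity. This is a sketch under assumed numbers, with an in-memory `deque` standing in for the message queue; it does not use the RocketMQ client API, and `shave_peak` is a hypothetical helper.

```python
from collections import deque


def shave_peak(arrivals_per_tick, capacity_per_tick):
    """Simulate a queue that buffers bursts in front of a fixed-capacity consumer.

    Returns (processed_per_tick, backlog_per_tick). The simulation keeps
    running after the last arrival until the backlog is fully drained.
    """
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")

    backlog = deque()
    processed, backlog_sizes = [], []
    order_id = 0
    tick = 0
    while tick < len(arrivals_per_tick) or backlog:
        arriving = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arriving):
            backlog.append(order_id)  # producer side: enqueue and return at once
            order_id += 1
        done = 0
        while backlog and done < capacity_per_tick:
            backlog.popleft()         # consumer side: handle at its own pace
            done += 1
        processed.append(done)
        backlog_sizes.append(len(backlog))
        tick += 1
    return processed, backlog_sizes
```

For example, a burst of 500 orders in one tick against a consumer that handles 200 per tick is spread over three ticks: the downstream never sees more than 200 per tick, at the cost of some orders waiting in the queue.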


6. Elasticity

Users who run promotions periodically can deploy applications quickly with Serverless App Engine (SAE) and use scheduled scaling: capacity expands automatically before the event starts and shrinks automatically afterwards, releasing resources. This maximizes resource utilization and requires no manual intervention.
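Scheduled scaling boils down to a time-based rule for the desired instance count. The function below is a hypothetical sketch of such a rule, not the SAE configuration format or API; the baseline, peak, warm-up, and cool-down values are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def desired_replicas(now, event_start, event_end, baseline=2, peak=20,
                     warmup=timedelta(minutes=30),
                     cooldown=timedelta(minutes=15)):
    """Return the instance count a scheduled-scaling rule would request.

    Capacity is raised `warmup` before the event begins and held until
    `cooldown` after it ends; outside that window the baseline applies.
    """
    if event_start - warmup <= now < event_end + cooldown:
        return peak
    return baseline
```

Scaling out before the event rather than reacting to load matters for seckill, because the traffic peak arrives within seconds of the start time and reactive scaling would lag behind it.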


IV. Best practices for full-link stress testing

The fourth speaker was Ji Yuan, an Alibaba Cloud architect with 12 years of experience in the IT industry. In the energy industry and in Internet ToB businesses he has lived through and practiced the transitions from SOA to microservice architecture to cloud-native architecture. He has a deep understanding of, and practical experience with, cloud-native architecture for Internet systems, microservice governance, and high-availability optimization, and has helped Alibaba Cloud's industry customers complete full cloud-native transformations of their system architecture many times.


According to Ji Yuan, big promotions and seckill events are the best way to capture as much of the traffic dividend as possible, yet many enterprises still cannot benefit from it. There is one root cause: their systems cannot withstand the impact of heavy traffic. The core issue is that most system performance problems stem from factors that cannot be predicted.

A system has many links from front to back, and any one of them can become the bottleneck, the weak point, or the constraint of the whole. Different communication protocols, different data formats, and different conventions make the overall distributed architecture extremely complex. In addition, in a microservice architecture the call chains are very long in both the north-south and east-west directions, so once a single service fails it is easy to trigger a "domino" or "avalanche" effect.


Most products today present users with a single entry point, a single app, but the content behind it is composed of multiple product lines that work together. In practice, the teams responsible for different modules and product lines each have their own testing team, responsible only for the quality of one module or product line. When these modules are combined, further problems arise from integration and collaboration; seeing one spot does not tell you what the whole leopard looks like. These uncertainties pose huge challenges to user experience, brand reputation, and product revenue.

To solve the fundamental problem, we need a means of identifying as many of these uncertain factors as possible. A system goes through two kinds of scenarios over its life cycle: the instantaneous traffic-peak scenario and the long-term steady-state scenario.
1. Instantaneous traffic-peak scenario

This scenario corresponds to big promotions and seckill events. We can run full-link stress tests in the production environment, simulate real user traffic as closely as possible, and keep pushing the load up until the system's performance constraint is found and optimized; then the process is repeated. There are two key points: first, the traffic source should resemble real user traffic; second, the stress test should run in the production environment. Together they amount to staging a real big-promotion event in order to expose the system's uncertainties.
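The "push up, find the constraint, optimize, repeat" loop can be sketched as a step-wise ramp. This is only an illustration: `find_capacity` is a hypothetical helper, and `toy_system` is an assumed latency model standing in for a real stress-test measurement, which in practice would come from a tool such as PTS.

```python
def find_capacity(measure_latency_ms, slo_ms,
                  start_rps=100, step_rps=100, max_rps=10_000):
    """Raise load step by step; return the highest rate that still met the SLO."""
    last_ok = 0
    rps = start_rps
    while rps <= max_rps:
        if measure_latency_ms(rps) > slo_ms:
            break  # a performance constraint has been reached
        last_ok = rps
        rps += step_rps
    return last_ok


def toy_system(bottleneck_rps):
    """Hypothetical model: latency is flat up to the bottleneck, then climbs steeply."""
    def latency(rps):
        if rps <= bottleneck_rps:
            return 50.0
        return 50.0 + (rps - bottleneck_rps) * 2.0
    return latency
```

Running `find_capacity` once gives the current ceiling; after the bottleneck is optimized (modelled here by calling `toy_system` with a larger `bottleneck_rps`), running it again shows the new ceiling, which is one iteration of the loop described above.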

2. Long-term steady-state scenario

Turn the full-link stress testing approach into a fixed routine: through a unified console, run periodic fault drills and verify the quality of releases and configuration changes. In this way the traffic-peak scenario identifies as many uncertain factors as possible, the long-term steady-state scenario makes checking for them routine, and analyzing and resolving those factors then improves system stability and availability.


On the load-generation side, Alibaba Cloud PTS uses edge nodes and CDN nodes across the country to simulate traffic from different regions and carriers. It can launch traffic at the scale of hundreds of thousands within a short period, and the regions and carriers can be set dynamically. The PTS console gives customers a visual way to create business scenarios easily. It also integrates the native JMeter engine, so JMeter scripts can be imported quickly and existing stress-testing tools can be migrated seamlessly.

On the traffic-isolation side, Alibaba Cloud provides a non-intrusive Agent that gives the business system traffic-isolation capability without any code changes. Through the PTS traffic-control console, users configure interface mock rules, shadow-table rules, and stress-test data offsets, and the Agent then isolates stress-test traffic and stress-test data accordingly.
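The three kinds of rules mentioned above (interface mocks, shadow tables, data offsets) can be illustrated with a small routing sketch. Everything named here is an assumption for illustration: the `x-stress-test` header, the `shadow_` table prefix, the ID offset value, and the helper functions are hypothetical and do not reflect how the PTS Agent is actually configured or implemented.

```python
STRESS_FLAG_HEADER = "x-stress-test"  # hypothetical marker carried by test traffic
SHADOW_PREFIX = "shadow_"             # hypothetical shadow-table naming rule
ID_OFFSET = 9_000_000_000             # hypothetical offset applied to test data IDs


def is_stress_traffic(headers):
    """Stress-test requests are identified by a marker propagated along the call chain."""
    return headers.get(STRESS_FLAG_HEADER, "").lower() == "true"


def route_table(table, headers):
    """Shadow-table rule: test writes go to a shadow table, never the real one."""
    return SHADOW_PREFIX + table if is_stress_traffic(headers) else table


def offset_id(record_id, headers):
    """Offset rule: shift test record IDs out of the range used by real data."""
    return record_id + ID_OFFSET if is_stress_traffic(headers) else record_id


def call_downstream(name, headers, real_call, mock_rules):
    """Mock rule: test traffic gets a canned response for configured interfaces."""
    if is_stress_traffic(headers) and name in mock_rules:
        return mock_rules[name]
    return real_call()
```

The point of all three rules is the same: stress-test traffic runs through the real production code paths, while its side effects (writes, payments, notifications) are diverted so that real user data is never touched.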



Alibaba has now moved fully onto cloud-native infrastructure on the cloud. Through large-scale use of cloud-native products including Container Service ACK, Message Queue RocketMQ, the microservice platform EDAS, the monitoring service ARMS, and the performance testing service PTS, it has gained dividends in cost, stability, and R&D and operations efficiency. At the same time, the Double 11 business scenario has become a proving ground for Alibaba Cloud's cloud-native technology and products, creating greater value for Alibaba Cloud customers.


This article is original content from Alibaba Cloud and may not be reproduced without permission.