Sharing experience in system stability assurance for promotion scenarios
2020-12-07 19:19:19 【Aliyun yunqi】
Every Double 11, every team running a major promotion has to face the same question: how to make sure the system withstands the traffic peak and stays stable over the long term. Ahead of this year's Double 11, Alibaba Cloud held an offline exchange in Shanghai, where Alibaba's heads of promotion and stability assurance, middleware experts, and solution experts shared their experience from years of major promotions with the participants. We have selected some of the highlights below.
One. Observations and reflections on stability work in the Internet industry
The first speaker was Jiang Yi, a senior solution architect on Alibaba Cloud's East China Internet team. He has more than ten years of software development experience and in recent years has worked on cloud computing development and architecture, leading the development and construction of several cloud platforms and PaaS platforms. He has in-depth understanding and hands-on experience of cloud and Internet architecture, and currently focuses on containers, middleware, Serverless, and related areas.
In his talk, Jiang noted that this year's news carried many reports of major outages, and that the causes were typical: an employee deleting the database and disappearing, attacks, a lack of capacity planning or elasticity, system changes, and so on. The consequences of such outages are serious. One SaaS provider, for example, suffered direct economic losses of more than 20 million and saw its market value fall by 10 billion that day; a new energy vehicle manufacturer lost close to tens of billions of dollars in market value on the day of its network outage. Share prices may recover, but the damage to consumer confidence and to the brand is hard to undo in a short time.
As for the current state of stability work in the industry, many companies carry a lot of debt. Some smaller or more traditional companies probably have no high availability preparation at all, and even medium and large companies still have clear gaps.
Stability work is hard to see, hard to get recognized, and hard to judge objectively. A period with no incidents may simply be luck, and even when an incident does happen, the stability work may still have been done well and have prevented ten other major incidents. We therefore need methods for evaluating stability work qualitatively and even quantitatively, so that the work has goals, the process can be tracked, and the results can be verified. We have made some explorations and attempts in this direction.
Here we propose a tentative maturity model for stability work. It covers 11 dimensions and offers two ways of evaluating maturity. One is the radar mode, in which the scores of the 11 indicators are combined into an overall score. The other is the tiered mode, in which each dimension is scored from 0 to 4 according to how complete the work in that dimension is. We hope that every company is at least above the basic level, that medium and large companies can reach the developing level, and that industry leaders can reach the mature level.
Of course, the maturity model itself is not yet complete. We are putting it forward now for reference and discussion and will keep refining it. Beyond giving you a reasonable evaluation method, we hope to analyze the overall level of the industry, so that each company knows where its own stability work stands and can set reasonable goals.
Next, a quick overview of some ideas behind stability work. In essence, stability work is the process of discovering and eliminating risks. The risks come from latent defects in the system and products themselves, from decay after long-term operation, from new feature releases and system upgrades, and from events such as major promotions. The job of stability work is to keep these risks under control.
Another powerful tool for assurance is to build the stability system on Alibaba Cloud, which provides full-link stability products and solutions, from resources to methodology. Some of our leading customers in the industry, with only a handful of SRE engineers, rely on Alibaba Cloud's high availability capabilities to deliver efficient, stable, and complete system support.
Two. The evolution of e-commerce high availability architecture and assurance experience
The second speaker was Zhongting, a senior technical expert on Alibaba's high availability architecture team, where he leads the disaster recovery and fault drill team. He joined Alibaba in 2011 and served as a Double 11 lead in 2015. He is currently responsible for high availability assurance across the Alibaba economy and for delivering that capability as commercial products.
According to Zhongting, Alibaba's high availability technology is currently offered externally through two cloud services: PTS (Performance Testing Service) and AHAS (Application High Availability Service). Inside Alibaba, preparing for a Double 11 is a very complex super project. When the business is especially complex, it can involve dozens or even hundreds of horizontal and vertical projects. From a technical standpoint, though, the problems a big promotion has to solve come down to capacity, architecture, and organization. Around these three problems, he reviewed the history of high availability technology choices and presented a cloud-based high availability solution:
1. A faithful replica of Alibaba's full-link stress testing
(1) By adapting the underlying environment for stress testing, you gain the ability to run read and write stress tests against the online production environment;
(2) You accumulate baseline stress-test data and experience with business traffic models, and can later purchase PTS resource packages to keep running full-link stress tests on a regular basis;
(3) Major events can be rehearsed in advance with little effort, so that preparation and response plans are in place beforehand.
2. Traffic protection
Traffic protection provides comprehensive availability protection for business systems. It works at two levels, gateway protection and application protection, and across several dimensions (entry point, application, inter-application calls, and single-machine load) to improve system availability. Its features include low-cost onboarding, comprehensive protection, multi-language support, and protection that takes effect within seconds.
3. Multi-site active-active
The solution = customized technical products + consulting services + ecosystem partners.
4. Fault drills
Professional chaos engineering technology and solutions: following the experimental principles of chaos engineering and incorporating Alibaba's internal practice, the service provides a rich set of fault scenarios to help distributed systems improve fault tolerance and recoverability. It includes a rich drill library (basic resources, applications, cloud products), scenario drills (strong and weak dependencies, messaging, databases, and so on), and enterprise-level practices (red-blue attack and defense exercises, financial-loss drills, and so on).
Three. Seckill (flash sale) best practices and solutions
The third speaker was Lu Xuan, an Alibaba Cloud Intelligence solution architect. He has built and maintained large distributed systems, has many years of experience in cloud computing and cloud native, and has extensive experience in architecture selection, troubleshooting, and performance tuning. He is committed to helping Alibaba Cloud customers across industries realize business value through cloud native architecture transformation.
First, consider the seckill business flow. It is fairly simple: in general, a user places an order and inventory is deducted.
The design principles of a seckill system include the following:
1. Hotspot identification
Collect information in advance through marketing campaigns, seller sign-ups, and similar channels.
2. Isolation
Isolate at the front-end page, the application layer, and the data layer.
3. Intercept requests as far upstream in the system as possible
Traditional seckill systems go down because every request reaches the back-end data layer. Read-write lock contention becomes severe, responses slow down under high concurrency, and almost all requests time out. Traffic is high, but the effective traffic that results in successful orders is tiny. For example, if an item has only 1,000 units in stock and 1,000,000 people try to buy it, the effective rate of the vast majority of requests is 0.
4. Use caching for read-heavy, write-light scenarios
Seckill is a typical read-heavy, write-light scenario. For example, if an item has only 1,000 units in stock and 1,000,000 people try to buy it, at most 1,000 people place an order successfully and everyone else is only querying inventory. Writes account for just 0.1% and reads for 99.9%, which makes the scenario very well suited to caching.
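To make the read/write arithmetic and the caching idea concrete, here is a minimal Python sketch. It is not taken from the talk: the class name StockCache, the load_from_db callback, and the one-second TTL are all assumptions chosen for illustration. It shows a cache-aside read in which repeated inventory queries are served from memory and the backing store is consulted only when the cached value expires.

```python
import time

# Read/write split from the example in the text.
STOCK = 1_000
BUYERS = 1_000_000
write_ratio = STOCK / BUYERS      # at most 1,000 successful orders
read_ratio = 1 - write_ratio      # everyone else only reads inventory


class StockCache:
    """Cache-aside read of remaining stock with a short TTL.

    Hypothetical helper: `load_from_db` stands in for whatever slow
    backing store actually holds the inventory.
    """

    def __init__(self, load_from_db, ttl_seconds=1.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0
        self.db_reads = 0  # how many times the backing store was hit

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache miss or expired entry: reload once, then serve from memory.
            self._value = self._load()
            self.db_reads += 1
            self._expires_at = now + self._ttl
        return self._value
```

With a one-second TTL, any number of inventory queries arriving within the same second cost a single read of the backing store, which is why a 99.9% read workload benefits so much from caching.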
In a seckill scenario, several things need to be considered at the architecture level:
1. Inventory cache
Redis bears the main load of inventory deduction during the promotion. The item ID is used as the Redis key, and the available stock (total stock minus reserved stock) is stored as the value. The atomic execution of Lua scripts is used to implement the "read the remaining stock, then deduct" logic inside Redis.
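The "read the remaining stock, then deduct" step can be sketched as follows. This is an illustrative script, not the one used in the talk. The return codes (-1, 0, 1) and the function name deduct_stock are assumptions. The client argument is expected to behave like a redis-py client, whose eval(script, numkeys, *keys_and_args) method runs a Lua script atomically on the server.

```python
# Lua script executed atomically by Redis: no other command can run
# between the GET and the DECRBY.
DEDUCT_STOCK_LUA = """
local raw = redis.call('GET', KEYS[1])
if not raw then
    return -1              -- item has no stock entry
end
local stock = tonumber(raw)
local qty = tonumber(ARGV[1])
if stock < qty then
    return 0               -- not enough stock left
end
redis.call('DECRBY', KEYS[1], qty)
return 1                   -- deduction succeeded
"""


def deduct_stock(client, item_id, qty=1):
    """Try to deduct `qty` units for `item_id`.

    Returns 1 on success, 0 if stock is insufficient, -1 if the key
    is missing (return codes are an assumption of this sketch).
    """
    key = str(item_id)  # the item ID is the Redis key, as in the text
    return client.eval(DEDUCT_STOCK_LUA, 1, key, qty)
```

Because the check and the decrement happen inside one script, two concurrent buyers cannot both observe the last unit and both deduct it, which is the overselling problem the text is guarding against.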
2. Capacity planning
Use Alibaba Cloud's performance testing tool PTS to simulate real user requests from across the country and verify server performance, capacity, and system stability under real business operations, so that major events are supported smoothly.
3. Performance tuning
Use the multi-dimensional monitoring provided by ARMS to monitor application and physical machine metrics in real time during stress testing, helping developers locate and troubleshoot problems quickly and improve system performance.
4. Rate limiting and anti-scalping
Use Alibaba Cloud Application High Availability Service (AHAS) for rate limiting and degradation, so that unexpected traffic does not bring the system down. Hotspot rules can also be configured: beyond a given threshold, traffic for a hot item is made to queue. For example, once purchases of the same item exceed 100 requests within 1 second, the remaining requests wait.
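The hotspot rule in the example (more than 100 requests for the same item within 1 second must wait) can be illustrated with a small per-item counter. This is a generic fixed-window sketch, not the AHAS implementation or its API. The class name HotspotLimiter and its parameters are hypothetical.

```python
import time


class HotspotLimiter:
    """Per-item fixed-window limiter (illustrative only)."""

    def __init__(self, threshold=100, window_seconds=1.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window_seconds
        self.clock = clock
        self._windows = {}  # item_id -> (window_start, count)

    def try_acquire(self, item_id):
        """Return True if the request may pass now, False if it must wait."""
        now = self.clock()
        start, count = self._windows.get(item_id, (now, 0))
        if now - start >= self.window:
            # The previous window has ended: start a fresh one.
            start, count = now, 0
        if count >= self.threshold:
            self._windows[item_id] = (start, count)
            return False
        self._windows[item_id] = (start, count + 1)
        return True
```

A caller that receives False would put the request into a waiting queue rather than reject it, matching the "remaining requests wait" behavior described above. Each item is counted separately, so one hot item does not throttle the others.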
5. Asynchronous decoupling and peak shaving
Message Queue for Apache RocketMQ is a distributed messaging middleware that Alibaba Cloud builds on Apache RocketMQ, offering low latency, high concurrency, high availability, and high reliability. It provides asynchronous decoupling and peak shaving for distributed application systems, along with the features Internet applications need, such as massive message accumulation, high throughput, and reliable retries.
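Peak shaving can be illustrated without any broker at all. In the sketch below, an in-process deque stands in for the message queue, and the function name drain and the per-tick limit are assumptions. It does not use the RocketMQ client API. It only shows the principle: producers enqueue a burst immediately, while the consumer processes at a rate the downstream system can sustain.

```python
from collections import deque


def drain(queue, handler, max_per_tick):
    """Process at most `max_per_tick` queued messages; return how many ran."""
    processed = 0
    while queue and processed < max_per_tick:
        handler(queue.popleft())
        processed += 1
    return processed


def ticks_to_drain(queue, handler, max_per_tick):
    """Run drain() repeatedly until the queue is empty; return the tick count."""
    ticks = 0
    while queue:
        drain(queue, handler, max_per_tick)
        ticks += 1
    return ticks
```

For example, a burst of 1,000 orders consumed at 100 per tick is spread over 10 ticks. The order service sees a flat load instead of the spike, and the producers are decoupled from its processing speed.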
6. Elasticity
Users who run promotions periodically can deploy applications quickly with Serverless App Engine (SAE) and use scheduled elastic scaling to scale out automatically before an event starts and to scale in and release resources automatically after it ends, maximizing resource utilization without manual intervention.
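The rule behind scheduled scaling is simple enough to sketch. In SAE the schedule is configured in the product itself, so the code below is only an illustration of the decision logic. The function name desired_instances, the window format, and the dates and instance counts are all made up.

```python
from datetime import datetime


def desired_instances(now, windows, baseline):
    """Return the instance count for `now`.

    windows: list of (start, end, instances) tuples. The first window
    whose half-open interval [start, end) contains `now` wins.
    """
    for start, end, instances in windows:
        if start <= now < end:
            return instances
    return baseline


# Hypothetical schedule: scale out to 50 instances shortly before the
# event and fall back to the baseline of 5 once it is over.
windows = [(datetime(2020, 11, 10, 23, 30), datetime(2020, 11, 11, 2, 0), 50)]
```

Because the window is half-open, the instance count returns to the baseline exactly at the end time, which corresponds to resources being released automatically after the event.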
Four. Best practices for full-link stress testing
The fourth speaker was Ji Yuan, an Alibaba Cloud architect with 12 years of experience in the IT industry. In the energy industry and the Internet ToB sector, he has lived through and practiced the transitions from SOA to microservice architecture to cloud native architecture. He has a deep understanding of, and hands-on experience with, Internet cloud native architecture, microservice management and governance, and high availability optimization, and has many times helped Alibaba Cloud's industry customers complete a full cloud native transformation of their system architecture.
According to Ji Yuan, big promotions and seckill events are the best way to maximize the traffic dividend, yet many companies still cannot benefit from it. There is one root cause: their systems cannot withstand the impact of heavy traffic. The core difficulty is that most system performance problems stem from uncertain factors.
The system has many links from front to back, and any of them can become the bottleneck, weak point, or constraint of the whole system. Different communication protocols, data formats, and specifications make the overall distributed architecture extremely complex. In addition, in a microservice architecture, service call chains are long in both the north-south and east-west directions, so a fault in a single service can easily trigger a domino or avalanche effect.
Most products today present a single entry point, one app, to users, but the content behind it is made up of multiple product lines working together. In practice, the teams responsible for different modules and product lines each have their own testing team, responsible only for the quality of that module or product line. When the modules are combined, further problems arise from integration and coordination; as the saying goes, you cannot know the whole leopard from a single spot. These uncertainties pose huge challenges to user experience, brand reputation, and product revenue.
To solve the fundamental problem, we need a means of identifying as many of these uncertain factors as possible. Over the whole life cycle of a system there are two scenarios: the instantaneous traffic peak scenario and the long-term steady-state scenario.
1. Instantaneous traffic peak scenario
This scenario corresponds to big promotions and seckill events. We can run full-link stress tests in the production environment, simulating real user traffic as closely as possible and pushing the load up continuously until the system's performance constraints are found and optimized, then repeat the process. There are two key points here: first, the traffic source must resemble real user traffic; second, the stress test must run in the production environment. Together they recreate a real big-promotion scenario in which the system's uncertainties can be discovered.
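The "keep pushing the load up until a constraint appears" loop can be sketched as a step ramp. This is an illustration, not how PTS is driven. The measure callback (which would run one load step and report p99 latency and error rate), the thresholds, and the function name find_capacity are all hypothetical.

```python
def find_capacity(measure, start_rps, step_rps, max_rps,
                  max_p99_ms=500.0, max_error_rate=0.01):
    """Raise the target load step by step until an SLO is violated.

    `measure(rps)` must return (p99_latency_ms, error_rate) for a run
    at that request rate. Returns the highest rate that stayed within
    the limits, or 0 if even the first step failed.
    """
    last_ok = 0
    rps = start_rps
    while rps <= max_rps:
        p99_ms, error_rate = measure(rps)
        if p99_ms > max_p99_ms or error_rate > max_error_rate:
            break  # bottleneck reached: stop, optimize, then rerun
        last_ok = rps
        rps += step_rps
    return last_ok
```

After each run, the team fixes whatever constraint stopped the ramp and repeats the test, which matches the find, optimize, and repeat cycle described above.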
2. Long-term steady-state scenario
Solidify the full-link stress testing plan and, through a unified console, periodically run fault drills and check the quality of releases and configuration changes. In this way, the traffic peak scenario is used to identify as many uncertain factors as possible, and the long-term steady-state scenario is used to surface the system's uncertainties on a regular basis. These are then analyzed and resolved to optimize system stability and availability.
On the load generation side, Alibaba Cloud PTS uses edge nodes and CDN nodes across the country to simulate traffic from different regions and carriers. It can generate traffic on the scale of hundreds of thousands within a short time, and the regions and carriers can be set dynamically. The PTS console offers a visual way for customers to create business scenarios easily, and it integrates the native JMeter engine, so JMeter scripts can be imported quickly and existing stress-testing tools migrated seamlessly.
On the traffic isolation side, Alibaba Cloud provides a non-intrusive Agent that gives the business system traffic isolation capability without any code changes. Through interface mock rules, shadow table rules, and stress-test data offset settings configured in the PTS traffic control console, the Agent isolates both stress-test traffic and stress-test data.
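The shadow table idea can be illustrated with a few lines of routing logic. The real Agent applies such rules without code changes, so the sketch below is only a model of what a shadow table rule achieves. The header name x-stress-test, the shadow_ prefix, and both function names are hypothetical.

```python
SHADOW_PREFIX = "shadow_"            # hypothetical naming rule
STRESS_HEADER = "x-stress-test"      # hypothetical marker on test traffic


def is_stress_request(headers):
    """Return True if the request carries the stress-test marker."""
    return headers.get(STRESS_HEADER) == "1"


def resolve_table(table, headers):
    """Route stress-test traffic to the shadow copy of `table`."""
    if is_stress_request(headers):
        return SHADOW_PREFIX + table
    return table
```

With a rule like this, an order written by stress-test traffic lands in shadow_orders while real user orders continue to go to orders, so production data stays clean even though the test runs in the production environment.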
Alibaba has now moved fully onto the cloud with a cloud native architecture. Through large-scale use of cloud native products, including Container Service ACK, Message Queue RocketMQ, the microservice platform EDAS, monitoring with ARMS, and performance testing with PTS, it has gained dividends in cost, stability, and R&D and operations efficiency. At the same time, the Double 11 big-promotion scenario has become the proving ground for Alibaba Cloud's cloud native technologies and products, creating greater value for Alibaba Cloud customers.
This article is original Alibaba Cloud content and may not be reproduced without permission.