Delivering driving simulation capability to strengthen customers' confidence in stability
2020-12-07 19:19:01 【Aliyun yunqi】
Brief introduction: Delivering driving simulation capability to strengthen customers' confidence in stability
Our technical service team often finds itself in this situation: an emergency call comes in at any time and any place, and we nervously start investigating the problem, troubleshooting, and restoring service. Hard disk failures, network outages, many components failing to reach their final state, high resource water levels, traffic surges: the list of problems is long. A small change or an unexpected event can set off a butterfly effect that leads to large-scale system chaos, failures, and service interruptions, with serious impact on the customer's business. Failures often bring heavy losses, yet because of the nature of distributed systems, unexpected emergencies of all kinds are inevitable, and manpower alone cannot prevent all of them. Instead of worrying about what might go wrong, it is better to turn passivity into initiative: simulate in advance the situations that may occur in the online environment, and test whether the system is fault tolerant and can still provide service when problems arise.
The original intention of driving simulation is to use experiments to proactively find the weak links in a system, so that people can build confidence in a complex distributed system's ability to withstand emergencies in production. Any sufficiently complex system inevitably carries unexpected hidden debt. Hidden debt is a byproduct of the growing complexity of modern software systems, and it threatens the normal operation of the system. The point of driving simulation is to help you discover hidden debt and deal with it before it becomes a serious problem, so that damage is avoided.
2. Driving simulation
Driving simulation is a technical service created by the GTS-SRE chaos engineering team. It follows the principles of chaos engineering and draws on the team's many years of experience with high-availability systems. It provides rich fault scenarios and exception simulation, helps distributed systems improve fault tolerance and recoverability, and aims to help more government and enterprise customers build stability.
2.1 Basic service content
Figure 1: Basic service content framework of driving simulation
2.1.1 Chaos engineering course training
The chaos engineering training course covers three topics in eight chapters and 20 sections. It starts from the theory of chaos engineering, integrates Alibaba's internal practice, and summarizes classic cases drawn from the hybrid cloud historical fault database and fault capability database. It explains in simple terms how to carry out a chaos engineering experiment, including the challenges to be faced, the preparation required, the tools used, and how driving simulation is implemented. It serves as an introduction to chaos engineering practice. The course helps users achieve the following goals:
- Understand the basic concepts, principles, prerequisites, functions, and applications of chaos engineering;
- Understand the hybrid cloud toolkit used in chaos engineering practice;
- Understand the specific implementation method of chaos engineering experiments;
- Understand how to operate driving simulation drills in each scenario;
- Participate in the community co-construction of the hybrid cloud chaos project.
2.1.2 Driving simulation toolkit
- Fault injection tool: Apsara Chaos Platform
Apsara Chaos Platform (ACP) is a hybrid cloud fault injection tool that follows the principles of chaos engineering and the chaos experiment model. It helps enterprises improve the redundancy, fault tolerance, fault isolation, and observability of distributed systems, and it offers flexibility and ease of use while an enterprise moves to the cloud or migrates to a cloud-native system. ACP supports a rich set of exception simulation scenarios: boundary exception simulation (such as Docker exceptions), application-layer simulation (such as process suspension or abnormal exit), system-layer simulation (such as consumption of CPU, memory, disk space, and other system resources), and hardware-layer simulation (such as network card jitter or out-of-band restart). These simulation capabilities are instantiated as exception-type nodes orchestrated within an exception scenario.
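To make the idea of layered exception scenarios more concrete, here is a minimal Python sketch of how a scenario built from exception-type nodes could be modeled. It is not ACP's actual data model or API: the class names `Layer`, `FaultNode`, and `Scenario`, and all parameter names and values, are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The four simulation layers named in the text."""
    BOUNDARY = "boundary"        # e.g. Docker exceptions
    APPLICATION = "application"  # e.g. process suspension, abnormal exit
    SYSTEM = "system"            # e.g. CPU, memory, disk consumption
    HARDWARE = "hardware"        # e.g. network card jitter, out-of-band restart


@dataclass
class FaultNode:
    """One exception-type node in a scenario (hypothetical structure)."""
    name: str
    layer: Layer
    params: dict = field(default_factory=dict)


@dataclass
class Scenario:
    """An exception scenario: an ordered list of fault nodes (hypothetical)."""
    name: str
    nodes: list = field(default_factory=list)

    def add(self, node):
        self.nodes.append(node)
        return self  # return self so calls can be chained

    def layers(self):
        """Return the distinct layers this scenario touches, in first-seen order."""
        seen = []
        for node in self.nodes:
            if node.layer not in seen:
                seen.append(node.layer)
        return seen


# Illustrative scenario: fill a disk, then kill a process.
scenario = (
    Scenario("disk-full-then-process-exit")
    .add(FaultNode("fill-disk", Layer.SYSTEM, {"path": "/data", "percent": 95}))
    .add(FaultNode("kill-process", Layer.APPLICATION, {"signal": "SIGKILL"}))
)
```

Modeling each fault as a node with an explicit layer makes it easy to see at a glance which parts of the stack a drill will touch before anything is injected.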
- Monitoring and alarm tool: TAM Alarm Center
TAM Alarm Center (TAC) is a one-stop alarm operation and maintenance platform that the SRE team built for hybrid cloud. It covers the cloud products, big data, and cloud instances involved in hybrid cloud, as well as users' on-site applications, and provides alarm lifecycle management and alarm delivery. It helps hybrid cloud quickly discover and locate anomalies. At present, more than 100 projects in China have deployed TAC, and over 40% of them can send alarms through DingTalk, SMS, or email. This improves the efficiency of alarm handling, reduces failures caused by alarms not being handled in time, greatly improves the quality of project operation and maintenance, and reduces project labor costs.
- Fault diagnosis tool: SRE-CLI
Site Reliability Engineer - Command-Line Interface (SRE-CLI) is a command-line fault diagnosis tool. Given an abnormal scenario, it matches check items from its scene library, automatically diagnoses and confirms the cause of the exception, archives the exception details, and finally gives suggestions. SRE-CLI helps hybrid cloud quickly discover and locate the cause of abnormal problems, providing a full “consultation” service that runs from problem description through fault location and diagnosis to a diagnostic report and proposed solutions. At present, 40% of the problems in Tianji, the cloud platform's base product, can already be located and diagnosed with SRE-CLI, which reduces manual investigation time, improves the efficiency of fault handling, greatly improves the quality of project operation and maintenance, and reduces project labor costs.
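The scene-library matching described above can be sketched roughly as follows. This is only an illustration of the idea, not SRE-CLI's real implementation or interface: the `CheckItem` type, the `SCENE_LIBRARY` contents, the metric names, and the `diagnose` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckItem:
    name: str
    check: Callable[[dict], bool]  # returns True when the check passes
    suggestion: str                # advice reported when the check fails


# Hypothetical scene library: abnormal scene -> check items to run.
SCENE_LIBRARY: Dict[str, List[CheckItem]] = {
    "service_unreachable": [
        CheckItem("disk usage below 90%",
                  lambda m: m["disk_used_pct"] < 90,
                  "Free disk space or expand the volume."),
        CheckItem("process is running",
                  lambda m: m["process_alive"],
                  "Restart the process and inspect its exit log."),
    ],
}


def diagnose(scene: str, metrics: dict) -> dict:
    """Run every check item registered for the scene and build a report."""
    items = SCENE_LIBRARY.get(scene, [])
    failed = [item for item in items if not item.check(metrics)]
    return {
        "scene": scene,
        "checked": [item.name for item in items],
        "failed_checks": [item.name for item in failed],
        "suggestions": [item.suggestion for item in failed],
    }


# Illustrative input: the disk is nearly full, the process is still alive.
report = diagnose("service_unreachable",
                  {"disk_used_pct": 97, "process_alive": True})
```

Keeping the check items in a library keyed by scene means new diagnostic experience can be added as data, without changing the diagnosis loop itself.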
- Inspection and problem diagnosis tool: Tongque
Tongque focuses on intelligent inspection and problem diagnosis and is the primary tool in the daily work of TAM and the on-site service team. It frees TAM and the on-site service team from tedious daily inspection so that they can put their energy into more valuable customer service. By connecting information from the cloud platform side, the tenant side, and the application side, it assists application operation and optimization, and it uses tooling to improve the ability and speed of on-site problem analysis and location. Tongque is currently a standard output product of hybrid cloud Enterprise Edition and covers 100% of V3 platform sites. Today its main function is inspection. Capabilities for fault resolution, high-frequency changes, and problem diagnosis will be added gradually, and Tongque's basic capabilities will be opened up so that the experience of product, on-site, and TAM personnel continuously accumulates in the system, creating an operation and maintenance ecosystem centered on Tongque.
2.2 Service cases
A power group customer wanted to verify the high availability of the cloud platform and Alibaba's emergency response capability. After learning that the SRE-TAM team offers a professional and mature driving simulation capability, the customer applied for the service and invited the technical service team to its business site for enablement training and drill services.
2.2.2 Goals
Through this service, the technical service team hoped to deepen the customer's understanding of chaos engineering and of the cloud platform, help on-site colleagues and the customer improve their emergency handling ability, and boost confidence in platform stability.
2.2.3 Service Overview
- The technical service team investigated and collected information about the project site;
- Empowered the customer by teaching the chaos engineering training course (three topics: an overview of chaos engineering, an introduction to the hybrid cloud chaos toolkit, and driving simulation);
Figure 2: On-site customer enablement training
- Prepared and deployed tools: deployed ACP and SRE-CLI at the business site and upgraded the already deployed TAC;
- The customer provided a time window for the drill, and the technical service team ran five driving simulation scenarios following the standard process (preliminary inspection, fault injection, alarm check, troubleshooting and recovery);
- Guided the customer in using the tools for hands-on drills and answered questions.
Figure 3: On-site guidance of customer operations and Q&A
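The standard drill process mentioned in the list above (preliminary inspection, fault injection, alarm check, troubleshooting and recovery) can be sketched as a small Python function. This is an illustrative outline under assumed interfaces, not the team's actual tooling: `run_drill` and the four callables passed to it are hypothetical stand-ins for the real inspection, injection, alarm, and recovery steps.

```python
def run_drill(pre_check, inject, alarm_fired, recover):
    """Run one drill through the four standard steps and return a step log."""
    log = []
    # 1. Preliminary inspection: never inject into a system that is already unhealthy.
    if not pre_check():
        log.append("pre-check failed, drill aborted")
        return log
    log.append("pre-check passed")
    try:
        # 2. Fault injection.
        inject()
        log.append("fault injected")
        # 3. Alarm check: did monitoring notice the fault?
        log.append("alarm fired" if alarm_fired() else "alarm missing")
    finally:
        # 4. Troubleshooting and recovery runs even if an earlier step raised.
        recover()
        log.append("recovered")
    return log


# A toy in-memory "system" used only to exercise the flow.
state = {"healthy": True, "fault": False}
log = run_drill(
    pre_check=lambda: state["healthy"],
    inject=lambda: state.update(fault=True),
    alarm_fired=lambda: state["fault"],
    recover=lambda: state.update(fault=False),
)
```

Putting recovery in a `finally` clause reflects the point of running drills inside an agreed time window: whatever happens during injection or alarm checking, the system should be restored before the window closes.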
This driving simulation service strengthened the customer's understanding of chaos engineering experiments and of the high availability of the cloud platform:
- Through hands-on guidance and cooperation across the whole process, from fault injection to fault detection to troubleshooting, the customer gained a clear sense of the impact scope and blast radius of driving simulation and a working understanding of the principles of chaos engineering;
- After the driving simulation demonstration, the customer strongly recognized the self-healing ability and robustness of the cloud platform and showed a strong intention to make drills regular and institutionalized;
- The customer is happy to take part in the co-construction of hybrid cloud driving simulation.
3. Summary and reflection
As the saying goes, “Practice is the sole criterion for testing truth.” In rolling out driving simulation, however, the ideal state of experimentation cannot be reached in one step. What path can we follow to carry out the first driving simulation experiment?
First, driving simulation experiments are meant to explore “unknown” risks and find hidden debt, but in practice we may as well start from the “known”. By thinking through and discussing “where is a problem most likely to occur”, we assess the system's potential weaknesses and the expected results, which gives a sense of drill priority: which potential problems are more likely to occur or have more serious consequences. The team can record and summarize historical fault types, their frequency, and the corresponding dependencies to get a preliminary answer. When you suspect something might go wrong, injecting faults in that scenario first is a good start.
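One simple way to turn that discussion into a drill priority list is to score each fault type by how often it has occurred and how severe its consequences are. The sketch below assumes a frequency times severity score; the fault history, the severity weights, and the `prioritize` function are invented for illustration and are not taken from any real fault database.

```python
from collections import Counter

# Hypothetical fault history: one entry per past incident, by fault type.
history = ["disk_full", "network_jitter", "disk_full", "process_hang",
           "disk_full", "network_jitter"]

# Hypothetical severity weights (1 = minor, 5 = service interruption).
severity = {"disk_full": 3, "network_jitter": 2, "process_hang": 5}


def prioritize(history, severity):
    """Rank fault types by frequency x severity, highest score first."""
    counts = Counter(history)
    scores = {fault: counts[fault] * severity.get(fault, 1) for fault in counts}
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = prioritize(history, severity)
# ranking == [("disk_full", 9), ("process_hang", 5), ("network_jitter", 4)]
```

In this toy data the rare but severe process hang ranks above the more frequent network jitter, which is exactly the kind of trade-off the priority discussion is meant to surface.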
Second, we need to form a hypothesis before the experiment is run, which is a great thinking exercise for the team. By discussing the scenario, you can state the expected results before running it: for example, what impact will this failure have on customers, the business, and your dependencies? After running the first experiment, you will see one of two outcomes: either the system is verified to be resilient to the injected fault, or you find problems that need to be fixed. Both results are good. In the first case, you gain confidence in the system and its behavior; in the second, you find a problem before it causes downtime.
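Writing the hypothesis down before the run makes the two outcomes easy to tell apart afterwards. A minimal sketch, assuming the hypothesis can be expressed as a threshold on a single metric; the `Hypothesis` class, the metric name, and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    fault: str          # the fault that will be injected
    metric: str         # the metric expected to stay within bounds
    max_allowed: float  # threshold expected to hold while the fault is active


def evaluate(hypothesis, observed):
    """Compare the observed metric with the hypothesis and name the outcome."""
    if observed <= hypothesis.max_allowed:
        return "verified: system is resilient to " + hypothesis.fault
    return "problem found: " + hypothesis.metric + " exceeded threshold"


# Illustrative hypothesis: restarting one node keeps the error rate at or below 1%.
h = Hypothesis(fault="one node restart", metric="error_rate_pct", max_allowed=1.0)
```

For example, `evaluate(h, 0.4)` reports the hypothesis as verified, while `evaluate(h, 3.2)` reports a problem to fix. Either result is useful, as the text notes.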
Finally, driving simulation experiments advocate that “the closer to the production environment, the more realistic and the better”. From a practical point of view, though, this depends on how well the organization accepts the idea, so a path that is relatively gentle for all parties is to move gradually from offline environments to production. For distributed systems, however, different deployments and different traffic produce different results. Only the production environment allows real verification; otherwise the value of these practices is greatly weakened.
Just as people need vaccination to guard against unknown diseases, system stability is inseparable from driving simulation. We hope our team can deliver more of its driving simulation capability to more customers and strengthen their confidence in antifragility and system stability!
This article is original content from Alibaba Cloud. Reprinting without permission is prohibited.