
Delivering driving simulation capabilities to help customers build confidence in stability

2020-12-07 19:19:01 Aliyun yunqi

Summary: Delivering driving simulation capabilities to help customers build confidence in stability.


1. Background

Our technical service team knows this situation well: an emergency call arrives at any time or place, and we nervously start investigating the problem, troubleshooting, and restoring service. Hard disk failures, network outages, large numbers of services that never reach their final state, high resource water levels, traffic surges: the list of problems is long. A small change or an unexpected condition can trigger a butterfly effect that leads to large-scale system chaos, failures, and service interruptions, with serious impact on the customer's business. Failures are often costly, yet because of the nature of distributed systems, unexpected emergencies of all kinds are inevitable, and manpower alone cannot prevent them 100 percent of the time. Rather than worrying about what might go wrong, it is better to move from passive to proactive: simulate in advance the situations that may arise in the online environment, and test whether the system is fault tolerant and can still provide service when problems occur.
The original intention of driving simulation is to use experiments to proactively find the weak links in a system, so that people build confidence that a complex distributed system can withstand emergencies in production. Any sufficiently complex system inevitably carries unexpected hidden debt. Hidden debt is a byproduct of the growing complexity of modern software systems, and it threatens the system's normal operation. The point of driving simulation is to help you discover hidden debt and deal with it before it becomes a serious problem, so that damage is avoided.

2. Driving simulation

Driving simulation is a technical service created by the GTS-SRE chaos engineering team. It follows the principles of chaos engineering and incorporates the team's many years of internal experience with high-availability systems. It provides a rich set of fault scenarios and exception simulations that help distributed systems improve fault tolerance and recoverability, with the aim of helping more government and enterprise customers build stability.

2.1 Basic service content

 

Figure 1: Basic service content framework of driving simulation

2.1.1 Chaos engineering course training

The chaos engineering training course covers three topics in eight chapters and 20 sections. It starts from the theory of chaos engineering, incorporates Alibaba's internal practice, and summarizes classic cases from the hybrid cloud historical fault database and fault capability database. It explains in plain terms how to carry out a chaos engineering experiment, including the challenges to be faced, the preparation required, the tools used, and how driving simulation is implemented. It serves as an introduction to chaos engineering practice. The course is intended to help users achieve the following goals:

  • Understand the basic concepts, principles, prerequisites, functions, and applications of chaos engineering;
  • Understand the hybrid cloud toolkit used in chaos engineering practice;
  • Understand how chaos engineering experiments are implemented in practice;
  • Understand how to run driving simulation drills in each scenario;
  • Participate in the community co-construction of hybrid cloud chaos engineering.

2.1.2 Driving simulation toolkit

  • Experiment injection tool - Apsara Chaos Platform
    Apsara Chaos Platform (abbreviated ACP) is a hybrid cloud experiment injection tool that follows the principles of chaos engineering and the chaos experiment model. It helps enterprises improve the redundancy and fault tolerance, fault isolation, and observability of distributed systems, and it is designed to stay flexible and easy to use while enterprises move to the cloud or migrate to cloud-native systems. ACP supports a rich set of exception simulation scenarios: boundary exception simulation (such as Docker exceptions), application-layer simulation (such as process hangs and abnormal exits), system-layer simulation (such as consumption of CPU, memory, disk space, and other system resources), and hardware-layer simulation (such as network card jitter and out-of-band restarts). Exception-type nodes can be orchestrated into an exception scenario, so that these simulation capabilities can be instantiated.
  • Monitoring and alarm tool - TAM Alarm Center
    TAM Alarm Center (abbreviated TAC) is a one-stop alarm operations platform for hybrid cloud, carefully built by the SRE team. It covers the cloud products, big data, and cloud instances in a hybrid cloud, as well as the on-site applications on the user side, and provides alarm lifecycle management and alarm delivery. It helps hybrid cloud deployments discover and locate anomalies quickly. At present, more than 100 projects in China have deployed TAC, and more than 40% of them can send alarms through DingTalk, SMS, or email. This improves alarm handling efficiency, reduces failures caused by alarms not being handled in time, improves the quality of project operations, and reduces project labor cost.
  • Fault diagnosis tool - SRE-CLI
    Site Reliability Engineer - Command-Line Interface (abbreviated SRE-CLI) is a command-line fault diagnosis tool. Given an abnormal scenario, it matches check items from a scenario library, automatically diagnoses and confirms the cause of the exception, archives the exception details, and finally offers suggestions. SRE-CLI helps hybrid cloud deployments quickly discover and locate the cause of abnormal problems, providing a full "consultation" service that runs from problem description through fault location and diagnosis to a diagnostic report and a proposed solution. At present, 40% of the issues in the cloud platform base product Tianji can be located and diagnosed with SRE-CLI. This reduces the time spent on manual investigation, improves fault handling efficiency, improves the quality of project operations, and reduces project labor cost.
  • Inspection and problem diagnosis tool - Tongque
    Tongque focuses on intelligent inspection and problem diagnosis and is the primary tool in the daily work of TAMs and on-site service teams. It frees TAMs and on-site teams from tedious daily inspection so they can put their energy into more valuable customer service. By connecting information from the cloud platform side, the tenant side, and the application side, it assists application operation and optimization, and it uses tooling to improve the capability and speed of on-site problem analysis and location. Tongque is currently a standard deliverable of hybrid cloud Enterprise Edition and already covers 100% of V3 platform sites. Its main function today is inspection. Going forward, capabilities for fault resolution, high-frequency changes, and problem diagnosis will be added gradually, and Tongque's basic capabilities will be opened up so that the experience of product, on-site, TAM, and other personnel continuously accumulates in the system, creating an operations ecosystem centered on Tongque.
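
The scenario-library matching described for SRE-CLI above can be illustrated with a small sketch. This is not SRE-CLI's actual implementation or interface, which the article does not show. The scenario names, check functions, thresholds, and the `diagnose` helper below are all hypothetical, chosen only to show the idea of selecting check items by scenario and turning findings into a suggestion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Facts = Dict[str, float]                   # observed metrics for one host (hypothetical shape)
Check = Callable[[Facts], Optional[str]]   # returns a finding, or None if the item is healthy


def check_disk(facts: Facts) -> Optional[str]:
    # Hypothetical threshold: flag disks that are at least 90% full.
    if facts.get("disk_used_pct", 0.0) >= 90.0:
        return "disk usage is at or above 90%"
    return None


def check_cpu(facts: Facts) -> Optional[str]:
    # Hypothetical threshold: flag CPUs that are at least 95% busy.
    if facts.get("cpu_used_pct", 0.0) >= 95.0:
        return "CPU usage is at or above 95%"
    return None


# Hypothetical scenario library: abnormal scenario -> check items to run.
SCENARIO_LIBRARY: Dict[str, List[Check]] = {
    "service_slow": [check_cpu, check_disk],
    "write_failure": [check_disk],
}


@dataclass
class Report:
    scenario: str
    findings: List[str]
    suggestion: str


def diagnose(scenario: str, facts: Facts) -> Report:
    """Run the check items registered for a scenario and summarize the result."""
    checks = SCENARIO_LIBRARY.get(scenario, [])
    findings = [msg for msg in (check(facts) for check in checks) if msg is not None]
    if findings:
        suggestion = "investigate: " + "; ".join(findings)
    elif checks:
        suggestion = "no check item matched; escalate to manual investigation"
    else:
        suggestion = "unknown scenario; add it to the scenario library"
    return Report(scenario, findings, suggestion)


report = diagnose("write_failure", {"disk_used_pct": 97.0, "cpu_used_pct": 10.0})
print(report.suggestion)
```

Keeping the library as plain data means that a new abnormal scenario can be covered by registering check items rather than by changing the diagnosis loop.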

2.2 Service cases

2.2.1 Background

A power group customer wanted to verify the high availability of the cloud platform and Alibaba's emergency response capability. After learning that the SRE-TAM team offers a professional and mature driving simulation capability, the customer applied for the service and invited the technical service team to its business site for enablement training and drill services.

2.2.2 Goals

Through this service, the technical service team hoped to raise the customer's awareness of chaos engineering concepts and of the cloud platform, while helping on-site colleagues and the customer improve their emergency handling ability and strengthen their confidence in platform stability.

2.2.3 Service Overview

  • The technical service team surveyed and collected information about the project site;
  • Customer enablement: the team delivered the chaos engineering training course (three topics: overview of chaos engineering, introduction to the hybrid cloud chaos toolkit, and driving simulation);

    Figure 2: On-site customer enablement training

  • Tool preparation and deployment: the team deployed the ACP and SRE-CLI tools at the business site and upgraded the already-deployed TAC tool;
  • The customer provided a time window for the drill, and the technical service team ran five driving simulation scenarios following the standard process (pre-check, fault injection, alarm check, troubleshooting and recovery);
  • The team guided the customer in using the tools for hands-on drills and answered questions;

    Figure 3: On-site guidance for customer operation and Q&A
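
The standard process named in the service overview (pre-check, fault injection, alarm check, troubleshooting and recovery) can be sketched as a small driver function. This is a minimal illustration under stated assumptions, not how ACP or TAC work: `run_drill` and its four callback parameters are hypothetical names, and the "system" here is just an in-memory dictionary.

```python
from typing import Callable, Dict, List


def run_drill(
    pre_check: Callable[[], bool],
    inject_fault: Callable[[], None],
    alarm_fired: Callable[[], bool],
    recover: Callable[[], None],
) -> List[str]:
    """Run one drill through the four steps and return a log of what happened."""
    log: List[str] = []
    # Step 1: never inject into a system that is already unhealthy.
    if not pre_check():
        log.append("pre-check failed: drill aborted before injection")
        return log
    log.append("pre-check passed")
    # Step 2: inject the fault.
    inject_fault()
    log.append("fault injected")
    try:
        # Step 3: confirm that monitoring noticed the fault.
        log.append("alarm fired" if alarm_fired() else "alarm missing")
    finally:
        # Step 4: always recover, even if the alarm check itself raised.
        recover()
        log.append("recovered")
    return log


# Toy in-memory system used only to exercise the flow.
state: Dict[str, bool] = {"healthy": True}
steps = run_drill(
    pre_check=lambda: state["healthy"],
    inject_fault=lambda: state.update(healthy=False),
    alarm_fired=lambda: not state["healthy"],
    recover=lambda: state.update(healthy=True),
)
print(steps)
```

Putting recovery in a `finally` block reflects the design choice that a drill should leave the system restored even when one of its own checks fails. An "alarm missing" entry in the log is itself a useful finding, since it points to a monitoring gap.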

2.2.4 Results

This driving simulation service strengthened the customer's understanding of chaos engineering experiments and deepened its understanding of the cloud platform's high availability:

  • Through hands-on guidance and cooperation with the customer across the whole process, from fault injection to fault detection and finally troubleshooting, the customer gained a clear sense of the impact scope and blast radius of a driving simulation and a working understanding of the principles of chaos engineering;
  • After the driving simulation demonstration, the customer strongly recognized the self-healing ability and robustness of the cloud platform and showed a relatively strong intention to make drills regular and institutionalized going forward;
  • The customer is happy to participate in the co-construction of hybrid cloud driving simulation.

3. Summary and reflection

A great man once said, "Practice is the only criterion for testing truth." However, when rolling out driving simulation, the ideal experimental state cannot be reached in a single step. What path can we follow to carry out the first driving simulation experiment?
First, driving simulation as an experiment is meant to explore "unknown" risks and find hidden debt, but in practice we may as well start from the "known". By thinking about and discussing "where is a problem most likely to occur", we assess the system's potential weaknesses and the expected results, which gives a sense of drill priority: which potential problems are more likely to occur or would have more serious consequences. The team can record and summarize historical fault types, their frequency of occurrence, and the corresponding dependencies, and so gain a preliminary answer to "where is it most likely to go wrong". When you suspect something might go wrong, injecting faults for those scenarios first is a good start.
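
One simple way to turn historical fault records into a drill priority is to score each scenario by how often it occurred and how severe it was. The sketch below assumes a score of occurrences multiplied by severity. That formula, the 1 to 5 severity scale, and the sample records are illustrative assumptions, not something the team's fault database is known to use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultRecord:
    scenario: str
    occurrences: int  # how many times this fault type appears in the history
    severity: int     # 1 (minor) to 5 (severe), on the team's own scale


def prioritize(records: List[FaultRecord]) -> List[FaultRecord]:
    """Order scenarios so the most frequent and most severe come first."""
    return sorted(records, key=lambda r: r.occurrences * r.severity, reverse=True)


# Made-up history, for illustration only.
history = [
    FaultRecord("disk_full", occurrences=6, severity=2),
    FaultRecord("network_partition", occurrences=2, severity=5),
    FaultRecord("process_hang", occurrences=4, severity=4),
]

for record in prioritize(history):
    print(record.scenario, record.occurrences * record.severity)
```

A product of the two factors is only a starting point. A team might instead weight severity more heavily, or add the number of dependent services as a third factor.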
Second, we need to form a hypothesis before running the experiment, which is a great thinking exercise for the team. By discussing the scenario, you can hypothesize about the expected results before running it: for example, what impact will this failure have on customers, on the business, and on your dependencies? After running the first experiment, you will see one of two outcomes: either the system is verified to be resilient to the injected fault, or you find a problem that needs to be fixed. Both outcomes are good. In the first case, you gain confidence in the system and its behavior; in the second, you find the problem before it causes downtime.
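
Writing the hypothesis down as data before the experiment makes the two outcomes easy to tell apart afterwards. The sketch below assumes the hypothesis is expressed as "a metric stays within a tolerated relative drop from its baseline during the fault". The `Hypothesis` class, the metric name, and the numbers are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    metric: str               # name of the steady-state metric being watched
    baseline: float           # value measured before the fault is injected
    max_relative_drop: float  # tolerated drop, e.g. 0.01 means 1%

    def holds(self, observed: float) -> bool:
        """True if the value observed during the fault stays within tolerance."""
        floor = self.baseline * (1.0 - self.max_relative_drop)
        return observed >= floor


def next_step(hypothesis: Hypothesis, observed: float) -> str:
    # Both outcomes are useful: one builds confidence, the other finds a problem early.
    if hypothesis.holds(observed):
        return f"{hypothesis.metric}: hypothesis verified, confidence increased"
    return f"{hypothesis.metric}: hypothesis refuted, fix the weakness and rerun"


h = Hypothesis(metric="request_success_rate", baseline=0.999, max_relative_drop=0.01)
print(next_step(h, observed=0.995))
print(next_step(h, observed=0.950))
```

Agreeing on the tolerance before the run keeps the team from rationalizing a bad result after the fact.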
Finally, driving simulation experiments advocate that "the closer to the production environment, the more realistic and the better". From a practical point of view, however, we think this depends on how far the organization accepts the idea. A relatively gentle path for all parties is therefore to move gradually from offline environments to production. But for distributed systems, different deployments and different traffic produce different results, and real verification is only possible in the production environment; otherwise the value of these practices is greatly weakened.
Just as people need vaccination to guard against unknown diseases, system stability is inseparable from driving simulation. We hope our team can deliver more of its driving simulation capability to more customers and strengthen their confidence in antifragility and system stability!

 

 

This article is original content from Alibaba Cloud and may not be reproduced without permission.

Copyright notice
This article was written by [Aliyun yunqi]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2020/11/20201112221016743j.html