
Not Only Ability but Also Culture: On the Safe Release of IT Systems

2022-05-14 14:02:37 InfoQ

Introduction

In the previous article, "If You Only Have One Week, How Do You Quickly Improve the Stability of an Online System?", we summarized the factors that affect the quality and stability of IT systems and ways to improve both quickly. Production accidents are related to the entire R&D and operations life cycle: architecture design, code development, online changes, business configuration, operations and maintenance, and external dependencies. Architecture and code problems (code bugs plus architecture design) account for about 30% of accidents. The other roughly 70% arise between the coding of the program and its actual operation, and have little to do with the architecture or code itself. These non-architecture, non-code problems mainly stem from IT change errors, business parameter configuration errors, insufficient performance capacity, external system dependency errors, and IT operations errors. Of these, IT change errors and business parameter changes are both related to version releases. Combined, the two factors account for more than 20% of all production accidents and about one third of the non-architecture, non-code problems. Today, let's summarize our practice and experience with safe release.

What is a release?

Before going further, we need to agree on a few definitions to make sure we are talking about the same thing.
First, what is a release? A version release is the collection of everything IT implements that changes the behavior of the production system (even if it is only a change to a system parameter), together with the processes involved. It includes both the version itself and the release process.
Second, what processes does a release involve? The author's team believes that a version release starts from the "version requirement" and ends when the release has been executed and "release tracking" is complete. At a minimum, it includes clarifying the release requirements, developing the release strategy, preparing the version to be released, executing the release, and tracking the release, as shown in the figure below.
[Figure: the version release life cycle, from version requirement to release tracking]
Third, version classification. When practicing safe release, versions of different release scales call for attention to different content and different safety processes. Based on the practical experience of the author's team, we roughly classify versions by release scale as follows:
  • Small-scale version. The release requirement can be a small feature or a bug fix, and generally involves only one system module or one microservice.
  • Large-scale version. The release requirement can be a major technical upgrade, for example upgrading the system architecture from a monolithic application to a microservice architecture, or a major change to the data model and database structure. It can also be a large-scale business launch, for example a new business going from 0 to 1, which involves the cooperation of multiple upstream and downstream systems or services. The release itself may be completed in several stages.
  • Normal version. It sits between the small-scale and large-scale versions. It may change one or more (no more than 3) modules or services, involves no major change to the system architecture or business model, and can go live in a single release.
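The classification above can be read as a small decision rule. The following is a minimal Python sketch of one possible reading, not the author's team's actual tooling. The field names and the choice to treat a single changed service as "small" are assumptions made for illustration; the threshold of 3 comes from the text.

```python
from dataclasses import dataclass
from enum import Enum


class ReleaseScale(Enum):
    SMALL = "small"
    NORMAL = "normal"
    LARGE = "large"


@dataclass(frozen=True)
class ReleaseRequirement:
    # All field names are hypothetical; they only mirror the criteria in the text.
    changed_services: int            # number of modules or microservices changed
    major_architecture_change: bool  # e.g. monolith to microservices, big schema change
    new_business_launch: bool        # a business going from 0 to 1
    needs_multiple_releases: bool    # the rollout must be split into several releases


def classify_release(req: ReleaseRequirement) -> ReleaseScale:
    """Classify a release by scale, following the three categories in the text."""
    if (req.major_architecture_change or req.new_business_launch
            or req.needs_multiple_releases or req.changed_services > 3):
        return ReleaseScale.LARGE
    # Assumption: exactly one changed module or microservice counts as small.
    if req.changed_services <= 1:
        return ReleaseScale.SMALL
    return ReleaseScale.NORMAL
```

A rule like this is only a starting point. In practice the boundary between normal and large is a judgment call that the release owner makes together with the teams involved.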

How to practice safe release?

The author's team broke down the production accidents directly caused by version releases in the past (about 20% of all accidents) and obtained the following data.
[Figure: breakdown of production accidents directly caused by version releases, by cause]
Production problems caused by bugs in the code itself are not included here, because code bugs are classified separately when we count production accidents. Based on long-term practice, on the concept of the version release life cycle, and on the proportions in the accident classification above, the author's team takes different measures for versions of different scales to ensure safe release.
First, small-scale version release. A small-scale release generally does not involve upstream and downstream collaboration, and the release scheme can be as simple as "develop, test, release". In other words, the key to a small-scale release is to ensure that the code itself is of reliable quality and that the online deployment can be carried out without accidents. Historical accident analysis also tells us that code quality problems account for at least 30% of accidents, and that human operation errors during online deployment ("change operation errors" plus "dirty test data") account for at least 10%. The industry has many excellent practices for these two kinds of problems. Code review, strictly enforced quality gates, and automated functional and performance testing help ensure the quality of the code itself. A repeatable deployment pipeline can fully automate the deployment process and avoid human error. These are the problems that DevOps and continuous delivery set out to solve. There is a wealth of material on them, so we will not expand on them here.
Next, normal version release and large-scale release. The essential difference from a small-scale release is that these two types of release involve more system modules or services and need more upstream and downstream coordination. As the accident classification above shows, change errors and production accidents related to "release scheme errors" and "upstream and downstream coordination" account for at least 30%. If we consider only large-scale releases, which require a detailed release scheme and a detailed upstream and downstream coordination scheme, the proportion of production accidents caused by such problems is even higher. Therefore, for larger releases, in addition to the DevOps and continuous delivery practices that safeguard small-scale releases, it is even more important to formulate a sound release strategy and an upstream and downstream coordination mechanism, ensuring the safety of the release from the source. Below, we focus on these two kinds of problems.

Launch-oriented R&D process

Further analysis of the production accidents caused by release scheme errors or upstream and downstream coordination shows that these errors fall roughly into the following categories:
  • Large business requirements are handled by considering only system architecture changes and development scheduling. The release plan and the system architecture are not considered from the perspective of release.
  • The release strategy is formulated too late. The impact on associated modules is identified only just before the main requirement is released, so the system goes live in a crippled state or the launch is delayed.
  • The system or architecture is too complex. The assessment is inadequate or has omissions, and the system goes live with defects.
  • The release strategy is too aggressive. A one-off big-bang change is adopted, and the release rhythm and architecture adjustments are not fully considered from the perspective of business impact.
  • The release plan is not thought through and is a mere formality. There is no emergency rollback plan, or the rollback plan cannot be executed.
To solve the above problems and achieve safe release, it is necessary to establish a release-centered "launch-oriented R&D process". Below, we take a typical large-scale version release as an example to describe the key elements of this process.
Example: One of the company's main businesses runs two systems in parallel. The old system is a database-based monolithic application from the previous era. The new system is a distributed system based on in-memory computing. After a period of parallel operation and verification, the business side decides to migrate most of the remaining business traffic to the new system. The migration must fully account for the continuity of existing business, as well as latency, TPS, and capacity after migration. The data of the old and new systems is incompatible, and the new system spans multiple data centers.

  • Formulate an executable release strategy, and shift strategy formulation left.
For larger versions, the first thing to do once the release requirements are determined is to formulate an executable release strategy, rather than starting with architecture design. Sometimes the release requirements even need to be adjusted according to the release strategy. A good release strategy is a feasible one: it has clear release content and organization (the version), a release plan, and the organizational and staffing guarantees needed to carry out the plan.
  • The release strategy contains at least two parts: the "version scheme" and the "release plan". The version scheme is the set of system boundaries and module components determined by the version requirements. Whether or not a system or module has code or configuration changes, it is part of the version scheme as long as it is involved in fulfilling the requirements. The release plan identifies the stakeholders for the determined version and lays out the complete plan for implementing it. In the initial stage of formulating the release strategy, a detailed day-by-day plan may not be needed. The key is to identify the relevant parties and their responsibilities, as well as the organizational guarantees, beyond IT system changes, that are needed to implement the version.
  • The release strategy needs the approval and support of all direct participants. Direct participants are all parties involved in the "version scheme": business parties, product managers, the R&D team, the test team, the O&M team, and the infrastructure and security teams.
  • The release strategy should name a clear person in charge of the release.
Take the business migration requirement in the example. This release covers the old system, the new system, the business data in both systems and the corresponding database systems, and the infrastructure that runs them, such as data centers and networks. Because the data of the two systems is incompatible and the new system spans multiple data centers, the release plan must look beyond the changes the new system itself may need. It must focus on the migration scheme for business data, the guarantee scheme for the underlying infrastructure after the business moves to the new system, and the tracking and feedback mechanism after the business moves to the distributed system. Clearly, to complete the migration successfully, the ideal person in charge of the release is the product owner or the new system's owner, someone who can coordinate multiple parties. As soon as the above is determined, the person in charge of the release should convene all relevant parties to discuss the key issues in the plan thoroughly and form a consistent scheme and action plan.
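The parts of a release strategy described above (systems in scope, stakeholders and their responsibilities, a named owner, and approval from every direct participant) can be captured in a small record that reports what is still missing. This is a hypothetical Python sketch. The class, its fields, and the `blocking_issues` check are assumptions for illustration, not something the article prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseStrategy:
    # Hypothetical structure; field names are illustrative, not from the article.
    version_scheme: set[str]      # every system or module in scope, changed or not
    stakeholders: dict[str, str]  # direct participant -> responsibility in the release plan
    owner: str = ""               # the single person in charge of the release
    approvals: set[str] = field(default_factory=set)

    def approve(self, party: str) -> None:
        """Record approval from a direct participant."""
        if party not in self.stakeholders:
            raise ValueError(f"{party} is not a direct participant")
        self.approvals.add(party)

    def blocking_issues(self) -> list[str]:
        """Return the reasons this strategy is not yet executable (empty if none)."""
        issues = []
        if not self.version_scheme:
            issues.append("version scheme is empty")
        if not self.owner:
            issues.append("no release owner named")
        missing = sorted(set(self.stakeholders) - self.approvals)
        if missing:
            issues.append("missing approval from: " + ", ".join(missing))
        return issues


# Usage with the migration example; the names and responsibilities are made up.
strategy = ReleaseStrategy(
    version_scheme={"old system", "new system", "business data", "data centers"},
    stakeholders={"business": "confirm migration batches", "ops": "infrastructure guarantee"},
    owner="new-system lead",
)
strategy.approve("business")
print(strategy.blocking_issues())  # lists the approval still missing from "ops"
```

Keeping the scope as a set of systems, rather than a list of code changes, matches the point in the text that a system belongs to the version scheme even when it has no code or configuration changes.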

  • Adjust the architecture design and test scheme according to the release strategy.
The main reason for shifting the release strategy left, ahead of architecture design, is that the version scheme formulated in the release strategy often needs a matching architecture design and test scheme.
Take the business migration requirement in the example. There are at least two different release/migration schemes.
Scheme 1: Migrate the business data from the old system to the new system in one go. During the trial run after migration, ensure that the business data can be moved back to the old system in one go if the new system has problems.
Architecture design: Focus on designing the move-back scheme and tools, so that moving the business data back in one go is feasible and fast.
Test scheme: Focus on testing and rehearsing the migration and move-back schemes for business data, and on how long they take.
Scheme 2: Migrate the business data of the old system to the new system in batches, and route requests by customer at the customer access layer. During each batch migration, first route access requests to both the old and new systems at the same time. The new system accepts and processes each request but does not return a response; this is for verification only. After verification, switch over to real business traffic.
Architecture design: Add a customer access gateway that supports both the old and new systems, routes business access according to the batch strategy, and supports changing the routing logic at runtime. The old system may also need to be modified to make partial migration of the business feasible.
Test scheme: Focus on the performance and routing stability of the customer access gateway. When testing, simulate the real deployment of the target production environment as closely as possible. If the old system also needs architecture changes, a corresponding test scheme is needed too.
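The gateway in the second scheme can be pictured as a router with a per-batch mode that operators change at runtime. Below is a minimal, single-process Python sketch under several assumptions: the handler signature, the three mode names, and the idea of verifying by comparing the two responses are all illustrative. A real gateway would send the verification request asynchronously instead of calling the new system inline.

```python
from enum import Enum
from typing import Callable

Handler = Callable[[str, dict], dict]  # (customer_id, request) -> response


class Mode(Enum):
    OLD = "old"        # not migrated yet: the old system serves the request
    SHADOW = "shadow"  # old system serves; new system also processes, for verification only
    NEW = "new"        # verified: the new system serves the request


class MigrationGateway:
    def __init__(self, old: Handler, new: Handler, batch_of: dict[str, str]):
        self._old = old
        self._new = new
        self._batch_of = batch_of         # customer id -> migration batch name
        self._mode: dict[str, Mode] = {}  # batch name -> mode, changeable at runtime
        self.mismatches: list[tuple[str, dict]] = []

    def set_mode(self, batch: str, mode: Mode) -> None:
        """Change the routing for one batch without restarting the gateway."""
        self._mode[batch] = mode

    def handle(self, customer_id: str, request: dict) -> dict:
        batch = self._batch_of.get(customer_id)
        mode = self._mode.get(batch, Mode.OLD)  # unknown customers stay on the old system
        if mode is Mode.NEW:
            return self._new(customer_id, request)
        response = self._old(customer_id, request)
        if mode is Mode.SHADOW:
            # The new system's result is never returned to the customer.
            try:
                if self._new(customer_id, request) != response:
                    self.mismatches.append((customer_id, request))
            except Exception:
                self.mismatches.append((customer_id, request))
        return response
```

A batch would move from `Mode.OLD` to `Mode.SHADOW` during verification, and to `Mode.NEW` only once `mismatches` stays empty for that batch. Moving it back to `Mode.OLD` is the fallback path.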

  • Have the release owner organize a review of the detailed release execution plan as early as possible.
Once the changes for the version are basically complete, a detailed release execution plan should be drawn up in parallel with testing and verification. The person in charge of the release should convene all relevant parties to review it, in order to lock in release resources or make the necessary adjustments. The release execution plan should include at least the following basic elements:

When putting this execution plan into practice, it is best to have a standard checklist for each main item, and to manage all release and execution plans in a uniform way (for example, through an implementation schedule or a change management system), so that they can be tied into the release management mechanism and improved step by step.

  • Key steps that involve a lot of manual coordination and operation should be fully rehearsed.
For scheme 1 in the example, the one-time migration of business data and the move back may themselves involve multiple parties. Using the architecture design and the migration and move-back tools, all personnel should be organized to rehearse in advance, and more than once.
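Since the first scheme depends on moving data to the new system and back quickly, a rehearsal is more useful when each step is timed against an agreed time budget over several rounds. The following Python sketch is one hypothetical way to do that. The step names, the budgets, and the `rehearse` helper are assumptions; the steps passed in would wrap the team's real migration tools.

```python
import time
from typing import Callable


def rehearse(steps: dict[str, Callable[[], None]],
             budget_seconds: dict[str, float],
             rounds: int = 3,
             clock: Callable[[], float] = time.perf_counter) -> dict[str, float]:
    """Run every step `rounds` times and return the worst elapsed time per step.

    Raises RuntimeError if any step exceeded its budget in any round.
    """
    worst = {name: 0.0 for name in steps}
    for _ in range(rounds):
        for name, step in steps.items():
            start = clock()
            step()
            worst[name] = max(worst[name], clock() - start)
    over = sorted(name for name, t in worst.items() if t > budget_seconds[name])
    if over:
        raise RuntimeError(f"steps over budget: {over}")
    return worst


# Usage with placeholder steps; real steps would call the migration tools.
timings = rehearse(
    steps={"migrate": lambda: None, "move_back": lambda: None},
    budget_seconds={"migrate": 60.0, "move_back": 60.0},
)
```

Tracking the worst round rather than the average is a deliberate choice here: the rollback window has to hold on a bad day, not on a typical one.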

  • Completing the release execution does not mean the release is over. The release ends only when post-release tracking confirms that nothing has gone wrong.
This release is over only when every item in the "first-run checklist" has been checked and confirmed to be correct.
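The rule that a release ends only when every first-run check is confirmed can be made explicit in a small checklist object. This is an illustrative Python sketch. The class and method names are assumptions, and the items on the checklist would come from the team's own release execution plan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FirstRunChecklist:
    # item -> None (not checked yet), True (confirmed correct), False (problem found)
    results: dict[str, Optional[bool]] = field(default_factory=dict)

    def add(self, item: str) -> None:
        self.results[item] = None

    def record(self, item: str, ok: bool) -> None:
        if item not in self.results:
            raise KeyError(item)
        self.results[item] = ok

    def outstanding(self) -> list[str]:
        """Items that are unchecked or failed, in the order they were added."""
        return [item for item, ok in self.results.items() if ok is not True]

    def release_finished(self) -> bool:
        # An empty checklist never counts as a finished release.
        return bool(self.results) and not self.outstanding()
```

Treating an empty checklist as unfinished guards against the "mere formality" failure mode, where a release is closed without anything having been verified.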

Safe release should become a culture

The author's team has done a great deal of exploration, practice, and improvement, and has finally distilled some practical experience and methodology. You may have noticed that the "launch-oriented R&D process" advocated above only describes the key steps of a safe release and the core work of each step. It makes no suggestion on how to carry out this work or who should carry it out.
In fact, there are many ways to achieve the safe release described above. The author's team has practiced change review, accident tracking, accident analysis, stability-oriented drills, and architecture transformation, with both gains and losses. We found that the most important factor in making these processes and practices effective, and in maximizing release safety, is whether the person in charge of each version release, or its implementer, can carry out the process with high quality. This includes that person's own ability as well as the organization's ability. Therefore, if the "launch-oriented R&D process" helps organizations and people build the ability to release safely, then carrying out that process with high quality is a culture of safe release. To cultivate a culture of safe release, we need to adhere to the following principles:
Principle one: The person in charge of each version release is fully responsible for the safe release of that version.
Principle two: The organization or manager should create an appropriate mechanism so that the implementer of the version release can take full responsibility.
Principle three: Adhere to the "launch-oriented R&D process", and make release safety an important criterion for evaluating whether the R&D process and architecture design are sound.
Principle four: Take small, quick steps. Avoid large-scale releases as much as possible, and do very large releases only when you have to.

References

  • DevOps Practice Guide
  • Continuous Delivery

Copyright notice
This article was created by [InfoQ]. Please include the original link when reprinting. Thank you.
https://chowdera.com/2022/134/202205141342145066.html
