"Two places, three centers" and "three places, five centers" sound wonderful, but the reality is cruel

2020-12-07 17:04:42 bystander

Brother Yun, December 5, 2020


One: People always overestimate the speed of technological progress and underestimate the complexity of technology

Multi-active, multiple data centers: it all sounds exciting. But the reality is cruel.

Let me walk you through some history.


On August 12, 2015, a massive explosion struck Tianjin port. Tencent's Tianjin data center was only 1.5 km from the center of the explosion (see the figure below). At the time it was Tencent's largest cloud computing data center in Asia. It had gone into operation in 2010, covered about 80,000 square meters, and housed about 200,000 servers. Internally, Tencent called it the "Tianjin data backup center", and it carried part of the data needs of WeChat, instant messaging, and other services.


(The blast wave was enormous)


(Distribution of enterprises around the center of the explosion)


The huge shock wave left the data center with "the doors of the diesel generators and the walls twisted, very dangerous" and "the entire cooling system down, the chilled-water pipes burst, and serious flooding underground". The staff had to leave the site for their own safety. Had this data center been lost, much of the data stored there would have been gone, including your WeChat Moments data, because Tencent had no backup of it at the time (or rather, this site was itself the backup, and the backup was nearly blown up).


(The next day, August 13, employees went back into the data center wearing several layers of N95 masks)


After that, Tencent took data security very seriously. Ma Huateng was advised to build a big-data disaster recovery center in Guizhou, because Guizhou has many advantages: abundant water and electricity, cheap power, and many caves with constant temperature and humidity. So Tencent built a new, larger data center in Guizhou.


Alipay did not fare much better in 2015:

On May 27, 2015, municipal construction work cut the fiber-optic cable to Alipay's Hangzhou data center. Although Alipay's unit-based disaster recovery architecture had basically taken shape, many practical problems remained, and it took hours to complete the switchover and restore service. No data was corrupted, but for a company of this size, the public reaction to the service being unavailable was severe.


So the number 527 became a bitter reminder that Ant's engineers keep in front of them. May 27 was designated Ant's Technology Day, a prompt to stay alert, stay in awe of technology, and keep polishing it.


Later, at the 2018 Yunqi Conference, Ant demonstrated a "three places, five centers" multi-active "cable cut" live (see the figure below), reaching the top level of China's Internet industry.


(Operation interface of the live demonstration)


On May 9, 2017, Ele.me CTO Zhang Xuefeng announced that Ele.me's multi-active project (Multi-Active IDCs/Regions) had gone live, achieving the first whole-network switchover of the multi-active production environment (as a grayscale release). Zhang Xuefeng also said that, as far as he knew, among domestic transaction platforms averaging more than one million orders a day (outside peak or rapid-growth periods), only Alibaba had achieved whole-network multi-active in the true sense (not merely dual-active), and no other company had managed it by then.

In his view, the difficulty lay in both technology and execution. On the technology side, the key issue is strong consistency of real-time data, especially in real-time business scenarios such as food delivery. On the execution side, the biggest challenge is replacing the engine of an aircraft while it is flying at high speed (that is, during fast product iteration).


So in 2018 we independently implemented remote dual-active. Merchant traffic can be switched within seconds, at any time, by merchant, by city, by province, by data center, or by device, and merchants, users, and cashiers notice nothing. That is not too late.
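A minimal sketch of what switching traffic at those granularities could look like is given here. It is an illustration only, not the system described above: the class `TrafficRouter`, the precedence order (device, then merchant, then city, then province), and the data center names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical precedence: the most specific rule wins.
PRECEDENCE = ("device", "merchant", "city", "province")


@dataclass(frozen=True)
class Request:
    device: str
    merchant: str
    city: str
    province: str


class TrafficRouter:
    """Toy in-memory routing table; a real system would replicate it."""

    def __init__(self, default_dc: str) -> None:
        self.default_dc = default_dc
        self.overrides: Dict[Tuple[str, str], str] = {}
        self.redirects: Dict[str, str] = {}

    def switch(self, level: str, key: str, target_dc: str) -> None:
        """Send all traffic matching (level, key) to target_dc."""
        if level not in PRECEDENCE:
            raise ValueError(f"unknown level: {level}")
        self.overrides[(level, key)] = target_dc

    def evacuate(self, failed_dc: str, target_dc: str) -> None:
        """Data-center-level switch: everything bound for failed_dc moves."""
        self.redirects[failed_dc] = target_dc

    def route(self, req: Request) -> str:
        dc = self.default_dc
        for level in PRECEDENCE:
            hit = self.overrides.get((level, getattr(req, level)))
            if hit is not None:
                dc = hit
                break
        # Apply a single-hop data center redirect, if any.
        return self.redirects.get(dc, dc)


router = TrafficRouter(default_dc="dc-a")
req = Request(device="pos-1", merchant="m-1", city="hangzhou", province="zhejiang")
router.switch("province", "zhejiang", "dc-b")  # province-level switch
router.switch("merchant", "m-1", "dc-c")       # merchant rule is more specific
print(router.route(req))
```

The routing decision itself is the easy part. The hard part, as the CTO quoted earlier points out, is keeping real-time data consistent across sites while traffic moves.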


Two: Having a plan and daring to switch are two different things

Whenever someone mentions that the 2015 Tianjin port explosion endangered the WeChat data in a data center 1.5 km away, or that the cut fiber-optic cable at Alipay's Hangzhou data center on May 27, 2015 cost Alipay several hours to switch over and restore service, some people cheerfully reply that the banks did all this 20 years ago. I do not know what exactly they think was done. Multiple data centers? Multi-active?


The banking sector did build a fairly complete disaster recovery system long ago (the earliest around 2000). Note that this is "disaster recovery", not "multi-active". But because system drills were not carried out regularly, when a truly catastrophic event happens, nobody dares to switch systems lightly, for fear that after switching over they will not be able to switch back.


Such things happen again and again. In one major incident, a joint-stock commercial bank's systems went down and business was suspended for a full day; the disaster recovery center had not been completed, and the president did not dare to order the business systems switched over to the disaster recovery system. Ten years ago, even five years ago, business continuity management at China's financial institutions was still mainly about the technical side of IT system disaster recovery.


Here are three cases.


First: on February 3, 2010, from a little after 10 a.m. until 15:30, Minsheng Bank's entire banking system was paralyzed. The outage lasted more than four hours.

According to a post online at the time, the accident was caused by the database of an application system in the IT department (reportedly Informix, an old version with no regular maintenance service). A long-running task that should have been processed overnight had not finished by the time the bank opened. The system's CPU utilization already reached 70-80% under normal conditions. The long task ran from the night into the morning and could not be stopped, slowing an already overloaded business system to an intolerable level. Because the database version was EOS (End of Service), there was no tool support from the vendor's lab. In desperation, the relevant systems were restarted, and business stopped. So in a case like this, what use are multiple data centers? With the load that high, whichever site you switch to will fall over too.
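The arithmetic behind "whichever site you switch to will fall over" can be sketched as follows. The 75% business load is taken from the 70-80% figure above; the batch job's CPU demand, the saturation threshold, and the function names are hypothetical values chosen only to illustrate the point.

```python
def utilization_after_switch(business_cpu: float,
                             batch_cpu: float,
                             capacity_ratio: float = 1.0) -> float:
    """CPU utilization on the target site if the whole workload moves there.

    capacity_ratio is the target site's capacity relative to the primary
    (1.0 means an identical standby).
    """
    return (business_cpu + batch_cpu) / capacity_ratio


def switch_helps(business_cpu: float,
                 batch_cpu: float,
                 capacity_ratio: float = 1.0,
                 saturation: float = 0.9) -> bool:
    """True only if the target site would stay below the saturation level."""
    return utilization_after_switch(business_cpu, batch_cpu, capacity_ratio) < saturation


# Business load at 75% plus an assumed 35% for the overrunning batch job:
# an identical standby site would be just as overloaded as the primary.
print(switch_helps(0.75, 0.35))
```

Under these assumptions, switching only helps if the standby has substantially more capacity than the primary or if the batch job does not move with the traffic, neither of which a plain disaster recovery site provides.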


Second: on the morning of June 23, 2013, ICBC's counter, ATM, and online banking services failed for nearly one hour. ICBC was then a national financial giant serving 292 million personal customers and more than 4 million corporate customers, and the failure affected Beijing, Shanghai, Guangzhou, Wuhan, Harbin, and other large and medium-sized cities. It is referred to below as the "6·23 incident". On the day, ICBC's account of the accident was vague: "Due to a computer system upgrade in some regions, counter and electronic channel services are being processed slowly." To this day, that is ICBC's only public explanation of the 6·23 incident to its users.



As shown above, ICBC's Information Technology Department issued a formal internal notice on the 6·23 incident. The notice says that the host system failure at the ICBC Data Center (Shanghai) was caused by a defect in the memory cleanup mechanism of the DB2 V10 version supplied with the IBM host. So what about the multiple data centers?


The reason is simple: in 2013, ICBC did not yet have a one-click switchover capability:

2004: ICBC took the lead in establishing a remote disaster recovery architecture based on "two places, two centers".

2009: ICBC began research on a new "two places, three centers" data center architecture.

2014: The Shanghai Jiading data center officially went into operation, and ICBC became the first in the industry to successfully switch all business between data centers in the same city, marking the initial success of the "two places, three centers" project.

November 2014: For the first time, ICBC carried out a switchover of its core systems within the same city on short notice. The switchover used a "one-click" switching tool, and the switching time of the host core system was kept to the order of minutes.


See? Having disaster recovery and daring to switch are two different things. And compared with remote multi-active, or with Ant's 2018 "three places, five centers" "cable cut" demonstration, it is a different matter yet again.


Third: on the afternoon of April 12, 2011, the entire network of Nonghyup Bank (NH Bank), South Korea's largest bank, was paralyzed. The trouble went on for 3 days. Some services were restored around April 15, while others had still not recovered by April 18, so the bank had to fall back on traditional handwritten transaction slips.


According to a preliminary investigation by Nonghyup Bank staff, Korean prosecutors, and investigators from the Financial Supervisory Service and the central bank, between 4:30 and 5 p.m. on April 12, "someone" used an outsourced employee's laptop to issue the rm.dd command to 275 servers of the bank's core system. This command deletes all files on a server. The wiped servers included the ones used to restart the system. As a result, from about 5:30 that afternoon, service was interrupted at the bank's 1,154 branches nationwide.


rm.dd is a system command of the highest level. Only users with the highest security privilege, Super Root, are allowed to execute it, and only from specific IP ranges on the bank's intranet.
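The two conditions just described (a Super Root role and a source address inside a specific intranet range) can be sketched as a simple gate. This is an illustration under assumptions, not the bank's actual control: the network range `10.20.0.0/24`, the role string `super_root`, and the function `may_execute` are all hypothetical.

```python
import ipaddress

# Hypothetical intranet range from which destructive commands are allowed.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.20.0.0/24")]

# Commands treated as destructive in this sketch.
DESTRUCTIVE_COMMANDS = {"rm.dd"}


def may_execute(command: str, role: str, source_ip: str) -> bool:
    """Allow a destructive command only for Super Root from an allowed range."""
    if command not in DESTRUCTIVE_COMMANDS:
        return True  # ordinary commands are not gated in this sketch
    if role != "super_root":
        return False
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


print(may_execute("rm.dd", "super_root", "10.20.0.15"))   # inside the range
print(may_execute("rm.dd", "super_root", "203.0.113.7"))  # outside the range
```

A machine that sits inside the permitted range and holds Super Root credentials passes both checks, so a gate like this says nothing about who is actually at the keyboard.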


During the incident, the relay servers in Yangjae and the disaster recovery servers in Anseong all failed as well. In the end, service could only be restored by reinstalling the systems on 553 servers ……&¥%!


The owner of the laptop said that he did not issue the delete command. At the time of the incident, the laptop was sitting in the bank's office. According to that day's surveillance video, 20 people had access to the laptop, and one of those 20 did hold Super Root privileges.


However, it cannot be ruled out that a hacker connected to the laptop from the Internet and then used it as a springboard to issue commands to the servers, because the laptop had been connected to the Internet within the preceding 24 hours.

Did disaster recovery work in this incident? No. The disaster recovery systems were wiped out along with everything else.


