Revealing the network monitoring technology behind Double 11

2020-12-07 19:22:31 Aliyun yunqi

Brief introduction: This article focuses on best practices from the Alibaba network monitoring department's successful replacement of Druid with Hologres, which gave the real-time network monitoring dashboard millisecond-level response during Double 11.

Summary: In the just-concluded 2020 Tmall Double 11, MaxCompute Interactive Analytics (hereinafter Hologres) and Flink real-time computing were used together for the first time to run a cloud-native real-time data warehouse in core data scenarios, setting a new record for the big data platform. To mark the occasion, we are publishing a series of articles on the cloud-native real-time data warehouse in action during Double 11. This article focuses on best practices from the Alibaba network monitoring department's successful replacement of Druid with Hologres, which gave the real-time network monitoring dashboard millisecond-level response during Double 11.

00:00:00. Shopping cart, checkout, place order, pay.
00:01:00... Ding! Your Alipay spending: xxx ten thousand yuan.
Hundreds of millions of people took part at the same time, and orders peaked at a record 580,000 per second. For shoppers the whole process was silky smooth, so smooth that it hardly felt like Double 11, and none of this would have been possible without the strong support of Alibaba's network. As technology has advanced, and especially as cloud and e-commerce businesses have boomed in recent years, the underlying network has grown larger and more complex. Keeping this ever-expanding network stable and giving cloud users an unobstructed shopping experience is a serious test for the people who build and operate it.
In theory, failures are inevitable. But if faults can be detected, located, and fixed quickly, or even prevented, the outage time shrinks. The ultimate goal of stability is for users to feel little or nothing. In 2015 Microsoft proposed Pingmesh, which became the de facto industry solution, but because of some inherent limitations it takes too long to detect a fault. Since 2017 the Alibaba network R&D business unit has been developing AliPing, a detection system at the forefront of the field. AliPing's real-time pipeline brought Alibaba's fault detection to second-level response: at best, the delay from data collection through data processing to dashboard display is a matter of seconds, alarming plus fault localization takes minutes, and the system monitors the state of Alibaba's entire network 24/7.
The core architecture of AliPing is as follows:
[Figure: AliPing core architecture]

In the whole system, the monitoring dashboard is the core element of fault detection and is responsible for presenting network status in real time. Every rise and fall of a curve may mean that a user's business is being harmed. Displaying network status quickly and in real time, raising alarms, detecting network faults, and helping users stop the bleeding quickly is a major test for the monitoring team's dashboard. The dashboard used by monitoring staff faces several difficulties:
1) High data timeliness requirements: processed, structured real-time data (alarms and monitoring metrics) must be presented 24/7 to users (GOC, individuals, or monitoring staff) so that network faults at Alibaba and Ant can be discovered and handled in time.
2) Complex data sources: the network has many data sources and many business scenarios. There are hundreds of gigabytes of traffic monitoring data, and also IDC network data of only a few tens of kilobytes per minute. Bringing these different kinds of business data, with their very different volumes, into the monitoring system and detecting anomalies in them is also a test for the end-to-end monitoring dashboard.
3) Many metric dimensions: monitoring staff need to watch metrics across many dimensions, so the dashboard can be seen as a complex OLAP query system. Letting each user query the business data they need from the dashboard in real time, according to their own scenario, is a major challenge for the OLAP framework that processes the back-end data.

Technology selection

For the monitoring dashboard, the combinations of query conditions that users choose are unpredictable, so structured results cannot be computed in advance. The only option is OLAP (online analytical processing): analyze and combine the base data in real time and present the results to the user. The AliPing dashboard is in effect an embodiment of OLAP technology: fault data across different dimensions (data center, region, DSW, ASW, PSW, department, application, and so on) is presented to users as a dashboard.
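The idea of combining base data at query time can be sketched in a few lines of plain Python. This is only an illustration, not AliPing's implementation: the records, the field names (`room`, `region`, `department`, `loss`), and the `aggregate` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical probe records; the field names are illustrative only.
RECORDS = [
    {"room": "A", "region": "east", "department": "cloud", "loss": 0.0},
    {"room": "A", "region": "east", "department": "ecommerce", "loss": 0.2},
    {"room": "B", "region": "east", "department": "cloud", "loss": 0.1},
    {"room": "C", "region": "north", "department": "cloud", "loss": 0.4},
]


def aggregate(records, group_by, filters=None):
    """Average `loss` per group for an arbitrary choice of dimensions."""
    filters = filters or {}
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for rec in records:
        # Skip records that do not match every requested filter.
        if any(rec[col] != value for col, value in filters.items()):
            continue
        key = tuple(rec[dim] for dim in group_by)
        sums[key][0] += rec["loss"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# The caller picks the dimensions and filters at query time.
by_region = aggregate(RECORDS, ["region"])
cloud_by_room = aggregate(RECORDS, ["room"], {"department": "cloud"})
```

Because nothing is precomputed, any grouping or filter works, but every query scans the base data, which is why the speed of the OLAP engine matters so much.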

When we implemented the AliPing system in 2017, we compared a number of OLAP databases. The representative ones are compared below:
Hive: Built on HDFS storage, it breaks SQL statements down into MapReduce tasks to run queries. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL statements, without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse. But because the underlying layer is the HDFS distributed file system, the usual CUD operations (create, update, and delete on table records) are not possible. Hive also needs data to be synchronized from existing databases or logs into HDFS, and incremental real-time synchronization is currently very hard to achieve. Most important of all, queries are slow and cannot meet second-level monitoring requirements.
Kylin: Traditional OLAP is divided by storage approach into ROLAP (relational OLAP) and MOLAP (multidimensional OLAP). ROLAP stores the data for analysis in a relational model. Its advantages are a small storage footprint and flexible queries, but the drawback is obvious: every query has to aggregate the data. To mitigate this weakness, ROLAP uses column storage, parallel queries, query optimization, and bitmap indexes. Kylin's data cube trades space for time: a set of dimensions is defined, and every combination of those dimensions is precomputed and stored. With N dimensions there are 2^N combinations, so the number of dimensions is best kept under control, because storage grows explosively with the number of dimensions, with disastrous consequences. For our huge volume of network data and unpredictable dimension combinations, this is not acceptable.
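To make the 2^N growth concrete, the short sketch below enumerates every dimension combination a full data cube would have to precompute. The dimension names are taken from the list given earlier in this article; the `cuboids` helper is a hypothetical illustration and has nothing to do with Kylin's actual code.

```python
from itertools import combinations


def cuboids(dimensions):
    """Every subset of dimensions that a full data cube would precompute."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


dims = ["room", "region", "DSW", "ASW", "PSW", "department", "application"]
print(len(cuboids(dims)))  # 2 ** 7 = 128
```

Each additional dimension doubles the count, and every combination also stores one row per distinct value tuple, so storage grows much faster than the number of combinations alone suggests.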

ClickHouse: Developed by the Russian company Yandex and designed specifically for online data analysis. According to the official documentation, ClickHouse processes records on the order of billions per day (we never tested this). It uses column storage and data compression, supports sharding and indexes, and splits a computing task across partitions for parallel execution, merging the results afterward. It supports SQL, though not well enough, and it supports real-time updates and automatic multi-replica synchronization. On the whole ClickHouse was quite good, but it was not mature enough, official support was limited, and there were many bugs. Most importantly, nobody in the group was using it, so we had to give it up.

Druid: A data storage system that provides sub-second queries over both historical and real-time data. Druid supports low-latency data ingestion, flexible data exploration and analysis, high-performance aggregation, and simple horizontal scaling. It suits analytical query systems with large data volumes and high scalability requirements. It keeps hot and real-time data in the memory of real-time nodes and historical data in historical nodes. This real-time plus near-real-time structure keeps queries at roughly millisecond level. Fast ingestion and fast queries matched our needs exactly, and the group's general computing engine team offered strong support, so in the early days we chose Druid as the OLAP system behind our monitoring dashboard.

The new OLAP network monitoring system

As the business became more complex and continued to grow, a series of problems with Druid surfaced in use:
1) Ingestion bottleneck: with the group's move to the cloud and the introduction of traffic data, our data volume became huge, and data writes suffered several major failures.
2) Because of business complexity we needed to add dimension data, and the process for adding it in Druid is relatively complicated.
3) Druid is not user-friendly: it has its own query language and its SQL support is poor, so a lot of time was wasted learning it.
4) It does not support high concurrency, which is a disaster for a dashboard. During two Double 11 events we could only kick users offline to keep the monitoring dashboard available.

As more and more problems surfaced, we began looking for a product that could replace Druid, solve the current problems, and meet our needs for real-time OLAP multidimensional analysis.
We learned about Hologres from the best practices of other departments in the group, and learned that it supports high-concurrency point queries in row-store mode and real-time OLAP multidimensional analysis in column-store mode. This seemed a very good fit for our network monitoring system, so we first ran a trial. After full-link testing and validation against a large amount of scenario data, Hologres met our needs, and we decided to put it into production.

The new OLAP monitoring system after the migration is shown in the figure below. The overall data flow is as follows:

  • Network monitoring metrics are collected in real time into Kafka and then written to Flink for lightweight aggregation.
  • Flink writes the pre-processed, base-granularity real-time data into Hologres, which provides unified storage.
  • Hologres serves the monitoring screen directly in real time. The screen shows changes in each monitoring metric as they happen and raises real-time alarms for data that does not meet expectations, so the responsible staff can investigate and resolve the problem immediately.
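The lightweight aggregation step in this flow can be illustrated with a small stand-in written in plain Python. It is only a sketch of the idea (grouping raw events into fixed time windows at a base granularity) and does not use the Kafka or Flink APIs. The event fields (`ts`, `room`, `ok`), the 10-second window, and the `light_aggregate` helper are all assumptions made for the example.

```python
from collections import defaultdict


def light_aggregate(events, window_seconds=10):
    """Group raw probe events into fixed windows per (window_start, room)."""
    buckets = defaultdict(lambda: {"probes": 0, "lost": 0})
    for ev in events:
        # Align the event timestamp to the start of its window.
        window_start = ev["ts"] - ev["ts"] % window_seconds
        bucket = buckets[(window_start, ev["room"])]
        bucket["probes"] += 1
        bucket["lost"] += 0 if ev["ok"] else 1
    # Emit one base-granularity row per window and room, ready to be stored.
    return [
        {"window_start": start, "room": room, **counts}
        for (start, room), counts in sorted(buckets.items())
    ]


events = [
    {"ts": 100, "room": "A", "ok": True},
    {"ts": 105, "room": "A", "ok": False},
    {"ts": 111, "room": "A", "ok": True},
    {"ts": 103, "room": "B", "ok": False},
]
rows = light_aggregate(events)
```

Keeping the stream-side aggregation this light leaves the heavier multidimensional grouping to the storage and query layer, where users choose the dimensions at query time.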

Business value

This year was also the first year that Hologres took part in AIS network fault monitoring for Double 11, and as a newcomer it delivered a satisfying result. Overall, the value to the business is mainly as follows:

1) Millisecond response on terabyte-scale data
For real-time monitoring, time is the lifeline: the sooner a fault is found, the sooner the bleeding stops. Filtering terabyte-scale data by whatever complex combination of conditions a user enters, and returning the matching results (OLAP) in seconds or even milliseconds, is a big challenge for many systems. In practice, with sensible use of Hologres indexes and reasonable resource allocation, Hologres met the real-time OLAP needs of the monitoring business very well.

2) High concurrency support
During Double 11 the monitoring screen frequently needs to query historical data and make alarm predictions based on it. The previous system could support queries from only a few dozen users at most (over tens of days of data), whereas Hologres supports large-scale parallel queries from hundreds of users without reaching its limit. At midnight of this year's Double 11, facing data volumes hundreds of times the usual level, the monitoring curves stayed as smooth as ever, with no sign of lag.

3) High write performance
At write rates of hundreds of thousands or even millions of records per second, Druid did not perform well and writes tended to back up, whereas Hologres handled them easily. This resolved our real-time write bottleneck.

4) Low learning cost
Hologres is compatible with Postgres and offers full SQL support, which is very convenient for new users: there is no need to spend extra time and energy learning a new syntax. Hologres is also well compatible with BI tools, so the monitoring screen could be connected without modification, saving a lot of time.
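As a rough illustration of what this compatibility means in practice, a dashboard back end could turn a user's dimension choices into an ordinary parameterized SQL query. The sketch below only builds the query text. The table name `network_probe_metrics`, the column names, and the `build_query` helper are hypothetical, and the `%s` placeholders assume a PostgreSQL driver that uses that style (psycopg2, for example).

```python
# Hypothetical schema: these names are illustrative, not from the article.
TABLE = "network_probe_metrics"
ALLOWED_COLUMNS = {"room", "region", "department", "application", "loss_rate"}


def build_query(metric, group_by, filters):
    """Return (sql, params) for an ad-hoc average over chosen dimensions."""
    if not group_by:
        raise ValueError("at least one group-by column is required")
    # Column names cannot be bound as parameters, so check them against
    # an allow-list; only the filter values travel as bound parameters.
    for name in [metric, *group_by, *filters]:
        if name not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {name}")
    columns = ", ".join(group_by)
    sql = f"SELECT {columns}, avg({metric}) AS avg_{metric} FROM {TABLE}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = %s" for col in filters)
    sql += f" GROUP BY {columns}"
    return sql, list(filters.values())


sql, params = build_query(
    "loss_rate", ["room"], {"region": "east", "department": "cloud"}
)
```

The resulting pair can be handed to a driver's `cursor.execute(sql, params)`; no product-specific query language is involved, which is the point the section makes.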

For every Tmall Double 11 shopper, a smooth online shopping experience depends on Alibaba's network at every moment, and the monitoring dashboard is the eye on the state of that network. Hologres, as a core link of the dashboard, continues to empower it. That said, as a newcomer Hologres still has some immature areas, and there is room for improvement in transparent upgrades, stability, and similar aspects. We are willing to grow together with Hologres and look forward to an even better performance from it at next year's Double 11.

About the author: Tang Tang, of the network R&D business unit, currently works on network stability research and development. A former graduate advisor at Beijing University of Posts and Telecommunications, with several patents related to networking and algorithms.



This article is original content from Alibaba Cloud and may not be reproduced without permission.
