Choosing a monitoring system: this article is a must-read!

2020-11-10 10:44:17 Career advancement of it people

I have written several articles on 「online troubleshooting」 that included some monitoring charts. Some readers were interested in these and asked whether I had any advice on choosing a monitoring system.

At every company I have worked for so far, the monitoring system was developed in-house. In fact, the industry offers many excellent open source products that cover most monitoring requirements. If one of them meets your company's current needs, adopting it is clearly the option that saves the most time and effort.

In this article I give a systematic overview of the basics, principles, and architecture of monitoring systems, and introduce some of the most widely used open source monitoring products for reference when you make your selection. The content has 3 parts:

  • Essential basics of monitoring
  • Introduction to mainstream monitoring systems
  • Suggestions for choosing a monitoring system

01 Essential basics of monitoring

A monitoring system is often called 「the third eye」, and it is something we deal with almost every day. I think the following 4 pieces of basic knowledge are essential.

1. The 7 functions of a monitoring system

As the saying goes, 「no monitoring, no operations」; the importance of the monitoring system is self-evident. Whether you develop a monitoring system or use one, first be clear about two things: what is the goal of the monitoring system, and what can it do?

  • Collect monitoring data in real time: including data on hardware, operating systems, middleware, applications, and other dimensions.

  • Report monitoring status in real time: through multi-dimensional statistics and visualization of the collected data, show in real time whether the monitored object is in a normal or abnormal state.

  • Predict faults and raise alarms: anticipate the risk of failure and send alarm notifications in time.

  • Help locate faults: provide the metric data from the time of the fault to support analysis and localization.

  • Help with performance tuning: provide data to support tuning, such as slow SQL queries and interface response times.

  • Help with capacity planning: provide data to support capacity planning for servers, middleware, and application clusters.

  • Automate operations: provide data to support intelligent operations such as automatic scaling or service degradation based on the configured SLA.

2. The right way to use a monitoring system

For any online incident, whatever else may have gone wrong, the monitoring must have had a problem.

At first this sounds like blame-shifting, but on reflection it has a point. When we review an incident, we usually ask these 3 monitoring questions: Was there any monitoring? Was the monitoring timely? Did the monitoring information help locate the problem quickly?

Clearly, just having a good monitoring system is not enough; you also have to know 「how to use it well」. A mature R&D team usually defines a monitoring specification so that everyone uses the monitoring system in a consistent way.

  • Understand how the monitored object works: have a basic understanding of the monitored object and its working principles. For example, to monitor the JVM, you must be clear about the JVM's heap memory structure and garbage collection mechanism.
  • Determine the monitoring metrics: be clear about which metrics characterize the state of the monitored object. For example, to monitor an interface, you can use request count, latency, timeout count, exception count, and similar metrics.
  • Define reasonable alarm thresholds and levels: at what threshold should an alarm fire, and what fault level does it correspond to? An alarm that needs no action is not a good alarm, which shows how important reasonable thresholds are. Otherwise alarms only reduce operations efficiency or render the monitoring system useless.
  • Establish a complete fault-handling process: once a fault alarm is received, there must be a corresponding handling process and on-call mechanism so that the fault is followed up in time.
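
The point about thresholds and levels can be made concrete with a small sketch. The Python below is only an illustration: the `AlertRule` class, the metric name `order_api.timeout_count`, and the threshold values are all hypothetical and would have to be tuned against real traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """Thresholds for one metric; names and numbers are illustrative only."""
    metric: str
    warning: float   # at or above this value -> "warning"
    critical: float  # at or above this value -> "critical"


def evaluate(rule: AlertRule, value: float) -> Optional[str]:
    """Return the alert level for a sample, or None when no alert is needed."""
    if value >= rule.critical:
        return "critical"
    if value >= rule.warning:
        return "warning"
    return None


# Hypothetical rule: timeouts per minute on one interface.
timeout_rule = AlertRule(metric="order_api.timeout_count", warning=10, critical=50)

for sample in (3, 12, 80):
    print(sample, evaluate(timeout_rule, sample))
# prints:
# 3 None
# 12 warning
# 80 critical
```

Returning `None` for values below the warning threshold keeps quiet samples from producing alarms that nobody needs to handle.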

3. What are the monitoring objects and metrics?

Monitoring has become a very important part of the whole product life cycle: operations focuses on hardware and basic monitoring, R&D focuses on middleware and application-layer monitoring, and product focuses on core business metrics. As you can see, monitoring now covers more and more layers.

Here I have classified the commonly used monitoring objects and metrics for your reference.

3.1 Hardware monitoring

Includes: power status, CPU status, machine temperature, fan status, physical disks, RAID status, memory status, network card status

3.2 Server basic monitoring

  • CPU: per-CPU and overall utilization
  • Memory: used memory, available memory
  • Disk: disk usage, disk read/write throughput
  • Network: outbound traffic, inbound traffic, TCP connection states

3.3 Database monitoring

Includes: number of database connections, QPS, TPS, number of sessions processed in parallel, cache hit rate, master-slave replication lag, lock status, slow queries

3.4 Middleware monitoring

  • Nginx: active connections, waiting connections, dropped connections, request count, latency, 5XX error rate
  • Tomcat: maximum threads, current threads, request count, latency, error count, heap memory usage, GC count and GC time
  • Cache: successful connections, blocked connections, used memory, memory fragmentation ratio, request count, latency, cache hit rate
  • Message queue: connections, number of queues, production rate, consumption rate, message backlog

3.5 Application monitoring

  • HTTP interface: URL availability, request count, latency, exception count
  • RPC interface: request count, latency, timeout count, rejection count
  • JVM: GC count, GC time, size of each memory region, current thread count, deadlocked thread count
  • Thread pool: active threads, task queue size, task execution time, rejected tasks
  • Connection pool: total connections, active connections
  • Log monitoring: access logs, error logs
  • Business metrics: depends on the business, for example PV, order volume, and so on

4. The basic workflow of a monitoring system

Whether a monitoring system is open source or developed in-house, the overall workflow is much the same and generally includes the following modules:

  • Data collection: there are many collection methods, including log instrumentation (reported and parsed through Logstash or Filebeat), monitoring metrics exposed through the standard JMX interface, REST APIs provided by the monitored object (for example Hadoop and ES), system command-line tools, and intrusive instrumentation and reporting through a unified SDK.
  • Data transmission: the collected data is reported to the monitoring system over TCP, UDP, or HTTP, in either an active Push mode or a passive Pull mode.
  • Data storage: some systems store data in an RDBMS such as MySQL or Oracle, some use time series stores such as RRDTool, OpenTSDB, or InfluxDB, and some use HBase.
  • Data display: graphical display of the metrics.
  • Monitoring alarms: flexible alarm settings, with support for notification by email, SMS, IM, and other channels.
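
As an illustration of the intrusive SDK style of collection mentioned above, here is a minimal Python sketch. It is not the API of any real monitoring SDK: the `monitored` decorator, the `stats` dictionary, and the `divide` function are hypothetical names, and the counters stay in memory instead of being reported over the network.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for the reporting step; a real SDK would ship these
# counters to the monitoring system over TCP, UDP, or HTTP.
stats = defaultdict(lambda: {"requests": 0, "errors": 0, "total_seconds": 0.0})


def monitored(name):
    """Record request count, error count, and elapsed time for a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            entry = stats[name]
            entry["requests"] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                entry["errors"] += 1
                raise  # the caller still sees the original exception
            finally:
                entry["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@monitored("divide")
def divide(a, b):
    return a / b


divide(6, 3)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass

print(stats["divide"]["requests"], stats["divide"]["errors"])  # prints: 2 1
```

The decorator touches the business code only at the function definition, which is why this style is called intrusive: the application has to be changed and redeployed to add a metric.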

02 Introduction to mainstream monitoring systems

Now let's look at the mainstream open source monitoring systems. Due to limited space, I have chosen the 3 most widely used ones: Zabbix, Open-Falcon, and Prometheus. I will introduce the architecture of each and summarize their advantages and disadvantages.

1. Zabbix (an outstanding representative of veteran monitoring systems)

Zabbix was born in 1998. Its core components are written in C and its web front end in PHP. It is an outstanding representative of the veteran monitoring systems, with very comprehensive monitoring functions and very wide adoption: reportedly, nearly 70% of Internet companies have used Zabbix as their monitoring solution.

Let's look at the Zabbix architecture:

  • Zabbix Server: the core component, written in C. It receives monitoring data sent by Agents and Proxies, and can also collect data directly over protocols such as JMX and SNMP. It is also responsible for data storage and alarm triggering.

  • Zabbix Proxy: an optional component. When there are many monitored machines, Proxies can be used for distributed monitoring; a Proxy collects part of the monitoring data on behalf of the Server to reduce the load on the Server.

  • Zabbix Agentd: deployed on the monitored host to collect local data and send it to a Proxy or the Server. Its plug-in mechanism supports user-defined data collection scripts. Agents can be configured manually on the Server side or identified through the auto-discovery mechanism. Data collection supports both active Push and passive Pull modes.

  • Database: stores configuration information and collected data, and supports relational databases such as MySQL and Oracle. Recent versions of Zabbix have also started to support time series databases, but that support is not yet mature.

  • Web Server: the Zabbix GUI component, written in PHP, which provides monitoring data display and alarm configuration.

Here are the advantages of Zabbix:

  • Mature product: because it has been around for so long and is so widely used, it has rich documentation and a wide range of open source data collection plug-ins, covering most monitoring scenarios.

  • Many collection methods: supports Agent, SNMP, JMX, SSH, and other collection methods, as well as both active and passive data transmission.

  • Strong scalability: supports distributed monitoring through Proxies, has agent auto-discovery, and its plug-in architecture supports user-defined data collection scripts.

  • Convenient configuration management: monitoring and alarms can be configured through the web interface, which is simple to operate and easy to pick up.

Here are the disadvantages of Zabbix:

  • Performance bottleneck: once the number of machines or the business volume grows large, writes to the relational database inevitably become a bottleneck. The officially stated upper limit for a single server is 5,000 hosts, but my personal feeling is that it cannot reach that, especially now that there are more and more application-layer metrics. Although recent versions have started to support time series databases, that support is not yet mature.

  • Limited support for application-layer monitoring: if you want intrusive instrumentation and collection inside an application (for example, monitoring a thread pool or interface performance), Zabbix has no corresponding SDK. This can be achieved through plug-in scripts, but my personal feeling is that Zabbix is not good at it.

  • Weak data model: it does not support tags, so you cannot aggregate statistics or configure alarms across multiple dimensions, which makes it inflexible to use.

  • Secondary development is difficult: Zabbix is written in C, and secondary development often requires familiarity with its database table structure. What you can do with the API it provides is mostly limited to a custom display layer.

2. Open-Falcon

Open-Falcon is an enterprise-grade monitoring tool that Xiaomi open-sourced in 2015, developed in Go and Python. It is a flexible, high-performance, and easily extensible new-generation monitoring solution. More than 200 companies, including Xiaomi, Meituan, and Didi, are currently using it.

Xiaomi also used Zabbix for monitoring in its early days, but once the number of machines and the business volume grew, Zabbix could no longer cope. Xiaomi therefore developed Open-Falcon in-house. Its architecture draws on the experience of Zabbix while resolving many of Zabbix's pain points.

Let's look at the Open-Falcon architecture:

  • Falcon-agent: the data collector and reporter, written in Go and deployed on the monitored machine. It supports 3 data collection methods. First, it automatically collects more than 200 basic monitoring metrics for a single machine with no configuration required. Second, it supports user-defined plugins for obtaining monitoring data. Third, users can push their own data through an HTTP interface to the local proxy-gateway, which forwards it to the server.

  • Transfer: the data distribution component. It receives data sent by clients and sends it to both the data storage component Graph and the alarm decision component Judge. Graph and Judge both use consistent hashing to shard the data, which improves horizontal scalability. Transfer can also send data to OpenTSDB for historical archiving.

  • Graph: the data storage component. It uses RRDTool (a time series database) underneath to store each metric separately, with optimizations such as caching and batched disk writes. A single Graph instance is said to handle a write rate of more than 80,000 per second.

  • Judge and Alarm: the alarm components. Judge evaluates the data forwarded by Transfer in real time and decides whether to generate an alarm event; Alarm consolidates the alarm events and then pushes alarm messages to each message channel.

  • API: faces end users. When it receives a query request, it queries the metric data in Graph, aggregates the results, and returns them to the user, hiding the sharding details of the storage cluster.
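
The consistent hashing that Transfer is described as using can be sketched in a few lines of Python. This is a generic textbook version with virtual nodes, not Open-Falcon's actual implementation; the node names `graph-1` to `graph-3`, the key format, and the choice of SHA-256 are assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        points = []
        for node in nodes:
            # Each node gets `replicas` points on the ring to even out load.
            for i in range(replicas):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning the first ring point clockwise of the key."""
        idx = bisect.bisect_right(self._hashes, self._hash(key))
        return self._nodes[idx % len(self._nodes)]  # wrap around the ring


# Hypothetical storage instances and metric key.
ring = ConsistentHashRing(["graph-1", "graph-2", "graph-3"])
print(ring.node_for("host01/cpu.idle"))
```

The property that matters for scaling out is that adding or removing one instance only moves the keys that instance owns; keys on the other instances stay where they are.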

Here are the advantages of Open-Falcon:

  • Automatic collection: Falcon-agent automatically collects more than 200 basic server metrics (such as CPU and memory) without any configuration on the server, which beats Zabbix hands down.

  • Powerful storage: it uses RRDTool underneath and shards data with consistent hashing to build a distributed time series storage system that scales well.

  • Flexible data model: borrowing from OpenTSDB, the data model introduces tags, which support multi-dimensional aggregation and alarm rule settings and greatly improve efficiency.

  • Unified plug-in management: Open-Falcon's plug-in mechanism manages user-defined scripts centrally; HeartBeat Server distributes them to agents, reducing the cost of users maintaining scripts themselves.

  • Support for custom monitoring: with Proxy-gateway, it is easy to implement application-layer monitoring through your own instrumentation (for example, monitoring an interface's request volume and latency) and other custom monitoring needs, and it is easy to integrate.
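
To show why tags make multi-dimensional aggregation easy, here is a small Python sketch. It is a simplified stand-in rather than Open-Falcon's real storage or query code, and the metric name, tag keys, and values are invented for the example.

```python
from collections import defaultdict

# Each sample carries a metric name, a value, and free-form tags.
# Metric names, tags, and numbers below are made up for illustration.
samples = [
    {"metric": "http.latency_ms", "value": 120.0, "tags": {"api": "/order", "idc": "bj"}},
    {"metric": "http.latency_ms", "value": 80.0, "tags": {"api": "/order", "idc": "sh"}},
    {"metric": "http.latency_ms", "value": 40.0, "tags": {"api": "/user", "idc": "bj"}},
]


def average_by(samples, metric, tag):
    """Average one metric, grouped by the value of a single tag."""
    groups = defaultdict(list)
    for s in samples:
        if s["metric"] == metric and tag in s["tags"]:
            groups[s["tags"][tag]].append(s["value"])
    return {k: sum(v) / len(v) for k, v in groups.items()}


print(average_by(samples, "http.latency_ms", "api"))
# prints: {'/order': 100.0, '/user': 40.0}
print(average_by(samples, "http.latency_ms", "idc"))
# prints: {'bj': 80.0, 'sh': 80.0}
```

The same three samples answer two different questions (latency per interface, latency per data center) without any change to how the data was reported, which is the flexibility a model without tags lacks.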

Here are the disadvantages of Open-Falcon:

  • Lukewarm overall development: community activity is low and releases are slow. Some large companies do secondary development directly on top of a stable version. I am a little worried about its future.

  • UI is not friendly enough: line-of-business developers may just want to configure alarms and monitor their services easily, but concepts such as machine groups, strategy templates, and template inheritance are all exposed in the UI. It feels as though the UI was designed around these concepts, which makes it a little hard to understand.

  • Installation is fairly complicated: in my experience, because it came out of Xiaomi's internal systems, it still has many components even though the dependencies on Xiaomi's internal systems have been removed. If you are not familiar with the overall architecture, it is hard to install in one go.

3. Prometheus (billed as the next-generation monitoring system)

Prometheus is a monitoring system developed by former Google employees and officially released as open source in 2015, written in Go. Beyond its cool name, it has strong backing from Google and k8s, and its open source community is very active.

Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project after k8s, and its prospects look very promising. Its biggest difference from Open-Falcon is that data collection is based on the Pull model rather than the Push model, and its architecture is very simple.

Let's look at the Prometheus architecture:

  • Prometheus Server: the core component, used to collect and store monitoring data. It manages monitoring targets through static configuration or dynamic Service Discovery, and pulls data from those targets. Prometheus Server is also a time series database: it stores monitoring data on local disk and provides its own PromQL language for querying and analyzing the data.

  • Exporter: collects data, similar in function to an agent. The difference is that Prometheus pulls the collected data, so an Exporter exposes monitoring data to Prometheus Server as an HTTP service. Many ready-made Exporters can be used directly, and users can also write their own with the client libraries available for various languages.

  • Push gateway: mainly used for short-lived jobs, which might finish before Prometheus Server comes to pull their data. Such jobs can actively push their monitoring data to the Push gateway, which caches it until it is pulled.

  • Alert Manager: when an alarm is triggered, Prometheus Server pushes the alarm information to Alert Manager, which sends it on to the receivers.

  • Web UI: Prometheus has a simple built-in web console for querying configuration information, metrics, and so on. In practice, Prometheus is usually used as a Grafana data source for building dashboards and viewing metrics.
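
The Pull model is easier to picture with a toy Exporter. The Python sketch below uses only the standard library to render one gauge in the Prometheus text exposition format and serve it at `/metrics`. It is a simplification under stated assumptions: the metric name `threadpool_active_threads`, the `pool` label, the value 7, and port 8000 are invented, label-value escaping is omitted, and a real exporter would normally be built on an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(name, help_text, samples):
    """Render gauge samples in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    not escaped here, which a production exporter would have to do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = "{" + label_str + "}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer scrapes on /metrics with a hard-coded, hypothetical gauge."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(
            "threadpool_active_threads",
            "Active threads in a hypothetical worker pool.",
            [({"pool": "order"}, 7)],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving scrapes; the port number is arbitrary."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics("threadpool_active_threads", "Demo gauge.", [({"pool": "order"}, 7)]))
```

Calling `serve()` starts the endpoint; the server side then only needs the address of this process as a scrape target, which is also why every monitored endpoint has to be reachable from Prometheus Server.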

Here are the advantages of Prometheus:

  • Lightweight management: simple architecture, no dependence on external storage. A single server node works on its own and starts from a single binary, making it a lightweight server that is easy to migrate and maintain.

  • Strong processing capacity: monitoring data is stored directly in the local time series database of Prometheus Server, and a single instance can handle millions of metrics.

  • Flexible data model: like Open-Falcon, it introduces tags and has a multi-dimensional data model, which makes aggregate statistics more convenient.

  • Powerful query language: PromQL allows operations such as addition, joins, and quantiles across multiple metrics in the same query.

  • Good support for cloud environments: it can discover containers automatically, and projects such as k8s and etcd provide native support for Prometheus. It is currently the most popular container monitoring solution.

Here are the disadvantages of Prometheus:

  • Incomplete functionality: Prometheus was designed from the start to be simple, so it does not provide clustering, long-term persistent storage, or user management, all of which become must-haves as a company grows. For now these can only be added by extending Prometheus.

  • More complex network planning: because Prometheus pulls data, every monitored endpoint must be reachable from the server, so the network security configuration needs to be planned carefully.

03 Suggestions for choosing a monitoring system

After the introduction above, you should have some understanding of the mainstream monitoring systems. When it comes to selection, my advice is:

1. First be clear about your monitoring needs: what objects need to be monitored? How many machines and monitoring metrics are there? What alarm functions are needed?

2. Monitoring is a long-term undertaking. I do not think it is necessary to build an All In One monitoring solution from the very beginning. From a cost perspective, you can start with an open source monitoring solution and solve the immediate problem first.

3. In terms of maturity, Zabbix is a veteran monitoring system with plenty of documentation, and its functionality is comprehensive and stable. If you have no more than a few hundred machines, you need not worry much about performance. In addition, database partitioning, SSDs, the Proxy architecture, and the Push collection mode can all improve monitoring performance.

4. Zabbix has an absolute advantage in server monitoring and can cover more than 90% of monitoring scenarios, but it is not good at application-layer monitoring, such as monitoring the state of a thread pool or the execution time of an internal interface, which usually requires intrusive instrumentation. By contrast, the new-generation systems Open-Falcon and Prometheus do this well.

5. In terms of overall capability, the new generation of monitoring systems also has clear advantages, such as a flexible data model, a more mature time series database, and powerful alarm functions. If you have no prior experience with traditional monitoring like Zabbix, I recommend Open-Falcon or Prometheus.

6. The core advantage of Open-Falcon is data sharding, which lets it support more machines and monitoring items. Prometheus is the standard choice for container monitoring and has the backing of Google and k8s.

7. Zabbix, Open-Falcon, and Prometheus all integrate quickly with Grafana. If you want an attractive and powerful visualization experience, combine them with Grafana.

8. Use the appropriate monitoring system for each problem. Several monitoring systems can be used side by side, which is very common in a company's early days.

9. In the middle and later stages, as the number of machines and the custom requirements grow (for example, wanting a unified monitoring platform, or integration with the company's CMDB and organizational structure), you often need secondary development or integration through the API that the monitoring system provides. From this point of view, Open-Falcon or Prometheus is more suitable.

10. If you do have to build your own, study the architecture of the mainstream monitoring systems and draw on their strengths.

In closing

This article has covered the basics, principles, and mainstream architectures of monitoring systems in detail. I hope it helps you understand monitoring systems and make a better choice in technology selection.

Because of limited space, this article does not cover full-link monitoring, log monitoring, or web front-end and client monitoring. Monitoring really is a large and complex field; to understand it thoroughly, you need to combine theory with practice and dig deeper.

If you have your own insights or experience with operations monitoring systems, feel free to leave a comment.

About the author: master's degree from a 985 university, former Amazon engineer, currently technical director at 58 Zhuanzhuan

You are welcome to follow my official account: IT People's career advancement

Copyright notice
This article was written by [Career advancement of it people]. Please include a link to the original when reposting. Thank you.