I have previously written several articles on online troubleshooting, some of which included monitoring charts. A few readers were interested in them and asked whether I had any advice on choosing a monitoring system.
The companies I have worked at so far all built their monitoring systems in-house. In fact, the industry offers many excellent open source products that can meet most monitoring requirements. If one of them satisfies your company's current needs, adopting it is obviously the option that saves the most time and effort.
In this article, I will systematically go through the basics, principles, and architecture of monitoring systems, and introduce some of the most commonly used open source monitoring products for reference when making a selection. The content is divided into 3 parts:
- Essential basics of monitoring
- Introduction to mainstream monitoring systems
- Suggestions for selecting a monitoring system
01 Essential basics of monitoring
A monitoring system is often called "the third eye", and it is a system we deal with almost every day. I think the following 4 pieces of basic knowledge are worth understanding.
1. The 7 major functions of a monitoring system
As the saying goes, "no monitoring, no operations", which shows how important the monitoring system is. Whether you develop a monitoring system or use one, you first need to be clear about two things: what is the goal of the monitoring system, and what can it do?
Real-time collection of monitoring data: covers data across dimensions such as hardware, operating system, middleware, and applications.
Real-time feedback on monitoring status: through multi-dimensional statistics and visualization of the collected data, it shows in real time whether the monitored object is in a normal or abnormal state.
Fault prediction and alerting: predicts the risk of failure in advance and sends out alarms in time.
Assisting fault location: provides the metric data from the time of the fault to help with analysis and location.
Assisting performance tuning: provides data to support performance tuning, such as slow SQL and interface response times.
Assisting capacity planning: provides data to support capacity planning for servers, middleware, and application clusters.
Operations automation: provides data to support intelligent operations such as automatic scaling or service degradation based on a configured SLA.
2. The right way to use a monitoring system
"In any online incident, whatever else went wrong, something must have been wrong with the monitoring."
At first this sounds like shifting the blame, but on reflection it has a point. When reviewing an incident, we usually ask these 3 questions about monitoring: Was there any monitoring? Was the monitoring timely? Did the monitoring information help locate the problem quickly?
So simply having a good monitoring system is not enough. You also need to know how to use it well. A mature R&D team usually defines a monitoring specification so that everyone uses the monitoring system in a consistent way.
- Understand how the monitored object works: you need a basic understanding of the monitored object and its working principles. For example, to monitor the JVM, you have to be clear about the JVM's heap memory structure and garbage collection mechanism.
- Determine the metrics to monitor: be clear about which metrics characterize the state of the monitored object. For example, to monitor an interface, you can use metrics such as request count, latency, timeout count, and exception count.
- Define reasonable alarm thresholds and levels: at what threshold should an alarm fire, and what fault level does it correspond to? An alarm that does not need to be handled is not a good alarm, so defining reasonable thresholds matters a great deal. Otherwise alarms only reduce operations efficiency or render the monitoring system useless.
- Establish a complete fault handling process: once a fault alarm is received, there must be a corresponding handling process and on-call mechanism so that the fault is followed up in time.
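To make the idea of thresholds and levels more concrete, here is a minimal Python sketch. The metric (an interface's timeout rate), the threshold values, and the level names P0/P1/P2 are all hypothetical and chosen only for illustration. Real values have to come from your own service's behavior.

```python
from typing import Optional

# Hypothetical thresholds for an interface's timeout rate, highest first.
# The values and level names are illustrative assumptions, not recommendations.
THRESHOLDS = [
    (0.20, "P0"),  # 20% or more of requests time out: critical
    (0.05, "P1"),  # 5% or more: serious
    (0.01, "P2"),  # 1% or more: warning
]


def alarm_level(timeout_count: int, request_count: int) -> Optional[str]:
    """Return the alarm level for the observed timeout rate, or None if no alarm is needed."""
    if request_count == 0:
        return None  # no traffic, nothing to judge
    rate = timeout_count / request_count
    for threshold, level in THRESHOLDS:
        if rate >= threshold:
            return level
    return None
```

Anything below the lowest threshold returns `None`, which reflects the point above: a reading that nobody needs to act on should not become an alarm.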
3. What are the monitoring objects and metrics?
Monitoring has become an important part of the entire product life cycle: operations focuses on hardware and basic monitoring, R&D focuses on middleware and application-layer monitoring, and product teams focus on core business metrics. As you can see, the objects of monitoring have become increasingly multi-layered.
Here I have classified and organized the commonly used monitoring objects and metrics for your reference.
3.1 Hardware monitoring
Includes: power status, CPU status, machine temperature, fan status, physical disks, RAID status, memory status, network card status
3.2 Basic server monitoring
- CPU: per-CPU and overall usage
- Memory: used memory, available memory
- Disk: disk usage, disk read/write throughput
- Network: outbound traffic, inbound traffic, TCP connection status
3.3 Database monitoring
Includes: number of database connections, QPS, TPS, number of sessions processed in parallel, cache hit rate, master-slave replication delay, lock status, slow queries
3.4 Middleware monitoring
- Nginx: active connections, waiting connections, dropped connections, request count, latency, 5XX error rate
- Tomcat: maximum threads, current threads, request count, latency, error count, heap memory usage, GC count and time
- Cache: successful connections, blocked connections, used memory, memory fragmentation ratio, request count, latency, cache hit rate
- Message queue: connections, queues, production rate, consumption rate, message backlog
3.5 Application monitoring
- HTTP interface: URL liveness, request count, latency, exception count
- RPC interface: request count, latency, timeout count, rejection count
- JVM: GC count, GC time, size of each memory area, current threads, deadlocked threads
- Thread pool: active threads, task queue size, task execution time, rejected tasks
- Connection pool: total connections, active connections
- Log monitoring: access logs, error logs
- Business metrics: depends on the business, for example PV and order volume
4. The basic workflow of a monitoring system
Whether open source or self-developed, monitoring systems follow much the same overall workflow, which generally includes the following modules:
- Data collection: there are many collection methods, including log instrumentation (reported and parsed with tools such as Logstash and Filebeat), monitoring metrics exposed through the standard JMX interface, REST APIs provided by the monitored object (such as Hadoop and ES), system command-line tools, and intrusive instrumentation and reporting through a unified SDK.
- Data transfer: the collected data is reported to the monitoring system over TCP, UDP, or HTTP, either in an active Push mode or a passive Pull mode.
- Data storage: some systems use an RDBMS such as MySQL or Oracle, some use time series databases such as RRDTool, OpenTSDB, and InfluxDB, and others use HBase.
- Data display: graphical display of the metrics.
- Monitoring alarms: flexible alarm settings, with notification through channels such as email, SMS, and IM.
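The modules above can be sketched as a toy, in-memory pipeline. This is only an illustration of how the pieces relate, not how any real monitoring system is built. The class `MiniMonitor`, the metric name `cpu.usage`, and the hard-coded sample value are all hypothetical.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DataPoint:
    metric: str
    value: float
    timestamp: float


class MiniMonitor:
    """Toy, in-memory stand-in for the transfer, storage, display, and alarm modules."""

    def __init__(self):
        self.store = defaultdict(list)  # storage: metric name -> list of DataPoint
        self.rules = {}                 # alarm rules: metric name -> upper threshold
        self.alerts = []                # alarm messages that would be sent out

    def add_rule(self, metric: str, threshold: float) -> None:
        self.rules[metric] = threshold

    def push(self, point: DataPoint) -> None:
        """Receive a reported point (Push mode), store it, then evaluate alarm rules."""
        self.store[point.metric].append(point)
        threshold = self.rules.get(point.metric)
        if threshold is not None and point.value > threshold:
            self.alerts.append(f"{point.metric}={point.value} exceeds {threshold}")

    def latest(self, metric: str):
        """Return the most recent value, as a display layer might request it."""
        points = self.store.get(metric)
        return points[-1].value if points else None


def collect_cpu_usage() -> DataPoint:
    # Placeholder collector: a real agent would read this value from the OS.
    return DataPoint("cpu.usage", 93.0, time.time())


monitor = MiniMonitor()
monitor.add_rule("cpu.usage", 90.0)
monitor.push(collect_cpu_usage())
```

In a real system each of these responsibilities is a separate component, often a separate process, which is exactly what the architectures in the next part show.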
02 Introduction to mainstream monitoring systems
Now let's look at the mainstream open source monitoring systems. Due to limited space, I have chosen the 3 most widely used ones: Zabbix, Open-Falcon, and Prometheus. I will introduce their architectures and summarize their advantages and disadvantages.
1. Zabbix (an outstanding representative of the older generation of monitoring)
Zabbix dates back to 1998. Its core components are written in C, and its web front end is written in PHP. It is an outstanding representative of the older generation of monitoring systems, with very comprehensive monitoring features and wide adoption: reportedly around 70% of Internet companies have used Zabbix as a monitoring solution.
Let's first look at Zabbix's architecture:
Zabbix Server: the core component, written in C. It receives monitoring data sent by Agents and Proxies, and also supports collecting data directly over protocols such as JMX and SNMP. It is also responsible for data storage and alarm triggering.
Zabbix Proxy: an optional component. When there are many monitored machines, Proxies can be used for distributed monitoring: a Proxy collects part of the monitoring data on behalf of the Server to reduce the load on the Server.
Zabbix Agentd: deployed on the monitored host to collect local data and send it to a Proxy or the Server. Its plug-in mechanism supports user-defined data collection scripts. Agents can be configured manually on the Server side or identified through the auto-discovery mechanism. Data collection supports both an active Push mode and a passive Pull mode.
Database: stores configuration information and collected data, and supports relational databases such as MySQL and Oracle. The latest versions of Zabbix have also begun to support time series databases, but this support is not yet mature.
Web Server: Zabbix's GUI component, written in PHP, which provides monitoring data display and alarm configuration.
Zabbix's advantages:
Mature product: because it has been around for a long time and is widely used, it has rich documentation and many open source data collection plug-ins, which cover most monitoring scenarios.
Many collection methods: supports Agent, SNMP, JMX, SSH, and other collection methods, as well as both active and passive data transfer.
Strong extensibility: supports distributed monitoring through Proxies, provides agent auto-discovery, and has a plug-in architecture that supports user-defined data collection scripts.
Convenient configuration management: monitoring and alarm rules can be configured through the web interface, which is simple to operate and easy to get started with.
Zabbix's disadvantages:
Performance bottleneck: once the number of machines or the business volume grows, writes to the relational database inevitably become a bottleneck. The official figure is that a single server can support up to 5,000 machines, but personally I feel it falls short of that, especially now that there are more and more application-layer metrics. Although the latest versions have begun to support time series databases, that support is not yet mature.
Limited support for application-layer monitoring: if you want to instrument an application intrusively and collect metrics from it (for example, to monitor a thread pool or interface performance), Zabbix has no corresponding SDK. This can be achieved through plug-in scripts, but I personally feel this is not what Zabbix is good at.
Weak data model: it does not support tags, so aggregation and alarm configuration across multiple dimensions are not possible, which makes it inflexible to use.
Secondary development is difficult: Zabbix is written in C, and secondary development often requires familiarity with its database table structure. In most cases the API it provides only allows customization at the display layer.
2. Open-Falcon (produced by Xiaomi, popular in China)
Open-Falcon is an enterprise-grade monitoring tool open sourced by Xiaomi in 2015, developed in Go and Python. It is a flexible, high-performance, and easily extensible new-generation monitoring solution, and it is currently used by more than 200 companies, including Xiaomi, Meituan, and Didi.
Xiaomi also used Zabbix for monitoring in its early days, but once the number of machines and the business volume grew, Zabbix could no longer cope. Xiaomi then developed Open-Falcon in-house. Its architecture draws on the experience of Zabbix while solving many of Zabbix's pain points.
Let's first look at Open-Falcon's architecture:
Falcon-agent: the data collector and reporter, written in Go and deployed on the monitored machine. It supports 3 data collection methods. First, it automatically collects more than 200 basic monitoring metrics per machine with no configuration required. Second, it supports user-defined plugins for obtaining monitoring data. Third, users can push data themselves through an HTTP interface to the local proxy-gateway, which forwards it to the server.
Transfer: the data distribution component. It receives data sent by clients and forwards it to both the data storage component Graph and the alarm evaluation component Judge. Graph and Judge both use consistent hashing to shard data, which improves horizontal scalability. Transfer can also forward data to OpenTSDB for historical archiving.
Graph: the data storage component. It uses RRDTool (a time series database) underneath to store each metric individually, and is optimized through caching and batched disk writes. A single Graph instance is said to handle a write rate of more than 80,000 per second.
Judge and Alarm: the alarm components. Judge evaluates the data reported by Transfer in real time to decide whether to generate an alarm event. Alarm consolidates the alarm events and then pushes alarm messages to each message channel.
API: serves end users. When it receives a query request, it queries the metric data from Graph, aggregates the results, and returns them to the user, hiding the sharding details of the storage cluster.
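The consistent hashing mentioned for Transfer can be illustrated with a short, generic Python sketch. This is not Open-Falcon's actual implementation. The node names `graph-1` to `graph-3`, the key format, and the number of virtual nodes are assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        # Each physical node is placed on the ring at `replicas` virtual positions.
        ring = []
        for node in nodes:
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._keys = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Return the node owning `key`: the first ring position after the key's hash."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


# Hypothetical storage nodes and a hypothetical "endpoint/metric" key.
ring = ConsistentHashRing(["graph-1", "graph-2", "graph-3"])
owner = ring.node_for("host-1/cpu.idle")
```

The benefit for scaling out is that adding or removing a node only remaps the keys adjacent to that node's ring positions, while all other keys stay on the node they were already on.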
Open-Falcon's advantages:
Automatic collection: Falcon-agent automatically collects more than 200 basic server metrics (such as CPU and memory) with no configuration needed on the server side. On this point it far outperforms Zabbix.
Powerful storage: it uses RRDTool underneath and shards data with consistent hashing, forming a distributed time series storage system with high scalability.
Flexible data model: following OpenTSDB, the data model introduces tags, which supports multi-dimensional aggregation and alarm rule configuration and greatly improves efficiency of use.
Unified plugin management: Open-Falcon's plugin mechanism manages user-defined scripts centrally and distributes them to agents through the HeartBeat Server, reducing the cost of users maintaining scripts on their own.
Support for custom monitoring: based on Proxy-gateway, it is easy to meet custom monitoring needs, such as application-layer monitoring through self-managed instrumentation (for example, monitoring an interface's request volume and latency). Integration is straightforward.
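To show why a tag-based data model makes multi-dimensional aggregation easy, here is a small Python sketch. The metric name, tag keys (`service`, `idc`), and values are made up for the example, and the dictionary layout only mimics the general shape of a tagged data point rather than any product's exact wire format.

```python
from collections import defaultdict

# Each point carries a metric name, a value, and free-form tags (all hypothetical).
points = [
    {"metric": "api.latency_ms", "value": 120.0, "tags": {"service": "order", "idc": "bj"}},
    {"metric": "api.latency_ms", "value": 80.0, "tags": {"service": "order", "idc": "sh"}},
    {"metric": "api.latency_ms", "value": 40.0, "tags": {"service": "user", "idc": "bj"}},
]


def average_by(points, metric, tag_key):
    """Average a metric's values grouped by one tag dimension."""
    groups = defaultdict(list)
    for p in points:
        if p["metric"] == metric and tag_key in p["tags"]:
            groups[p["tags"][tag_key]].append(p["value"])
    return {tag_value: sum(vals) / len(vals) for tag_value, vals in groups.items()}


by_service = average_by(points, "api.latency_ms", "service")
by_idc = average_by(points, "api.latency_ms", "idc")
```

The same raw points answer both "latency per service" and "latency per data center" without defining a separate metric for each combination, which is what a model without tags would require.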
Open-Falcon's disadvantages:
Mediocre overall development: community activity is low and releases are slow. Some large companies do secondary development directly on top of its stable version, and I am somewhat worried about its future.
UI is not friendly enough: line-of-business developers may just want to configure alarms and monitor their services easily, but concepts such as machine grouping, strategy templates, and template inheritance are all exposed in the UI. It feels as though the UI was designed around these concepts, which makes it somewhat hard to understand.
Installation is relatively complex: in my own experience, because it originated inside Xiaomi, there are still many components even though the dependencies on Xiaomi's internal systems have been removed. If you are not familiar with the overall architecture, it is hard to install in one go.
3. Prometheus (known as the next-generation monitoring system)
Prometheus is an open source monitoring system developed by former Google employees and officially released in 2015, written in Go. Besides having a cool name, it enjoys strong support from Google and k8s and is very popular in the open source community.
Prometheus joined the Cloud Native Computing Foundation in 2016, becoming the second hosted project after k8s, so its future looks quite promising. Its biggest difference from Open-Falcon is that data collection is based on a Pull model rather than a Push model, and its architecture is very simple.
Let's first look at Prometheus's architecture:
Prometheus Server: the core component, used to collect and store monitoring data. It manages monitoring targets through either static configuration or dynamic Service Discovery, and pulls data from those targets. Prometheus Server is also a time series database: it stores the monitoring data on local disk and provides the custom PromQL language for querying and analyzing the data.
Exporter: collects data, playing a role similar to an agent. The difference is that Prometheus pulls the collected data using the Pull model, so an Exporter exposes the monitoring data to Prometheus Server as an HTTP service. Many ready-made Exporters are available for direct use, and users can also implement their own with the client libraries for various languages.
Push gateway: mainly used for short-lived jobs, which may finish before Prometheus Server comes to pull their data. Such a job can instead actively push its monitoring data to the Push gateway, which caches it as a relay.
Alert Manager: when an alarm is triggered, Prometheus Server pushes the alarm information to Alert Manager, which sends it to the receivers.
Web UI: Prometheus has a simple built-in web console for querying configuration information, metrics, and so on. In practice, Prometheus is usually used as a Grafana data source to build dashboards and view metrics.
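To illustrate the Pull model, here is a hand-rolled sketch of what an Exporter does, using only the Python standard library. It is not the official client library, which is what you would normally use. The metric names, the `pool` label, the sample values, and the helper names `render_metrics`, `current_samples`, and `make_server` are all hypothetical. The sketch assumes samples of the same metric are listed together and does not escape label values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render gauge samples in the Prometheus text exposition format.

    `samples` is a list of (name, labels_dict, value) tuples.
    """
    lines = []
    seen = set()
    for name, labels, value in samples:
        if name not in seen:
            lines.append(f"# TYPE {name} gauge")
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}")
    return "\n".join(lines) + "\n"


def current_samples():
    # Placeholder values: a real exporter would read these from the application.
    return [
        ("threadpool_active_threads", {"pool": "order"}, 8),
        ("threadpool_queue_size", {"pool": "order"}, 3),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server side stays passive: it only answers when it is scraped.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(current_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int) -> HTTPServer:
    """Build (but do not start) an HTTP server exposing /metrics on localhost."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling `make_server(8000).serve_forever()` would start serving, after which Prometheus Server could be configured to scrape that address. Note that the monitored side never needs to know where the monitoring server is, which is the essence of the Pull model.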
Prometheus's advantages:
Lightweight management: the architecture is simple and does not depend on external storage. A single server node can work on its own and starts from a single binary, making it a lightweight server that is easy to migrate and maintain.
Strong processing capacity: monitoring data is stored directly in Prometheus Server's local time series database, and a single instance can handle millions of metrics.
Flexible data model: like Open-Falcon, it introduces tags, giving it a multi-dimensional data model that makes aggregation more convenient.
Powerful query language: PromQL allows operations such as addition, joins, and quantile calculation on multiple metrics within a single query.
Good support for cloud environments: it can discover containers automatically, and projects such as k8s and etcd provide native support for Prometheus. It is currently the most popular container monitoring solution.
Prometheus's disadvantages:
Incomplete functionality: Prometheus was designed from the start to be simple, so it does not provide a clustering solution, long-term persistent storage, or user management. These are capabilities an enterprise needs as it grows, and for now they can only be achieved by extending Prometheus.
More complex network planning: because Prometheus pulls data using the Pull model, every monitored endpoint must be reachable, which means the network security configuration has to be planned carefully.
03 Suggestions for selecting a monitoring system
After the introduction above, you should have some understanding of the mainstream monitoring systems. When it comes to selection, my advice is:
1. First be clear about your monitoring requirements: which objects need to be monitored? How many machines and metrics are there? What alarm capabilities are needed?
2. Monitoring is something built up over the long term. I don't think it is necessary to aim for an all-in-one monitoring solution from the start. From a cost perspective, you can use an open source monitoring solution directly in the early stage and solve the immediate problems first.
3. In terms of system maturity, Zabbix is an older-generation monitoring system with plenty of documentation, comprehensive features, and good stability. If you have no more than a few hundred machines, you need not worry much about performance. In addition, database partitioning, SSDs, the Proxy architecture, and the Push collection mode can all improve monitoring performance.
4. Zabbix has an absolute advantage in server monitoring and can cover more than 90% of monitoring scenarios, but it is not good at application-layer monitoring, such as monitoring the state of a thread pool or the execution time of an internal interface, which usually requires intrusive instrumentation. In contrast, the new-generation monitoring systems Open-Falcon and Prometheus handle this well.
5. In terms of overall capability, the new-generation monitoring systems also have clear advantages, such as a flexible data model, more mature time series databases, and powerful alarm features. If you have no existing technical investment in traditional monitoring systems like Zabbix, I recommend Open-Falcon or Prometheus.
6. Open-Falcon's core advantage is data sharding, which lets it support more machines and monitoring items. Prometheus is the standard choice for container monitoring and has the backing of Google and k8s.
7. Zabbix, Open-Falcon, and Prometheus all support quick integration with Grafana. If you want an attractive and powerful visualization experience, combine them with Grafana.
8. Use the right monitoring system for each problem. Running several monitoring systems at the same time is fine, and it is very common in a company's early days.
9. In the middle and later stages, as the number of machines, the data volume, and custom requirements grow (for example, wanting a unified monitoring platform integrated with the company's CMDB and organizational structure), you often need secondary development or integration through the APIs the monitoring system provides. From this perspective, Open-Falcon or Prometheus is more suitable.
10. If you do have to build your own, study the architectures of the mainstream monitoring systems and learn from their strengths.
Final words
This article has gone through the basics, principles, and mainstream architectures of monitoring systems in detail. I hope it helps you understand monitoring systems and make a more suitable choice when selecting one.
Due to limited space, this article does not cover full-link monitoring, log monitoring, or web front-end and client monitoring. As you can see, monitoring is a large and complex field. To understand it thoroughly, you need to combine theory with practice and dig deeper.
If you have your own experience or insights on operations monitoring systems, feel free to leave a comment.
About the author: holds a master's degree from a 985 university, formerly an engineer at Amazon, currently a technical director at 58 Zhuanzhuan