
Finally, Someone Has Explained Prometheus Clearly

2020-12-08 14:24:12 osc_ b6tyukpz

Prometheus is at once a time series database, a monitoring system, and a complete monitoring ecosystem. As a time series database, it had climbed to third place in the February 2020 rankings, ahead of older time series databases such as OpenTSDB, Graphite, RRDtool, and KairosDB, as shown in Figure 1.

Figure 1: Time series database ranking

As a monitoring system: on August 9, 2018, at PromCon (the annual Prometheus conference), the CNCF announced that Prometheus was the second project, after Kubernetes, to "graduate" from the CNCF. To move from incubation to graduation, a CNCF-hosted project must be widely adopted by the community, have a fully documented governance process, and show a strong commitment to community sustainability and inclusiveness. The Prometheus open source community is very active: the project has about 30,000 stars on GitHub, and minor releases appear frequently. Besides PromCon, KubeCon, and CloudNativeCon, the CNCF also gives adopters, developers, and practitioners a platform for face-to-face collaboration, where they discuss industry trends with the leaders of Kubernetes, Prometheus, and other CNCF-hosted projects and jointly set the direction of the cloud native ecosystem. Table 1 lists the CNCF open source software highlighted at KubeCon and CloudNativeCon 2020, which shows how important Prometheus is.

Table 1: CNCF open source software highlighted at KubeCon and CloudNativeCon 2020

As a monitoring ecosystem: Prometheus exporters alone, counting both those on the official list and those outside it, already cover thousands of common software packages, middleware products, and systems.

A brief history of Prometheus

Prometheus and Kubernetes are not only closely linked in use; they also share a deep history. At Google in Mountain View, California, there were two systems: Borg and its monitoring system, Borgmon. Borg is the cluster manager that Google uses internally to run jobs from many different applications; each cluster has tens of thousands of servers and tens of thousands of jobs. Borgmon is the monitoring system that accompanies Borg. Neither Borg nor Borgmon has been open sourced, but the open source Kubernetes and Prometheus both inherit their ideas.

Kubernetes descends from Borg, and Prometheus descends from Borgmon. The Google SRE book also mentions Prometheus as an implementation similar to Borgmon. Kubernetes, now the most common container scheduling and management system, is usually monitored with Prometheus.

In 2012, former Google SRE Matt T. Proud started Prometheus as a research project. After he joined SoundCloud, he and another engineer, Julius Volz, developed Prometheus as open source software, and an early version was released to the public at the beginning of 2015. Prometheus is a standalone open source project maintained independently of any company. It has a very active community and many developers, and many companies use it to meet their monitoring needs. In May 2016 Prometheus became the second project, after Kubernetes, to officially join the CNCF, and version 1.0 was released in July of the same year. Version 2.0 followed at the end of 2017 and works better with container platforms and cloud platforms.

The home page of the Prometheus website is shown in Figure 2. Prometheus is mainly used to provide near-real-time monitoring for dynamic cloud environments, containers, microservices, and applications. The book Site Reliability Engineering: How Google Runs Production Systems also notes that even though Borgmon remains internal to Google, the idea of treating time series data as a data source for generating alerts is well embodied in Prometheus. This shows directly that Prometheus shares deep historical roots with Kubernetes and Google.

Figure 2: Prometheus website home page

Main features of Prometheus

The tagline on the official website is: "From metrics to insight. Power your metrics and alerting with a leading open-source monitoring solution." In other words, Prometheus presents itself as a leading open source monitoring solution that gives strong support to users' metrics and alerting.

Compared with monitoring systems such as Nagios, Zabbix, Ganglia, and Open-Falcon, Prometheus has four main features:

  • A multi-dimensional data model that can be queried flexibly through PromQL.

  • An open standard for metric data, which makes custom probes (such as exporters) easy to write.

  • The Pushgateway component, which lets the monitoring system receive pushed monitoring data.

  • VM and containerized versions are provided.

The first point in particular is beyond the reach of many monitoring systems. The multi-dimensional data model and flexible query language let a metric be associated with multiple labels and let time series be sliced and diced along those labels, which supports a wide range of graphs, tables, and alerting scenarios.
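
To make "sliced and diced" concrete, here is a minimal toy sketch of the idea in Python. It is not how Prometheus stores or evaluates anything; the metric name, label names, and values are invented, and the helper functions only mimic what a label matcher and a `sum by (...)` aggregation do over an instant vector.

```python
from collections import defaultdict

# Each time series is identified by a metric name plus a set of labels.
# The metric name, label names, and values below are made up for illustration.
samples = [
    ({"__name__": "http_requests_total", "method": "GET", "status": "200"}, 120.0),
    ({"__name__": "http_requests_total", "method": "GET", "status": "500"}, 3.0),
    ({"__name__": "http_requests_total", "method": "POST", "status": "200"}, 40.0),
    ({"__name__": "http_requests_total", "method": "POST", "status": "500"}, 7.0),
]


def select(series, **matchers):
    """Keep only the series whose labels equal every given matcher."""
    return [(labels, value) for labels, value in series
            if all(labels.get(k) == v for k, v in matchers.items())]


def sum_by(series, label):
    """Aggregate values per distinct value of one label, like `sum by (label)`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Roughly what  sum by (status) (http_requests_total)  computes.
by_status = sum_by(select(samples, __name__="http_requests_total"), "status")

# Slice on one dimension, then aggregate on another:
#   sum by (method) (http_requests_total{status="500"})
errors_by_method = sum_by(select(samples, status="500"), "method")
```

Because every sample carries its labels, the same data can answer both questions without being modeled twice, which is the practical benefit of the multi-dimensional model.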

Beyond these four features, Prometheus also has the following characteristics:

  • Written in Go and built for cloud native environments.

  • Collects data mainly in pull mode, with push mode as a supplement.

  • Starts directly from a single binary; container images are also available for containerized deployment.

  • Client support for many languages, such as Java, JMX, Python, Go, Ruby, .NET, and Node.js.

  • Supports local storage and third-party remote storage. Single-node performance is strong: one server can handle thousands of targets and on the order of a million time series samples per second.

  • Efficient storage. One sample takes about 3.5 bytes on average. With 3.2 million time series sampled every 30 seconds and kept for 60 days, disk usage is about 228 GB (this includes some margin, and some items that consume disk space are not counted here).

  • Scalable. A Prometheus Server can be run independently in each data center or by each team. Federation can also combine multiple Prometheus instances into one logical cluster; when a single Prometheus Server has too much work, it can be scaled out with functional partitioning (sharding) plus federation.

  • Good visualization. Prometheus offers several ways to visualize data, such as the built-in expression browser, Grafana integration, and a console template language. It also provides an HTTP query API, which makes it easy to display data with other GUI components or scripts.

  • Precise alerting. Alerts and predictions can be defined with flexible PromQL expressions, and grouping, inhibition, and silencing help prevent alert storms.

  • Supports both static file configuration and dynamic service discovery. Kubernetes, etcd, Consul, and other mechanisms are currently supported, which greatly reduces manual configuration work when containers are released.

  • Open. The output of the Prometheus client libraries is not limited to the Prometheus data format; it can also be produced in formats supported by other monitoring systems, such as Graphite.
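
The storage-efficiency bullet above can be sanity-checked with the usual back-of-the-envelope formula: disk usage is roughly retention time in seconds, times ingested samples per second, times bytes per sample. The sketch below is only that arithmetic; the function name is made up. Fed the bullet's own inputs, the raw product comes out at roughly 1.9 TB, noticeably more than the quoted 228 GB, so the quoted figure presumably rests on assumptions that the text does not spell out, and either number should be treated as a rough guide.

```python
def disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample):
    """Estimate disk usage as retention * ingestion rate * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Inputs taken from the bullet: 3.2 million series, one sample every 30 s,
# 60 days of retention, about 3.5 bytes per sample.
estimate = disk_bytes(series=3_200_000, scrape_interval_s=30,
                      retention_days=60, bytes_per_sample=3.5)
print(f"{estimate / 1e12:.2f} TB")
```

Swapping in the 1.3 bytes per sample reported for the 2.0 storage engine, or a longer scrape interval, shows how sensitive the estimate is to each input.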

Prometheus also has some limitations, mainly the following:

  • Prometheus targets performance and availability monitoring; it is not suited to logs, events, or distributed tracing.

  • Prometheus focuses on recent events rather than tracking weeks or months of data, because most monitoring queries and alerts concern recent data (usually less than a day old). Prometheus treats the most recent data as the most useful and retains monitoring data for 15 days by default.

  • Local storage is limited; storing large volumes of historical data requires third-party remote storage.

  • Federation does not provide a unified global view.

  • Prometheus monitoring data carries no definition of units.

  • Prometheus statistics are not guaranteed to be 100% accurate, so it is not suited to scenarios that require exact figures, such as orders, payments, or metering and billing.

  • Prometheus uses the pull model by default, so the network should be planned sensibly and forwarding avoided where possible.

Prometheus architecture analysis

The Prometheus architecture is shown in Figure 3, which illustrates the relationships among Prometheus's internal modules and the surrounding components.

Figure 3: Prometheus architecture diagram

As Figure 3 shows, Prometheus consists of six core modules: Prometheus Server, Pushgateway, Job/Exporter, Service Discovery, Alertmanager, and Dashboard. Prometheus finds targets through the service discovery mechanism. A target can be a long-running job, a short-lived job, or a third-party application monitored through an exporter. The scraped data is stored, and from there it is either queried with PromQL and shown in visualization systems such as dashboards, or used to send alerts to Alertmanager, which delivers them as pages, email, DingTalk messages, or in other forms.

As the architecture diagram shows, Prometheus is not just a time series database; it is a complete monitoring system within a whole ecosystem. When selecting a time series database, one usually weighs wide-column storage models, SQL-like query support, horizontal scaling, read/write separation, and high performance. For the architecture of a monitoring system, in addition to those selection factors, one often also has to consider reducing the number of components and services to cut cost and complexity, as well as horizontal scaling.

Many enterprises build their own monitoring systems with a message queue such as Kafka plus metrics-processing modules such as a Metrics parser and a Metrics process server, supplemented by stream processing such as Spark. Applications push metrics to the message queue (such as Kafka), the metrics then pass through an Exposer, and Prometheus finally pulls them. Reasons for this design include historical baggage, reuse of existing components, and the use of an MQ (message queue) to improve scalability. This approach has the following problems:

  • It adds query components: basic functions such as sum, count, and average need extra computation. On one hand this adds a layer of dependency, so a failed connection to the query module adds another layer of failure risk; on the other hand, many basic query functions consume extra resources. In the Prometheus architecture, all of these functions are supported natively.

  • Scrape times may be out of sync, and delayed data will be marked as stale. If the data is identified by attaching explicit timestamps, the logic for handling stale data is lost.

  • Prometheus is suited to monitoring a large number of small targets rather than one big target. If all the data is placed in the Exposer, a single Prometheus job pulling it becomes a CPU bottleneck. This design resembles Pushgateway, so unless the scenario truly requires it, it is not officially recommended.

  • Service discovery and pull control are missing. Prometheus sees only the Exposer module; it does not know the actual targets or each target's up status, so the scrape_* metrics cannot be used and scrape limits (such as sample_limit) cannot be applied.

Heavy dependencies like these are worth optimizing, and Prometheus's pull-based architecture is a good reference for doing so. Likewise, many enterprise monitoring systems depend heavily on a CMDB; adopting the Prometheus architecture can also remove the dependence on the CMDB for labels.

Job/Exporter

Jobs and exporters are Prometheus targets, the objects that Prometheus monitors.

Jobs can be long-running or short-lived. A long-running job can be instrumented with a Prometheus client library; a short-lived job can push its monitoring data to Pushgateway, where it is cached.

Prometheus lists thousands of exporters, which can be used to monitor third-party systems. An exporter works by exposing a third-party system's monitoring data in the Prometheus format; for third-party systems that have no exporter, a custom one can be written, as later chapters describe in detail. Prometheus is a white-box monitoring system: it collects metrics that applications expose from the inside. Checking a service from the outside is black-box monitoring, and the exporter Prometheus uses for that is blackbox_exporter. blackbox_exporter includes some ready-made modules, for example HTTP, TCP, POP3S, IRC, and ICMP, and its configuration file blackbox.yml can be extended with further modules to meet users' needs. One welcome feature of blackbox_exporter is that if a module uses TLS/SSL, the exporter automatically exposes when the certificate chain expires, which makes it easy to alert on SSL certificates that are about to expire.
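
As a rough illustration of what "exposing data in the Prometheus format" means, here is a minimal hand-rolled exporter using only the Python standard library. It is a sketch, not a recommended implementation: real exporters normally use an official client library, and the metric name `myapp_queue_depth`, the `queue` label, the fixed value, and the port are all invented for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth):
    """Render one gauge in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_queue_depth Number of items waiting in the queue.",
        "# TYPE myapp_queue_depth gauge",
        f'myapp_queue_depth{{queue="default"}} {queue_depth}',
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes the /metrics path by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A real exporter would read the value from the monitored system here.
        body = render_metrics(queue_depth=42).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; call this from your own entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at that host and port would let Prometheus pull the gauge on every scrape interval.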

There are many kinds of exporters. Each one is independent and does its own job. However, the more exporters there are, the greater the maintenance burden. Internally developed agent tools in particular need a lot of manpower for resource control, new features, version upgrades, and so on. One option is to manage them uniformly with Telegraf, which is open sourced by InfluxData. Telegraf is a plugin-driven data collection agent written in Go. It has a rich set of input and output plugins, and users with special needs can write their own plugins (recompilation required). Its position in the InfluxData architecture is shown in Figure 4.

Figure 4: Telegraf's position in the InfluxData architecture

Telegraf is the "T" in TICK, the technology stack of InfluxData's high-performance time series platform. It is mainly used to collect time series data, such as server CPU metrics, memory metrics, and data generated by IoT devices. Telegraf can integrate many kinds of exporters, so several exporters can be combined into one. Another approach is to launch multiple exporter processes, whose versions can still be updated along with the community.

Telegraf has very low CPU and memory usage, supports almost all common monitoring integrations, and has rich community-contributed integrations, such as Linux, Redis, Apache, StatsD, Java/Jolokia, Cassandra, and MySQL. Because Prometheus and InfluxDB are both time series storage and monitoring systems, Telegraf can be connected to Prometheus flexibly. In a proof-of-concept (POC) environment, integrating Telegraf with Prometheus showed lower memory and CPU usage than using Prometheus alone.

Pushgateway

Prometheus is a pull-based monitoring system; its push mode is implemented through the Pushgateway component. Pushgateway is an intermediate gateway to which ephemeral jobs actively push their metrics. In essence it is a solution for monitoring resources that the Prometheus server cannot scrape. It is also written in Go and open sourced under the Apache 2.0 license.

Pushgateway runs as an independent service between applications and the Prometheus server. Applications push metrics to Pushgateway, which receives them and is then itself scraped by the Prometheus server as a target. Its usage scenarios are as follows:

  • Temporary or short-lived jobs

  • Batch jobs

  • Applications that are network-isolated from the Prometheus server, for security reasons (a firewall) or connectivity reasons (different network segments, or a server or application that allows access only on specific ports or paths).
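
For a short-lived job, the push described above is a single HTTP request whose body is in the text exposition format and whose URL path names the job and instance. The sketch below builds such a request with the standard library and deliberately does not send it. The host is the example host used elsewhere in this article, and the metric name and timestamp value are invented.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url, job, instance, metrics_text):
    """Build (but do not send) a PUT that replaces one job/instance group."""
    url = (f"{base_url}/metrics/job/{quote(job, safe='')}"
           f"/instance/{quote(instance, safe='')}")
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# The body must end with a newline; the metric below is hypothetical.
payload = (
    "# TYPE batch_last_success_timestamp_seconds gauge\n"
    "batch_last_success_timestamp_seconds 1607408652\n"
)
req = build_push_request("http://pushgateway.example.org:9091",
                         "some_job", "some_instance", payload)
# To actually push: urllib.request.urlopen(req, timeout=5)
```

A batch job would typically run this as its last step, so that the pushed timestamp records the last successful completion.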

Pushgateway works like a gateway and is recommended in Prometheus only as a stopgap, mainly for monitoring resources that are hard to reach. Using it forfeits many functions that the Prometheus server provides, such as the up metric and instance health monitoring based on metric expiry.

A common problem with Pushgateway is that it is a single point of failure. If a Pushgateway that collects metrics from many different sources goes down, users lose monitoring of all those sources, and many unnecessary alerts may fire.

Another thing to keep in mind is that Pushgateway never automatically deletes the metrics pushed to it. Stale metrics therefore have to be deleted from the gateway through the Pushgateway API:

curl -X DELETE http://pushgateway.example.org:9091/metrics/job/some_job/instance/some_instance

Pushgateway is also used to work around firewalls and NAT. In that situation it is recommended to move Prometheus behind the firewall instead, so that it sits closer to the targets it scrapes.

Note that with Pushgateway, Prometheus loses the ability to check instance health through the up metric; the up metric for the scrape then reflects only the single Pushgateway service.

Service Discovery

As the preferred choice for next-generation monitoring, Prometheus uses its service discovery mechanism to provide solid support for monitoring in cloud and container environments.

Besides file-based service discovery (Prometheus periodically reads the latest target information from files), Prometheus supports many common service discovery mechanisms, such as Kubernetes, DNS, ZooKeeper, Azure, EC2, and GCE. For example, Prometheus can use the Kubernetes API to learn of changes to containers (such as creation and deletion) and update its monitoring targets dynamically.

In practice, file-based service discovery can be combined with automated configuration management tools (Ansible, cron jobs, Puppet, SaltStack, and so on).
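
As a sketch of how such a tool might feed file-based service discovery, the snippet below writes a target file in the JSON form that file-based discovery reads: a list of groups, each with `targets` and optional `labels`. The addresses, label values, and the helper name `write_file_sd` are assumptions made for the example; the file would be referenced from a `file_sd_configs` entry in the Prometheus configuration.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Atomically write target groups in the file-based discovery JSON format."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the scratch file from matching a *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(groups, tmp, indent=2)
    # Rename into place so Prometheus never reads a half-written file.
    os.replace(tmp_path, path)


# Hypothetical inventory, as a cron job or Ansible task might produce it.
groups = [
    {"targets": ["10.0.0.11:9100", "10.0.0.12:9100"],
     "labels": {"env": "production", "team": "payments"}},
    {"targets": ["10.0.1.21:9100"],
     "labels": {"env": "testing", "team": "payments"}},
]
```

Calling `write_file_sd("/etc/prometheus/targets/node.json", groups)` (an assumed path) whenever the inventory changes is enough; Prometheus picks up the new targets without a restart.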

Through service discovery, an administrator can have Prometheus dynamically discover the target instances that need monitoring without restarting the Prometheus service. Service discovery also offers an advanced feature, relabeling. Relabeling reads the default metadata labels that Prometheus attaches to each target instance and uses them, according to rules (such as labels) that distinguish environments (testing, staging, production), business teams, or organizations, to select only some of the target instances returned by the service discovery registry, so that monitoring data is collected only from those exporter instances.
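
The selection step can be pictured as a filter over discovered targets. The toy function below imitates the effect of a relabeling rule with a keep action: a target survives only if the value of a chosen label fully matches a regular expression. It illustrates the idea rather than Prometheus's implementation, and the label name `__meta_env` and the addresses are hypothetical.

```python
import re


def relabel_keep(targets, source_label, regex):
    """Keep targets whose source label fully matches the regex."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical discovered targets; the "__meta_"-prefixed label stands in for
# metadata that a service discovery mechanism attaches before relabeling.
discovered = [
    {"__address__": "10.0.0.11:9100", "__meta_env": "production"},
    {"__address__": "10.0.0.12:9100", "__meta_env": "testing"},
    {"__address__": "10.0.0.13:9100", "__meta_env": "staging"},
]

# Scrape production and staging only; the testing instance is dropped.
kept = relabel_keep(discovered, "__meta_env", "production|staging")
```

Full-string matching is used here because relabeling regular expressions are anchored at both ends, so `prod` alone would not match `production`.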

Compared with static file configuration, most monitored objects in cloud and container environments are dynamic. In practice Prometheus, as a next-generation monitoring solution, is better suited to the monitoring needs of cloud and container environments, and mechanisms in the service discovery process, such as relabeling, add further power.

Prometheus Server

Prometheus Server is the core module of Prometheus. It has three main functions: scraping, storage, and querying, as shown in Figure 5.

Figure 5: Prometheus Server functions

Scraping: Using the service discovery component, Prometheus Server periodically pulls metric data over HTTP, by polling, from the three kinds of components described above: jobs, exporters, and Pushgateway.
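
What comes back from each pull is plain text, one sample per line. The following simplified parser shows roughly what the server has to extract from a scrape. It is a sketch under stated assumptions: it skips comment lines, does not handle optional timestamps or escaping, and the metric names in the sample text are invented.

```python
def parse_exposition(text):
    """Parse 'series value' lines, skipping comments; timestamps are not handled."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical response body from a target's /metrics endpoint.
scraped = """\
# HELP myapp_queue_depth Number of items waiting in the queue.
# TYPE myapp_queue_depth gauge
myapp_queue_depth{queue="default"} 42
myapp_up_since_seconds 1.6074e+09
"""
parsed = parse_exposition(scraped)
```

Each parsed series then receives the target's labels and the scrape timestamp before it is stored.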

Storage: The scraped data is cleaned and organized according to rules (relabel_configs, applied together with service discovery before the scrape, and metric_relabel_configs, applied within the job after the scrape), and the result is persisted as new time series. The storage module has been redesigned several times over the years; the storage system in Prometheus 2.0 is the third iteration. It can ingest millions of samples per second, which makes it possible to monitor thousands of machines with a single Prometheus server. Its compression algorithm achieves about 1.3 bytes per sample on real-world data. SSDs are recommended but not strictly required.

Prometheus storage is divided into local storage and remote storage.

  • Local storage: Data is kept directly on the local disk. SSDs are recommended, and it is best not to keep more than a month of data. Remember that no version of Prometheus supports NFS. Real production incidents show that storing Prometheus files on NFS can corrupt or lose historical data.

  • Remote storage: Suitable for storing large volumes of monitoring data. Remote storage options supported by Prometheus include OpenTSDB, InfluxDB, Elasticsearch, Graphite, CrateDB, Kafka, PostgreSQL, TimescaleDB, and TiKV. Remote storage requires an intermediate adapter for conversion, which mainly involves the remote_write and remote_read interfaces in Prometheus. In real production, remote storage runs into all kinds of problems and needs continuous optimization, load testing, and architectural changes, sometimes even a rewrite of the module that uploads data.

Querying: Once Prometheus has persisted the data, clients can query it with PromQL. PromQL's capabilities are covered in detail later.

Dashboard

As the Prometheus architecture diagram indicates, the Web UI, Grafana, and API clients can all be regarded as the Prometheus Dashboard. Besides the built-in query language PromQL, Prometheus provides an expression browser, which can also graph query results. In practice Grafana is usually used as the front-end display, and users can also send requests from a client to Prometheus Server to fetch data.
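
For the last case, a client fetching data directly, a minimal sketch looks like this. It builds the URL of an instant query against the HTTP API and parses a response of the instant-vector shape. The host name is a placeholder, no request is actually sent, and the sample response is hand-written for illustration rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_text):
    """Turn an instant-vector response into (labels, float value) pairs."""
    doc = json.loads(response_text)
    if doc.get("status") != "success":
        raise RuntimeError(doc.get("error", "query failed"))
    return [(item["metric"], float(item["value"][1]))
            for item in doc["data"]["result"]]


url = instant_query_url("http://prometheus.example.org:9090", "up")
# Against a real server: urllib.request.urlopen(url).read().decode()

# Hand-written sample in the instant-vector response shape.
sample = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "job": "node", "instance": "10.0.0.11:9100"},
   "value": [1607408652.0, "1"]}]}}"""
results = parse_vector(sample)
```

Sample values arrive as strings in the JSON, which is why the parser converts them with `float` before use.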

Alertmanager

Alertmanager is an alerting component independent of Prometheus and must be installed and deployed separately. Prometheus can be configured with multiple Alertmanagers that form a cluster, and through service discovery it can dynamically track nodes joining and leaving the alerting cluster, which avoids a single point of failure. Alertmanager instances within a cluster also communicate with one another, as shown in Figure 6.

Figure 6: Prometheus Alertmanager cluster

Alertmanager receives the alerts pushed by Prometheus and manages, consolidates, and distributes them to different destinations. It provides a variety of built-in third-party notification integrations and also supports webhook notifications, through which users can build more customized alert handling. Beyond basic notification, Alertmanager provides alerting features such as grouping, inhibition, and silencing.
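
Grouping is the easiest of these features to picture: alerts that share the same values for a chosen set of labels are bundled into one notification instead of many. The toy function below shows only that bucketing step. It is not Alertmanager's implementation, and the alert names, clusters, and instances are invented.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the chosen labels, one bucket per notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts; label names and values are invented.
alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "a", "instance": "n1"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "a", "instance": "n2"}},
    {"labels": {"alertname": "DiskFull", "cluster": "b", "instance": "n7"}},
]

grouped = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in grouped.items():
    print(key, "->", len(members), "alert(s)")
```

With grouping by alert name and cluster, an outage that takes down many instances in one cluster produces a single notification listing all of them, which is how alert storms are kept in check.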

This article is excerpted from the book Prometheus Cloud Native Monitoring: Operations and Development and is published with the publisher's authorization.


Copyright notice
This article was written by [osc_ b6tyukpz]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2020/12/20201208142314175u.html