One-stop operations monitoring for hybrid cloud: Didi Nightingale

2021-01-23 16:58:21 Obsuite

01 Introduction to Didi Nightingale

Didi Nightingale is a distributed, highly available operations monitoring system. Its most distinctive feature is hybrid cloud support: it handles traditional physical machines and virtual machines as well as K8S container environments. Nightingale is also more than a monitoring tool. It includes some CMDB capabilities and operations automation capabilities, and many companies build their own operations platforms on top of it. The open source modules are also part of the commercial version, so they are reliable and will continue to be maintained. You can use them with confidence.

02 Main features of Didi Nightingale

This section describes the core functional modules of Didi Nightingale. The user resource center and the monitoring and alerting system deserve the most attention.

》 User resource center

This is the foundation of the platform, and every operations system depends on it. It has built-in management of users, permissions, roles, organizations, and resources. At its core is an organization resource tree whose node categories and extension fields can be customized. The simplest hierarchy for the tree is tenant ---> project ---> module. A slightly more complex one is tenant ---> organization ---> project ---> module ---> cluster, and organizations can be nested.

Two kinds of objects are attached to nodes: personnel permissions and resources. Resources can be of any kind. Besides hosts and network devices, they can also be RDS instances or Redis instances, provided that the RDS and Redis management systems have been integrated with RDB. Didi is building some large commercial solutions behind the scenes, and RDB serves as the foundation for them.
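
The organization resource tree can be pictured as a small data structure. The following is a minimal sketch in Python, not Nightingale's actual schema: the `Node` class, its fields, and the example names (`acme`, `pay`, `api`, `alice`) are hypothetical, and the rule that a role granted on a node also applies to its descendants is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a hypothetical organization resource tree."""
    name: str
    category: str  # e.g. "tenant", "organization", "project", "module"
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)
    members: Dict[str, str] = field(default_factory=dict)  # user -> role
    resources: List[str] = field(default_factory=list)     # hosts, RDS instances, ...

    def add_child(self, name: str, category: str) -> "Node":
        child = Node(name=name, category=category, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the dotted path from the root down to this node."""
        parts = []
        node: Optional[Node] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))

    def all_resources(self) -> List[str]:
        """Collect resources attached to this node and every descendant."""
        found = list(self.resources)
        for child in self.children:
            found.extend(child.all_resources())
        return found

    def role_of(self, user: str) -> Optional[str]:
        """Assumed rule: a role is inherited from the nearest ancestor."""
        node: Optional[Node] = self
        while node is not None:
            if user in node.members:
                return node.members[user]
            node = node.parent
        return None


# Simplest hierarchy from the text: tenant -> project -> module.
tenant = Node("acme", "tenant")
project = tenant.add_child("pay", "project")
module = project.add_child("api", "module")

project.members["alice"] = "admin"
module.resources.extend(["10.0.0.1", "redis-pay-01"])
```

Because both permissions and resources hang on nodes, a question such as "which machines may this user touch" reduces to a walk over the tree: `module.role_of("alice")` finds the role granted on the project, and `tenant.all_resources()` gathers everything mounted underneath the tenant.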

》 Asset management system

The asset management system here manages hardware assets. Its users are typically asset managers in the systems department, and application operations engineers pay relatively little attention to it. The open source version provides management of host devices. You can build on it with your own development, adding network device management, rack position management, management of accessories and consumables, and so on. With the foundation in place, it is relatively easy to grow other systems on top of it.

After the agent is installed, it automatically registers itself with the asset management system and collects the machine's sn, ip, cpu, mem, disk, and other information. For flexibility, all of this information is collected with shell commands, as mentioned in the installation steps chapter. The most important field is ip: there are many devices in the system, so each ip must be globally unique. For the other fields (sn, cpu, mem, disk, and so on), a fixed value can be used if collection fails. Having the shell command echo a placeholder value is enough.
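
The collect-with-fallback idea can be sketched as follows. This is a minimal illustration, not the agent's real code: the function names, the `"unknown"` placeholder, and the example commands are hypothetical, and refusing to continue when ip is missing is an assumption derived from the requirement that ip be globally unique.

```python
import subprocess


def collect(command: str, fallback: str) -> str:
    """Run a shell command and return its trimmed output, or the fallback."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=5
        )
    except (subprocess.TimeoutExpired, OSError):
        return fallback
    output = result.stdout.strip()
    if result.returncode != 0 or not output:
        return fallback
    return output


def collect_identity(commands: dict) -> dict:
    """Collect every field; only ip is treated as mandatory (assumption)."""
    info = {name: collect(cmd, fallback="unknown") for name, cmd in commands.items()}
    if info.get("ip", "unknown") == "unknown":
        raise RuntimeError("ip could not be collected; it must be unique per machine")
    return info


# Hypothetical commands: `echo` stands in for a real collector.
commands = {
    "ip": "echo 10.0.0.1",
    "sn": "echo fake-sn-0001",  # fixed placeholder value, as the text suggests
    "cpu": "exit 1",            # a failing collector falls back to "unknown"
}
info = collect_identity(commands)
```

Keeping each collector as a shell one-liner means a site can swap in whatever command works on its hardware without changing the agent itself.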

Every asset has a tenant field that represents its ownership. An administrator must assign ownership (by modifying the asset's tenant) before a tenant can use the asset. Once assigned, the asset appears under the “ Free resources ” menu in the user resource center, and each tenant can mount its free resources onto the resource tree to manage and use them by category. Tree nodes are created by right-clicking on the tree.

》 Task execution center

It is used to run scripts in batches, similar to pssh, ansible, and saltstack, but it does not support playbooks. Simple is best: just use scripts. shell, python, perl, or ruby will all do, as long as the machine has an interpreter for them. Because it is built into Nightingale, it is better integrated as a system and is tied into the permissions of the organization resource tree, so different people can be given different permissions on different machines. Some people can execute as root, while others can only use ordinary accounts. Historical execution records can be viewed and audited through the web page. Tasks themselves support several controls: pause points, tolerance, single-machine timeout, pausing midway, canceling midway, killing midway, and so on.
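
Two of these controls, tolerance and pause points, can be sketched in a few lines. This is a hedged illustration rather than Nightingale's implementation: `run_task`, `TaskResult`, and the exact semantics (tolerance as the number of failed hosts allowed, a pause point as a host after which execution stops until resumed) are assumptions. A single-machine timeout would live inside the `run_on_host` callable and is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass
class TaskResult:
    state: str          # "finished", "stopped", or "paused"
    done: List[str]     # hosts the script has run on, in order
    failed: List[str]   # hosts where the script failed


def run_task(
    hosts: List[str],
    run_on_host: Callable[[str], bool],
    tolerance: int = 0,
    pause_after: FrozenSet[str] = frozenset(),
) -> TaskResult:
    """Run a script host by host, honoring tolerance and pause points."""
    done: List[str] = []
    failed: List[str] = []
    for host in hosts:
        ok = run_on_host(host)
        done.append(host)
        if not ok:
            failed.append(host)
            # Stop as soon as failures exceed what the operator tolerates.
            if len(failed) > tolerance:
                return TaskResult("stopped", done, failed)
        # A pause point only matters if there are hosts left to run.
        if host in pause_after and len(done) < len(hosts):
            return TaskResult("paused", done, failed)
    return TaskResult("finished", done, failed)


# Fake runner for the example: the script fails only on h2.
def flaky(host: str) -> bool:
    return host != "h2"


hosts = ["h1", "h2", "h3", "h4"]
strict = run_task(hosts, flaky, tolerance=0)
lenient = run_task(hosts, flaky, tolerance=1)
canary = run_task(hosts, flaky, tolerance=1, pause_after=frozenset({"h1"}))
```

With zero tolerance the run stops at h2, with a tolerance of one it completes on all four hosts, and with a pause point after h1 it halts after the first machine so an operator can check the result before continuing.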

Frequently run scripts can be made into templates. Templates are a way of managing scripts: later, you can create a task from a template simply by filling in the list of machines to run on. Machine initialization scripts, such as installing the JDK, tuning TCP kernel parameters, or adjusting ulimit, can all be made into templates.
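
The relationship between a template and a task can be sketched like this. The `Template` and `Task` classes, their fields, and the placeholder script are hypothetical. The sketch only illustrates the idea that a task is a template plus a machine list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Template:
    title: str
    script: str
    account: str = "root"       # hypothetical default
    timeout_seconds: int = 30   # hypothetical default


@dataclass(frozen=True)
class Task:
    title: str
    script: str
    account: str
    timeout_seconds: int
    hosts: Tuple[str, ...]


def task_from_template(template: Template, hosts: List[str]) -> Task:
    """Create a task by filling in only the machine list."""
    # Drop blank entries and duplicates while keeping the original order.
    unique_hosts = tuple(dict.fromkeys(h.strip() for h in hosts if h.strip()))
    if not unique_hosts:
        raise ValueError("a task needs at least one host")
    return Task(
        title=template.title,
        script=template.script,
        account=template.account,
        timeout_seconds=template.timeout_seconds,
        hosts=unique_hosts,
    )


jdk_template = Template(
    title="install JDK",
    script="#!/bin/sh\necho 'JDK installation steps go here'\n",  # placeholder script
)
task = task_from_template(jdk_template, ["10.0.0.1", "10.0.0.2", "10.0.0.1", " "])
```

The script, account, and timeout are fixed once in the template, so creating the task leaves little room for copy-and-paste mistakes.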

The task execution center in the open source version can be viewed as a command channel. Scenario-specific applications can later be built on this command channel, such as a machine initialization platform, a service change and release platform, or a configuration distribution system. The task execution center exposes APIs for its operations; see router.go. Our command channel executes more than 600,000 tasks per week, because all kinds of upper-layer businesses rely on it.

》 Monitoring and alerting system

The core logic differs little from the v2 version. Monitoring metrics are divided into device-related and device-independent metrics. In some custom monitoring scenarios the endpoint is hard to define or changes constantly, and such data can be handled as device-independent. The monitoring dashboards have been improved, with more chart types introduced. However, Didi Nightingale is a metrics monitoring system that deals with numeric time series data, so the most useful chart is really the line chart. The other chart types look nice but have fewer use cases. Didi Nightingale can also integrate with Grafana through a dedicated DataSource plugin. Grafana looks fancier, but its performance is poor when the amount of data is large.
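
The distinction between device-related and device-independent metrics can be illustrated with a small data model. This is a sketch under stated assumptions, not Nightingale's wire format: the `Point` class, the `series_key` layout, and the metric names are hypothetical. The only idea taken from the text is that some points are identified by an endpoint and others are not.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Point:
    """One numeric sample of a time series."""
    metric: str
    value: float
    timestamp: int
    endpoint: Optional[str] = None  # set only for device-related metrics
    tags: Dict[str, str] = field(default_factory=dict)

    def is_device_related(self) -> bool:
        return self.endpoint is not None

    def series_key(self) -> str:
        """Identify the series; tags are sorted so the key is stable."""
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        owner = self.endpoint if self.endpoint is not None else "-"
        return f"{owner}/{self.metric}/{tag_part}"


# Device-related: the machine itself is the natural identity.
cpu = Point("cpu.idle", 93.5, 1611392301, endpoint="10.0.0.1")

# Device-independent: a business counter with no stable endpoint.
orders = Point(
    "orders.created", 12, 1611392301,
    tags={"service": "pay", "region": "bj"},
)
```

For the business counter, the tags carry the identity that the endpoint would otherwise provide, which is why it survives machines being replaced or rescheduled.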

03 Didi Nightingale architecture

A few key points:

  • The agent actively establishes a long-lived TCP connection with job, pulls script tasks to execute, and reports the results;
  • The agent actively calls the HTTP interface of ams to report its basic information;
  • The agent actively calls the monapi interface to pull collection policies, such as those for processes, ports, logs, and plugins;
  • The agent actively establishes a long-lived TCP connection with transfer and pushes monitoring data;
  • transfer pushes the received monitoring data to tsdb for persistence and sends a copy to judge for alert judgment;
  • index stores the index of the monitoring data. tsdb+index can also be replaced with m3db, since Didi Nightingale supports multiple back-end storage mechanisms;
  • judge is the alerting engine. It periodically pulls alert policies from monapi, evaluates the received data against thresholds, generates alert events, and pushes them to redis. monapi consumes these alert events from redis, persists them to the database, and sends alert notifications as required.
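
The threshold judgment performed by judge can be sketched as follows. This is an illustration only: the `Strategy` fields, the rule that the last `count` points must all breach the threshold, and the in-memory `deque` that stands in for the redis queue are assumptions, and the periodic pull of policies from monapi is not modeled.

```python
import operator
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Strategy:
    metric: str
    op: str           # one of the keys in OPS
    threshold: float
    count: int        # how many most recent points must all breach


class Judge:
    """Hypothetical alert engine: evaluates incoming points against strategies."""

    def __init__(self, strategies: List[Strategy]) -> None:
        self.strategies = {s.metric: s for s in strategies}
        self.history: Dict[Tuple[str, str], Deque[float]] = {}
        self.events: Deque[dict] = deque()  # stands in for the redis queue

    def receive(self, endpoint: str, metric: str, value: float) -> Optional[dict]:
        strategy = self.strategies.get(metric)
        if strategy is None:
            return None
        # Keep only the most recent `count` values per series.
        window = self.history.setdefault(
            (endpoint, metric), deque(maxlen=strategy.count)
        )
        window.append(value)
        breach = OPS[strategy.op]
        if len(window) == strategy.count and all(
            breach(v, strategy.threshold) for v in window
        ):
            event = {"endpoint": endpoint, "metric": metric, "value": value}
            self.events.append(event)
            return event
        return None


# Alert when cpu.idle stays below 10 for three consecutive points.
judge = Judge([Strategy("cpu.idle", "<", 10.0, 3)])
results = [
    judge.receive("10.0.0.1", "cpu.idle", v) for v in (50.0, 8.0, 5.0, 3.0, 40.0)
]
```

Only the fourth point produces an event, because it is the first moment at which the three most recent values are all below the threshold. A consumer playing the role of monapi would then pop events from `judge.events`, persist them, and send notifications.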

04 Didi Nightingale resources

  • Documentation: https://github.com/didi/nightingale/wiki
  • Videos: follow the official account “ Operation and maintenance of scattered troops ” and view its message history
  • Join the group: add “UlricQin” as a WeChat friend, with the note “ Nightingale plus group ”

05 Enterprise support

  • Enterprise users who run the open source version in production can join OCE, and we will provide additional, better support, such as exclusive technical salons, one-on-one communication opportunities for enterprises, and an exclusive Q&A group. The OCE application entry is in the Obsuite official account menu. You can also apply directly by clicking 【OCE authentication】.
  • If you want more powerful features and more stable commercial support, take a look at our commercial version. The introduction to the commercial version is also in the Obsuite official account menu.

Copyright notice
This article was written by [Obsuite]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2021/01/20210123165713231i.html