01 Introduction to Didi Nightingale
Didi Nightingale is a distributed, highly available operations and maintenance (O&M) monitoring system. Its most distinctive feature is hybrid cloud support: it covers traditional physical and virtual machines as well as K8S container environments. Nightingale is also more than monitoring. It includes some CMDB capabilities and O&M automation capabilities, and many companies build their own O&M platforms on top of it. The open-source modules are also part of the commercial version, so their reliability is assured and they will continue to be maintained. You can use them with confidence.
02 Main functions of Didi Nightingale
This section describes the core functional modules of Didi Nightingale. The user resource center and the monitoring and alerting system deserve the most attention.
》 User resource center
This is the foundation of the platform: every O&M system depends on it. It has built-in management of users, permissions, roles, organizations, and resources. At its core is an organization resource tree whose node categories and extended fields can be customized. The simplest hierarchy for the tree is: tenant ---> project ---> module. A slightly more complex one is: tenant ---> organization ---> project ---> module ---> cluster, and organizations can be nested.
Two kinds of objects are attached to nodes: personnel permissions and resources. Resources can be of any type. Besides hosts and network devices, they can also be RDS instances or Redis instances, provided that the RDS and Redis management systems are integrated with RDB. Didi builds some large commercial solutions behind the scenes, and RDB serves as the foundation for them.
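The tree described above can be sketched roughly as follows. This is a minimal illustration, not Nightingale's actual data model: the `Node` class, its field names, and the rule that a role granted on a node also applies to the nodes below it are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """One node of an organization resource tree (hypothetical model)."""
    name: str
    category: str  # customizable, e.g. "tenant", "organization", "project", "module"
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)         # e.g. host IPs, RDS ids
    permissions: Dict[str, str] = field(default_factory=dict)  # user -> role

    def add_child(self, name: str, category: str) -> "Node":
        child = Node(name=name, category=category, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Dot-separated path from the root down to this node."""
        parts = []
        node: Optional["Node"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))

    def all_resources(self) -> List[str]:
        """Resources attached to this node and to every node below it."""
        found = list(self.resources)
        for child in self.children:
            found.extend(child.all_resources())
        return found

    def role_of(self, user: str) -> Optional[str]:
        """Nearest role found walking up the tree (assumed inheritance rule)."""
        node: Optional["Node"] = self
        while node is not None:
            if user in node.permissions:
                return node.permissions[user]
            node = node.parent
        return None
```

For example, a tenant node can have a project child with a module child below it. A host attached to the module is then visible from the tenant through `all_resources()`.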
》 Asset management system
The asset management system here manages hardware assets. Its users are usually asset managers in the systems department; application O&M staff pay relatively little attention to it. The open-source version provides management of host devices. You can extend it with your own development to add network device management, rack position management, management of spare parts and consumables, and so on. With this foundation in place, it is relatively easy to grow other systems on top of it.
After the agent is installed, it registers itself with the asset management system automatically and collects the machine's sn, ip, cpu, mem, disk, and similar information. For flexibility, all of this is collected with shell commands, as mentioned in the installation steps chapter. The most important field is ip: with many devices in the system, each ip must be globally unique. For the other fields (sn, cpu, mem, disk, etc.), if collection fails you can use a fixed value instead; having the shell command echo a placeholder value is enough.
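The shell-based collection with a placeholder fallback can be sketched as below. This is an illustration only, not the agent's real code: the function names, the `"unknown"` placeholder, and the example commands are assumptions. The one rule taken from the text is that ip is mandatory while the other fields may fall back to a fixed value.

```python
import subprocess
from typing import Dict


def run_shell(command: str, timeout: float = 5.0) -> str:
    """Run one shell command; return trimmed stdout, or "" on any failure."""
    try:
        done = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return ""
    if done.returncode != 0:
        return ""
    return done.stdout.strip()


def collect_identity(commands: Dict[str, str], fallback: str = "unknown") -> Dict[str, str]:
    """Collect each field with its shell command (hypothetical helper).

    ip must be collected successfully because it identifies the machine;
    every other field falls back to a fixed placeholder value.
    """
    info = {}
    for key, command in commands.items():
        value = run_shell(command)
        if not value:
            if key == "ip":
                raise RuntimeError("ip could not be collected and has no fallback")
            value = fallback
        info[key] = value
    return info
```

Calling `collect_identity({"ip": "echo 10.0.0.1", "sn": "exit 1"})` keeps the collected ip and fills sn with the placeholder.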
Every asset has a tenant field that represents its ownership. An administrator needs to assign ownership (by changing the asset's tenant) so that each tenant can use its own assets. Once assigned, the assets appear under the "Free resources" menu of the user resource center, and each tenant can attach those free resources to the resource tree to manage and use them by category. Tree nodes are created by right-clicking on the tree.
》 Task execution center
It is used to run scripts in batches, similar to pssh, ansible, and saltstack, but it does not support playbooks. The simplest approach is the best one: just use scripts. shell, python, perl, and ruby all work, as long as an interpreter is available on the machine. Because it is built into Nightingale, it is better integrated and tied to the permissions of the organization resource tree. Different people can have different permissions on different machines: some may execute with the root account while others may use only ordinary accounts. Execution history can be viewed and audited on a web page. Tasks support several controls: pause points, tolerance, per-machine timeout, pausing midway, cancelling midway, killing midway, and so on.
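Two of these controls, tolerance and pause points, can be illustrated with a small sketch. The semantics are assumptions made for the example rather than Nightingale's documented behavior: tolerance is taken to be the number of failed machines allowed before the task stops, and a pause point is a machine after which the task waits for an operator. `run_task` and `execute` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_task(
    hosts: List[str],
    execute: Callable[[str], bool],
    tolerance: int = 0,
    pause_after: Optional[str] = None,
) -> Tuple[Dict[str, str], str]:
    """Run `execute` on each host in order (hypothetical sketch).

    Stops once more than `tolerance` hosts have failed, and pauses after
    the host named by `pause_after` if hosts remain to be processed.
    Returns the per-host results and the final task state.
    """
    results: Dict[str, str] = {}
    failures = 0
    for index, host in enumerate(hosts):
        ok = execute(host)
        results[host] = "success" if ok else "failed"
        if not ok:
            failures += 1
            if failures > tolerance:
                return results, "stopped"
        is_last = index == len(hosts) - 1
        if host == pause_after and not is_last:
            return results, "paused"
    return results, "done"
```

With `tolerance=1`, the first failure is absorbed and the second one stops the task, leaving the remaining machines untouched.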
Scripts that are run often can be made into templates. Templates are a way of managing scripts: later you can create a task from a template by just filling in the list of machines to run it on. Machine initialization scripts, such as installing the JDK, tuning TCP kernel parameters, or adjusting ulimit, are good candidates for templates.
The open-source task execution center can be seen as a command channel, on which scenario-specific applications can later be built, such as a machine initialization platform, a service change and release platform, or a configuration distribution system. The task execution center exposes its operations through APIs; see router.go. Our command channel executes more than 600,000 tasks per week, because many upper-layer services depend on it.
》 Monitoring and alerting system
The core logic differs little from the v2 version. Monitoring metrics are divided into device-related and device-independent metrics. In some custom monitoring scenarios the endpoint is hard to define or changes constantly, and such data can be handled as device-independent. The dashboards have been improved and more chart types have been introduced. However, Didi Nightingale is a metrics monitoring system that deals with numeric time series data, so the most useful chart is really the line chart; other chart types look good but have fewer use cases. Didi Nightingale can also integrate with Grafana through a dedicated DataSource plugin. Grafana looks fancier, but its performance is poor when the data volume is large.
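The difference between the two kinds of metrics can be pictured with a small data model. This is a sketch under assumptions: the `Point` class, the `nid` field (taken here to be the id of a resource tree node that owns a device-independent series), and the series key format are illustrative, not Nightingale's wire format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Point:
    """One numeric time series sample (hypothetical model)."""
    metric: str
    value: float
    timestamp: int
    endpoint: str = ""  # set for device-related metrics, e.g. a host ip
    nid: str = ""       # set for device-independent metrics (assumed: tree node id)
    tags: Dict[str, str] = field(default_factory=dict)

    def series_key(self) -> str:
        """Identify the series by its owner, metric name, and sorted tags."""
        if bool(self.endpoint) == bool(self.nid):
            raise ValueError("set exactly one of endpoint or nid")
        owner = f"endpoint={self.endpoint}" if self.endpoint else f"nid={self.nid}"
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return "/".join(part for part in (owner, self.metric, tag_part) if part)
```

A host metric such as cpu.idle carries an endpoint, while a business metric whose source machine keeps changing is attached to a tree node instead.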
03 Didi Nightingale architecture
A few key points:
- The agent actively establishes a long-lived TCP connection with job, pulls script tasks to execute, and reports the results;
- The agent actively calls the HTTP interface of ams to report its basic information;
- The agent actively calls the monapi interface to pull collection strategies, such as those for processes, ports, logs, and plugins;
- The agent actively establishes a long-lived TCP connection with transfer and pushes monitoring data;
- transfer pushes the received monitoring data to tsdb for persistence, and sends a copy to judge for alert evaluation;
- index stores the index of the monitoring data. tsdb+index can also be replaced with m3db; Didi Nightingale supports multiple back-end storage mechanisms;
- judge is the alerting engine. It periodically pulls alert strategies from monapi, evaluates the received data against thresholds, generates alert events, and pushes them to redis. monapi consumes these events from redis, persists them to the database, and sends alert notifications as required.
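The last step of this flow, threshold evaluation in judge followed by event hand-off to monapi, can be sketched as below. This is a simplified illustration, not the real implementation: the `Strategy` fields, the rule that the most recent `count` values must all breach the threshold, and the in-memory `deque` standing in for redis are assumptions.

```python
import operator
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass
class Strategy:
    """A threshold alert strategy (hypothetical shape)."""
    metric: str
    op: str
    threshold: float
    count: int  # how many of the most recent values must all breach


def judge(strategy: Strategy, values: List[float]) -> bool:
    """True if the last `count` values (newest last) all breach the threshold."""
    if len(values) < strategy.count:
        return False
    compare = OPS[strategy.op]
    return all(compare(v, strategy.threshold) for v in values[-strategy.count:])


# In-memory stand-in for the redis queue between judge and monapi.
event_queue: Deque[Dict[str, object]] = deque()


def evaluate(strategy: Strategy, endpoint: str, values: List[float]) -> None:
    """judge side: push an alert event when the strategy fires."""
    if judge(strategy, values):
        event_queue.append(
            {"endpoint": endpoint, "metric": strategy.metric, "value": values[-1]}
        )


def consume() -> List[Dict[str, object]]:
    """monapi side: drain queued events (persistence and notification omitted)."""
    events = []
    while event_queue:
        events.append(event_queue.popleft())
    return events
```

Decoupling judge from monapi through a queue means the engine only has to produce events; storing them and notifying people happens on the consumer side.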
04 Didi Nightingale resources
- Documentation: https://github.com/didi/nightingale/wiki
- Videos: follow the WeChat official account "Operation and maintenance of scattered troops" and view its message history
- Join the group: add "UlricQin" as a WeChat friend, with the remark "Nightingale plus group"
05 Corporate support
- Enterprise users who run the open-source version in production can join OCE, and we will provide additional, better support, such as exclusive technical salons, one-on-one communication opportunities for enterprises, and an exclusive Q&A group. The OCE application entry is in the menu of the Obsuite official account; you can also apply directly by clicking 【OCE authentication】.
- If you want more powerful features and more stable commercial support, take a look at our commercial version. Its introduction entry is also in the menu of the Obsuite official account.