
Our online service blew up, and it turned out to be the logs' fault!!

2020-12-07 10:11:06 HollisChuang

This article recounts a real incident from our production environment. It happened during a big promotion and had a significant impact on our online cluster. What follows is a brief review of the problem. For readability, the actual troubleshooting and resolution process may not match this description exactly, but the overall approach is the same.

What happened

During a promotion, an online application suddenly triggered a large number of alarms indicating that disk utilization was too high, at one point exceeding 80%.

We logged in to the online machine right away and checked its disk usage with the df command:

$df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 58911440 4003120  93% /
/dev/sda2       62914560 58911440 4003120   93% /home/admin

The disk usage on the machine was indeed quite serious. Since request volume is much higher during the promotion, we began to suspect that too many logs were being written and exhausting the disk.

Some background first: our online machines are configured to compress and clean up logs automatically. Cleanup is triggered once a single file reaches a certain size, or once overall disk usage on the machine reaches a certain threshold.

But on the day of the promotion, log cleanup was not triggered in time, so for a while the machine's disk was almost completely used up.

After some investigation, we found that several of the application's log files were taking up a lot of disk space and were still growing:

du -sm * | sort -nr
512 service.log.20201105193331
428 service.log.20201106151311
356 service.log.20201108155838
286 service.log.20201107195011
256 service.log

du -sm * | sort -nr: show the size (in MB) of each file in the current directory, sorted from largest to smallest.

So, after discussing with the ops engineers, we decided to take emergency action.

The first emergency measure was to clean up log files manually. An ops engineer logged in to the server and deleted some of the less important ones:

rm service.log.20201105193331

But after the cleanup command ran, the disk utilization on the machine did not drop, and it was still growing.

$df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 58911440 4003120  93% /
/dev/sda2       62914560 58911440 4003120  93% /home/admin

So we started to investigate why the disk space had not been released even though the logs had been deleted. With the following command, we found that a process was still holding the deleted file open:

lsof |grep deleted
SLS   11526  root   3r   REG   253,0 2665433605  104181296 /home/admin/****/service.log.20201205193331 (deleted)

lsof | grep deleted: list all open files and filter for those that have been deleted.

It turned out that this was an SLS process, which continuously reads log content from the machine.

SLS is Alibaba's log service. It provides one-stop data collection, cleansing, analysis, visualization and alerting. In short, it collects logs from servers, persists them, and then makes them available for querying and analysis.

Our online logs are collected through SLS, so our analysis pointed to the unreleased disk space being related to SLS's reading of the log file.

At this point the root cause was basically identified. Before moving on, let's pause and introduce the principle behind it.

Background knowledge

In Linux, file deletion is controlled by link counts: a file is only removed when it no longer has any links.

Generally speaking, every file has two "link" counters: i_count and i_nlink. In other words, in Linux a file is only really deleted when both i_nlink and i_count are 0.

  • i_count is the number of current users (processes) of the file;
  • i_nlink is the number of links on the storage medium, i.e. the number of hard links;
  • You can think of i_count as an in-memory reference counter and i_nlink as an on-disk reference counter.

When a file is opened by a process, its i_count increases; when a hard link to the file is created, its i_nlink increases.

In Linux or Unix systems, deleting a file with rm or a file manager only unlinks it from the file system's directory structure; that is, it decreases the on-disk reference count i_nlink, but it does not decrease i_count.

If a file is currently held open by a process and a user "deletes" it with rm, the file will no longer show up under ls or other file management commands, but that does not mean it has actually been removed from disk.

As long as a running process is still reading from or writing to the file, the file has not really been "deleted", so its disk space remains occupied.

Our online problem was exactly this principle at work: because a process was still operating on the log file, the rm did not actually delete it, and the disk space was never released.
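To make this concrete, here is a minimal, self-contained sketch of the behavior, written in Java purely for illustration (the same applies to any process on Linux); the file name and content are made up for the demo.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Demo of the i_nlink / i_count behavior described above: unlinking a file
// that a process still holds open does not free its disk space until the
// last open handle is closed.
public class DeletedButOpen {
    public static void main(String[] args) throws Exception {
        Path log = Paths.get("service.log.demo");
        Files.write(log, "some log content\n".getBytes());

        try (InputStream in = Files.newInputStream(log)) { // i_count > 0: we hold the file open
            Files.delete(log); // like `rm`: i_nlink drops to 0 and the directory entry disappears

            // `ls` no longer shows the file, and `lsof | grep deleted` would list it,
            // but the content is still readable and its blocks are still allocated.
            System.out.print(new String(in.readAllBytes()));
        } // handle closed here: i_count drops to 0 and the kernel finally frees the space
    }
}

Running df before and after the try block closes would show the space only being reclaimed at the end, which is exactly what we saw online after the reading process went away.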

Problem solving

With the symptoms observed online and the background knowledge above, the way to solve the problem becomes clear.

We needed to break the SLS process's reference to the log file, so that the file could actually be deleted and the disk space truly released:

kill -9 11526
$df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 50331648 12582912  80% /
/dev/sda2       62914560 50331648 12582912  80% /home/admin

A special reminder: before running kill -9, always think through the consequences. For the principle behind this, see: "After I ran kill -9 on the server, I was told the next day not to come back!"

Afterwards, we did a post-mortem to understand why this problem occurred and found two main causes:

  • 1. The application printed too many logs, too frequently.
  • 2. SLS pulled the logs too slowly.

Deeper analysis showed that this application prints a large number of process logs. The logging was originally added to make troubleshooting online issues easier and to support data analysis, but during the promotion the log volume surged, which made disk usage grow rapidly.

In addition, this application shares a single SLS project with several other large applications, which dragged down SLS's pull speed, so the collector process never finished reading (and releasing) the deleted log file.

Afterwards we also came up with some improvements. For the second cause, we split this application's SLS configuration out into its own project so it can be managed independently.

For the first cause, we introduced a log degradation strategy during promotions: once the disk gets close to full, logging is downgraded.

For log degradation, I developed a general-purpose tool that pushes a log level through configuration and dynamically changes the log output level of the online application. The configuration change is also registered on our contingency-plan platform, so that during promotions it can be executed as a scheduled or emergency plan to avoid this problem.
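As a rough illustration of the core idea only (the author's actual tool is covered in the next article), here is a minimal sketch of changing log levels at runtime, assuming a Java application that logs through SLF4J with Logback; the class and the setLevel method below are hypothetical names, and in practice they would be triggered by a value pushed from a configuration center.

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;

// Minimal sketch of "log degradation": raise the effective log level at runtime
// so that verbose INFO/DEBUG process logs stop being written when the disk is tight.
public class LogDegradation {

    // Change the level of a named logger (or the root logger) without a restart.
    public static void setLevel(String loggerName, String level) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        Logger logger = context.getLogger(loggerName);
        logger.setLevel(Level.toLevel(level, Level.WARN)); // fall back to WARN on a bad value
    }

    public static void main(String[] args) {
        // In practice a config-center listener would call this; here we call it directly.
        setLevel(Logger.ROOT_LOGGER_NAME, "WARN");
        LoggerFactory.getLogger(LogDegradation.class).info("this INFO line is now suppressed");
    }
}

Wired into a config push, a single configuration change like this can downgrade logging across a whole cluster during a promotion and switch it back once the peak is over.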

I'll share the design ideas and code of the log degradation tool in the next article.

Reflection

After every big promotion, we'll resume , You will find that most of the problems are caused by a few small problems .

Analyzing these problems often requires knowledge well beyond everyday development skills, such as operating systems, computer networks, databases, and even hardware.

So I've always believed that the real measure of how good a programmer is, is his ability to solve problems!

Copyright notice
This article was created by [HollisChuang]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2020/12/202012071008410080.html