
Notes on Troubleshooting a Slow Interface Query

2021-08-08 15:38:16 Tian Xiaobo

1. Preface

This article records the process of troubleshooting a slow interface query. The symptoms were confusing, and because the problem was highly sporadic it was also difficult to investigate. The investigation took the clue "pointed to" by the druid data source as its entry point and worked through the possibilities step by step, yet after ruling out many of them there was still no answer. Only after re-examining the symptoms from a different angle did the root cause emerge. The cause was then verified, and the results matched expectations. With that, the investigation was over, and this post records and archives the problem.

2. Problem description

Some time ago I received feedback from business colleagues that one machine of their application had seen slow queries on multiple interfaces twice over two consecutive days (occasional RT spikes). Each episode was brief and recovered quickly. The Pinpoint call-chain information looked like this:

Figure 1: Call chain of a long-RT interface request

The figure above is what the business colleagues sent. It shows a 2111 ms gap between MyBatis's SqlSessionTemplate#selectList method and the MySQL driver's ConnectionImpl#prepareStatement method, and it is this gap that pushed up the interface RT. The next sections analyze this phenomenon.

3. The troubleshooting process

3.1 Going straight at it

From the trace reported by the full-link tracing system, the cause looks obvious: there is a long-running operation somewhere between selectList and prepareStatement. In theory, all we need to do is find where that time goes and why, and the problem is solved. So, without further ado, straight into the MyBatis source.

public class SimpleExecutor extends BaseExecutor {

    public <E> List<E> doQuery(MappedStatement ms, Object parameter, RowBounds rowBounds, ResultHandler resultHandler, BoundSql boundSql) throws SQLException {
        Statement stmt = null;
        try {
            Configuration configuration = ms.getConfiguration();
            StatementHandler handler = configuration.newStatementHandler(wrapper, ms, parameter, rowBounds, resultHandler, boundSql);
            stmt = prepareStatement(handler, ms.getStatementLog());    // precompile the statement
            return handler.<E>query(stmt, resultHandler);
        } finally {
            closeStatement(stmt);
        }
    }

    private Statement prepareStatement(StatementHandler handler, Log statementLog) throws SQLException {
        Statement stmt;
        Connection connection = getConnection(statementLog);    // obtain a database connection
        stmt = handler.prepare(connection, transaction.getTimeout());    // precompile
        handler.parameterize(stmt);
        return stmt;
    }
}

MyBatis's SqlSessionTemplate#selectList has a long call chain that I won't walk through one method at a time here; if you're interested, see my article from three years ago, MyBatis Source Code Analysis – The SQL Execution Process. SqlSessionTemplate#selectList eventually reaches SimpleExecutor#doQuery, which first calls SimpleExecutor#prepareStatement to precompile the statement before running the query. prepareStatement involves two external operations: obtaining a database connection, and calling into the MySQL driver to perform the precompilation.

Judging from the monitoring in Figure 1, the precompilation is fast, so we can be sure it is not the problem. Attention therefore turns to getConnection, which ultimately asks the underlying druid data source for a connection. druid maintains its connection pool with a producer–consumer model; for a more detailed introduction, see my other article. If there are no idle connections in the pool, the business thread has to wait for the druid producer to create one; if the producer is slow to create connections, or the number of active connections has already reached the maximum, the business thread can only wait (a simplified sketch of this arrangement follows). Fortunately, the application has druid monitoring hooked up, so we can inspect the connection pool.
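To make that waiting behaviour concrete, here is a heavily simplified sketch of a producer–consumer connection pool of the kind described above. This is not druid's actual code — the class and its names are made up for illustration — but the shape is the same: consumers wait on a condition with a bounded timeout, while a producer thread creates connections and signals them.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Simplified producer-consumer pool sketch; illustration only, not druid's implementation.
public class SimplePool<T> {

    private final Deque<T> idle = new ArrayDeque<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();   // business threads wait here for a connection

    // Business thread: borrow a connection, waiting at most maxWaitMillis.
    public T borrow(long maxWaitMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        lock.lock();
        try {
            while (idle.isEmpty()) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return null;                               // timed out; druid would throw GetConnectionTimeoutException here
                }
                notEmpty.awaitNanos(remaining);                // timed, wall-clock wait
            }
            return idle.pollLast();
        } finally {
            lock.unlock();
        }
    }

    // Producer thread: hand a freshly created connection to waiting consumers.
    public void offer(T connection) {
        lock.lock();
        try {
            idle.addLast(connection);
            notEmpty.signal();
        } finally {
            lock.unlock();
        }
    }
}

The part that matters for this investigation is the borrow side: when the producer cannot keep up, or the active-connection ceiling has been reached, business threads pile up in the timed wait until it expires. The druid monitoring for the incident window is shown next.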

Figure 2: druid connection pool monitoring

The chart above was redrawn in Excel and the data has been edited, so it deviates somewhat from the actual situation at the time, but that does not affect the analysis. From the monitoring, all the idle connections in the pool had been lent out and it still was not enough, so the producer kept creating new connections. This got us thinking: why did the number of active connections suddenly rise so much? Perhaps there was a slow query. At the same time, the number of waits and the wait time of business threads both increased significantly, which suggests the druid producer's "capacity" was insufficient and some business threads had to wait. Under what circumstances would the producer have trouble creating connections? At the time we considered the following:

  1. Network latency or packet loss, causing retransmissions
  2. The Alibaba Cloud RDS instance was under heavy load and could not respond to connection requests in time

Start with the second case: the RDS load is easy to check, since Alibaba Cloud provides RDS monitoring. We checked the monitoring curves for both occurrences and found the instance load was not high and showed no significant fluctuation, so case two could be ruled out. What about case one — could the network have been at fault? That conjecture was harder to dismiss, for the following reason:

The number of active connections is normally very low, and a sudden surge is generally abnormal. If a Java thread takes 5 ms from sending the SQL to receiving the result, a single connection can sustain 1000 ms / 5 ms = 200 QPS. This interface's QPS was below 200, so it should not need that many connections — unless there were slow queries. But Alibaba Cloud's SQL Insight service showed no slow SQL, so slow queries could be ruled out as well. No slow queries — then it seemed we could only blame the network (the ever-willing scapegoat): something must have gone wrong with the network at the time.

If the network really did have a problem, the druid producer's "insufficient capacity" would seem reasonably explained. But our analysis turned up some contradictions, as follows:

Figure 3: Trace data for some long-RT requests

From the trace data, the SQL precompilation and execution were both fast, with no obvious waiting. If this were a single sample it might be coincidence, but we examined a lot of trace data and all of it showed fast precompilation and execution with no significant delay. So pinning the blame on the network does not hold up either.

At this point the trail went cold: there were no slow queries, the database had plenty of capacity, and the network seemed fine. Nothing was wrong — so where was the problem?

3.2 Widening the search

3.2.1 Basic metrics

First, check the problem machine's QPS: no traffic burst. There were some fluctuations, but overall it stayed stable.

Figure 4: QPS line chart

Then I looked at the application's CPU usage and did find something: usage had suddenly risen a lot.

Figure 5: CPU usage line chart

We asked the business colleagues: there was no scheduled task at that time and the QPS was similar to usual, nothing out of the ordinary. For the moment we could not tell why the CPU usage rose so sharply, and since this information was not immediately actionable, we set it aside. Finally, the network I/O monitoring.

Figure 6: Network traffic monitoring

Both inbound and outbound traffic increased during the problem window, and the outbound traffic increased a great deal, which suggests a lot of data was being sent out at that time. Which interface was responsible, however, was not yet known.

3.2.2 Application logs

Then I went through the application logs and found some WARN-level entries:

# WARN logged by business threads: waiting for a connection timed out, retrying
2021-07-20 09:56:42.422 [DubboServerHandler-thread-239] WARN  com.alibaba.druid.pool.DruidDataSource [DruidDataSource.java:1400] - get connection timeout retry : 1
2021-07-20 09:56:42.427 [DubboServerHandler-thread-242] WARN  com.alibaba.druid.pool.DruidDataSource [DruidDataSource.java:1400] - get connection timeout retry : 1
2021-07-20 09:56:42.431 [DubboServerHandler-thread-240] WARN  com.alibaba.druid.pool.DruidDataSource [DruidDataSource.java:1400] - get connection timeout retry : 1

# Dubbo TimeoutFilter invocation-timeout log
2021-07-20 09:56:42.432 [DubboServerHandler-thread-254] WARN  org.apache.dubbo.rpc.filter.TimeoutFilter [TimeoutFilter.java:60] -  [DUBBO] invoke time out. method: xxx arguments: [yyy] , url is dubbo://172.***.***.***:20880/****
2021-07-20 09:56:42.489 [DubboServerHandler-thread-288] WARN  org.apache.dubbo.rpc.filter.TimeoutFilter [TimeoutFilter.java:60] - [DUBBO] invoke time out. method: **** arguments: [***, ***] , url is dubbo://172.***.***.***:20880/****

To be honest, these two kinds of log entries are not much use, because they are consequences — the real trouble had already happened upstream. Apart from the WARN entries, no exceptions appeared in the application log. One thing is worth noting: we did not configure a retry count for business threads acquiring connections, and the default retry count is 0, yet a retry happened anyway. We would expect that once a business thread fails to obtain a connection within the configured time, a GetConnectionTimeoutException should be thrown. This looks like a problem in druid's code; I filed an issue with them (#4381), but nobody has replied. It is also easy to fix — a good first issue — so interested readers are welcome to submit a pull request.

With little to go on in the application logs, we kept looking for clues — this time in the JVM monitoring.

3.2.3 GC Information

In the short window when the problem occurred, the JVM performed many young GCs and full GCs, and some of them took a long time.

Figure 7: GC frequency and pause-time monitoring

So here's the question: is the GC a cause or an effect? Both the GCs and the slow interface queries occurred after 9:56:30 and overlap in time, so it was unclear which affected which, or whether they affected each other. Bringing GC into the picture seemed only to complicate the investigation.

By this point I had collected just about all the information available, yet still could not infer the cause. Frustrating.

3.3 Re-examining the evidence

3.3.1 Pulling the information together

If we kept digging in the druid direction, we were unlikely to find the cause. Time being precious, it was time to rethink the collected evidence. Since GCs occurred, the application's memory consumption must have risen. Combined with the network I/O — a fairly large fluctuation in a short time — this looked consistent, but the traffic at the time was not really large enough to trigger GC on its own, so the support was weak and we set it aside. Also, the application's QPS had barely changed while the CPU load jumped a lot; together with several long GCs, the two were probably related — that is, the long GCs drove up the CPU load. At that point there were two directions: investigate the network I/O, or investigate the GC. Judging from the symptoms, GC looked like the bigger problem, so we chose to investigate in the GC direction.

3.3.2 Changing the angle

If the JVM is doing GC, memory usage must have climbed, and memory growth is a cumulative process. If we push the investigation window back one minute from the long GCs around 9:57:00, we might find something. So I went to the full-link tracing platform, pulled the problem application's long-RT requests within the minute of 9:56, sorted by elapsed time in descending order, and found that the top ten were all the method queryAll. As follows:

Figure 8: queryAll requests sorted by elapsed time (descending)

Next, the call chain of the slowest of these requests:

Figure 9: Call chain of the slowest request

First, the MySQL driver's execute call takes very long — the reason is analyzed later. Second, the redis cache read is very fast, so no problem there. But the redis client takes a long time to write data, which is highly abnormal.

So I immediately asked the application's developers for code access and went through the code. The pseudocode is as follows:

public class XxxService {

    // Note: the real method uses Spring's @Cacheable annotation rather than operating
    // the cache explicitly; the explicit calls below are pseudocode for illustration.
    public List<XxxDTO> queryAll(int xxxId) {
        // 1. Check the cache; return immediately on a hit
        List<XxxDTO> xxxDTOs = customRedisClient.get(xxxId);
        if (xxxDTOs != null) return xxxDTOs;

        // 2. Cache miss: query the database
        List<XxxDO> xxxDOs = xxxDao.queryAll(xxxId);
        xxxDTOs = convert(xxxDOs);

        // 3. Write back to the cache with a 300-second TTL
        customRedisClient.set(xxxId, xxxDTOs, 300);
        return xxxDTOs;
    }
}
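As the note in the pseudocode says, the real method uses Spring's cache abstraction instead of calling redis explicitly. Declaratively it looks roughly like the sketch below; the cache name and key expression are my guesses, and the 300-second TTL would live in the cache manager configuration rather than on the annotation.

// Rough shape of the declarative version (names are illustrative, not the real ones):
@Cacheable(cacheNames = "xxxList", key = "#xxxId")
public List<XxxDTO> queryAll(int xxxId) {
    List<XxxDO> xxxDOs = xxxDao.queryAll(xxxId);   // cache miss: pull everything from the database
    return convert(xxxDOs);                        // DO -> DTO conversion
}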

The code itself is simple, so why is it so slow? The reason: when the cache misses, queryAll pulls tens of thousands of rows from the database in one go, and the table contains some bulky fields (business rules, mailing addresses, and so on), so individual rows are fairly large. After the data is loaded, it goes through two model conversions (DO → DTO → TO); the conversions produce about half again as many objects as the original rows — roughly 15,000 — with as many TOs as DTOs. After the conversion comes the cache write, which involves serialization. A single queryAll call therefore produces roughly five large chunks of data in the young generation: the JDBC ResultSet, the DO list, the DTO list, the TO list, and the serialized form of the DTO list. On top of that, there were more than twenty such calls within about two seconds, which multiplied the memory consumption — and that should explain why the GC count suddenly shot up. A rough estimate below gives a sense of the scale, after which a few remaining questions are answered FAQ-style.
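A back-of-the-envelope estimate, using the per-DTO retained size measured later in section 5.2 and treating everything as order-of-magnitude only:

one call:  ~10,000–15,000 DTOs × ~5.1 KB retained each ≈ 50–75 MB for the DTO list alone,
           plus a DO list, a TO list, the JDBC ResultSet and the serialized bytes
           ≈ a few hundred MB of short-lived objects per call
burst:     20+ such calls within ~2 seconds ≈ several GB allocated in a very short window,
           against a young generation of roughly 2.7 GB (ParNew: 2831168K in the GC log
           from the reproduction in section 4) — more than enough to force back-to-back
           collections and promotion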

Q: How do we explain the long GC pauses, then?

A: My guess is that the collector had to copy and compact a large volume of live data, which is what made the pauses long.

--------------------------------------------------------------------------------------------------

Q: And why is there such a long gap between the execute method and the set method?

A: The current guess is that the model conversions and the serialization themselves take some time, and several serializations were probably running at once — but that alone does not explain a gap this long. It becomes much more reasonable once GC is factored in: between 9:56:33 and 9:56:52 there were multiple GCs, some with long stop-the-world pauses, and those pauses are what stretched the interval between the two methods. In fact, the call chain of a queryAll request at 9:56:31, before the GC storm, shows nothing like that gap:

Figure 10: queryAll timing under normal circumstances

Therefore, the very long gaps between the execute and set methods in the later call chains can be attributed to the high CPU usage, the GC pauses, and related factors.

--------------------------------------------------------------------------------------------------

Q: Why is the gap between the first set and the second set so long?

A: The first set is our custom logic; between it and the second set there seemed to be nothing special, and at first we found no problem there. Fortunately, a clue turned up during the reproduction — it is explained in a later section.

--------------------------------------------------------------------------------------------------

Finally, back to the original problem: the business colleagues' report that the RT of several interfaces had risen. Again, FAQ-style.

Q: Why did the RT of multiple interfaces rise? Their call chains look like the following figure:

Figure 11: Call chains of some long-RT interfaces

A: Some business threads were waiting for druid to create database connections when GC kicked in and caused stop-the-world pauses, and those pauses interfere with the waiting logic. For example, a thread calls awaitNanos to wait 3 seconds, but a 5-second GC (STW) happens in the middle; by the time the GC finishes, the thread has already timed out. In the druid data source, the maxWait parameter controls how long a business thread waits; the code is as follows:

public class DruidDataSource extends DruidAbstractDataSource {

    private DruidPooledConnection getConnectionInternal(long maxWait) throws SQLException {
        // ...
        final long nanos = TimeUnit.MILLISECONDS.toNanos(maxWait);
        if (maxWait > 0) {
            // try to take a connection from the pool within the configured time
            holder = pollLast(nanos);
        } else {
            holder = takeLast();
        }

        // ...
    }
}
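For completeness, maxWait is configured on the data source. A minimal configuration sketch is shown below — the values are illustrative, not the application's actual settings. The point to keep in mind is that the wait is measured in wall-clock time, so a stop-the-world pause counts fully against the budget.

// Illustrative druid configuration; the values are examples only.
DruidDataSource dataSource = new DruidDataSource();
dataSource.setUrl("jdbc:mysql://127.0.0.1:3306/xxx");
dataSource.setUsername("xxx");
dataSource.setPassword("xxx");
dataSource.setInitialSize(5);
dataSource.setMaxActive(20);    // ceiling on active connections
dataSource.setMaxWait(3000);    // a business thread waits at most 3000 ms for a connection;
                                // the clock keeps running during a GC pause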

3.4 Preliminary conclusion

After the investigation above, most of the symptoms have reasonable explanations, and it is time to draw a conclusion:

Over two consecutive days, multiple interfaces of the xxx application twice showed a sharp rise in RT. The investigation concluded that after the cache used by queryAll expired, dozens of requests within a short window each performed a massive database query, model conversion, and serialization, consuming a large amount of memory and triggering several long GCs. As a result, some business threads timed out while waiting for database connections, and the Dubbo thread pool then filled up.

Next, we reproduce the scenario to prove that this conclusion is correct.

4. Reproducing the problem

To reproduce the problem, we should simulate the conditions at the time as closely as possible, otherwise the results may be skewed. To accurately simulate the interface calls, I wrote a small test controller that can control the QPS and the total number of requests:

public class XxxController {

    @Resource
    private XxxApi xxxApi;

    public Object invokeQueryAll(Integer qps, Integer total) {
        // Guava RateLimiter caps the call rate; the 50 worker threads block on acquire()
        RateLimiter rl = RateLimiter.create(qps.doubleValue());
        ExecutorService es = Executors.newFixedThreadPool(50);
        for (int i = 0; i < total; i++) {
            es.submit(() -> {
                rl.acquire();
                xxxApi.queryAll(0);
            });
        }
        es.shutdown();    // stop accepting new tasks and let the submitted calls drain
        return "OK";
    }
}

The reproduction behaved as expected: CPU usage and network I/O both rose (the monitoring is on an internal company system, so no screenshots here). GC also kicked in, with some long pauses. The GC log is as follows:

2021-07-29T19:09:07.655+0800: 631609.822: [GC (Allocation Failure) 2021-07-29T19:09:07.656+0800: 631609.823: [ParNew: 2797465K->314560K(2831168K), 2.0130187 secs] 3285781K->1362568K(4928320K), 2.0145223 secs] [Times: user=3.62 sys=0.07, real=2.02 secs] 
2021-07-29T19:09:11.550+0800: 631613.717: [GC (Allocation Failure) 2021-07-29T19:09:11.551+0800: 631613.718: [ParNew: 2831168K->314560K(2831168K), 1.7428491 secs] 3879176K->1961168K(4928320K), 1.7443725 secs] [Times: user=3.21 sys=0.04, real=1.74 secs] 
2021-07-29T19:09:13.300+0800: 631615.467: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1646608K(2097152K)] 1965708K(4928320K), 0.0647481 secs] [Times: user=0.19 sys=0.00, real=0.06 secs] 
2021-07-29T19:09:13.366+0800: 631615.533: [CMS-concurrent-mark-start]
2021-07-29T19:09:15.934+0800: 631618.100: [GC (Allocation Failure) 2021-07-29T19:09:15.934+0800: 631618.101: [ParNew: 2831168K->2831168K(2831168K), 0.0000388 secs]2021-07-29T19:09:15.934+0800: 631618.101: [CMS2021-07-29T19:09:17.305+0800: 631619.471: [CMS-concurrent-mark: 3.668/3.938 secs] [Times: user=6.49 sys=0.01, real=3.94 secs] 
 (concurrent mode failure): 1646608K->1722401K(2097152K), 6.7005795 secs] 4477776K->1722401K(4928320K), [Metaspace: 224031K->224031K(1269760K)], 6.7028302 secs] [Times: user=6.71 sys=0.00, real=6.70 secs] 
2021-07-29T19:09:24.732+0800: 631626.899: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1722401K(2097152K)] 3131004K(4928320K), 0.3961644 secs] [Times: user=0.69 sys=0.00, real=0.40 secs] 
2021-07-29T19:09:25.129+0800: 631627.296: [CMS-concurrent-mark-start]
2021-07-29T19:09:29.012+0800: 631631.179: [GC (Allocation Failure) 2021-07-29T19:09:29.013+0800: 631631.180: [ParNew: 2516608K->2516608K(2831168K), 0.0000292 secs]2021-07-29T19:09:29.013+0800: 631631.180: [CMS2021-07-29T19:09:30.733+0800: 631632.900: [CMS-concurrent-mark: 5.487/5.603 secs] [Times: user=9.29 sys=0.00, real=5.60 secs] 
 (concurrent mode failure): 1722401K->1519344K(2097152K), 6.6845837 secs] 4239009K->1519344K(4928320K), [Metaspace: 223389K->223389K(1269760K)], 6.6863578 secs] [Times: user=6.70 sys=0.00, real=6.69 secs]

Next, the interface timings during that period:

Figure 12: queryAll requests sorted by elapsed time during the reproduction

Every call took a long time, which is also as expected. Finally, the call chain of one long-RT request:

Figure 13: Call chain of a long-RT request during the reproduction

It is consistent with Figure 1, the trace the business colleagues originally reported, which shows the reproduction behaves as expected.

With this verification, the conclusion is shown to be correct. The root cause has been found, and the problem can be archived — though a few details are worth exploring further.

5. Further exploration

5.1 Where the time goes

For performance reasons, the trace data Pinpoint collects is relatively coarse, so it cannot tell us in detail where queryAll's time is spent. To find out, I measured queryAll in detail: I triggered a request while the application load was low and traced the call path with Arthas's trace command. The results are as follows:

Figure 14: queryAll call chain under normal load

I numbered three of the method calls in the figure: between methods ① and ② there is a 252 ms gap, and between methods ② and ③ a 294 ms gap. The timing printed by Arthas is shown below; a representative trace invocation follows the figure:

Figure 15: queryAll timing measured with Arthas
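For reference, the Arthas invocation for this kind of measurement is roughly of the following shape; the class and method names here are placeholders for the real ones.

trace com.xxx.service.XxxService queryAll -n 5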

The numbered entries in the Arthas output correspond one-to-one with the calls marked in Figure 14, and the expensive calls are highlighted in color. First, the composition of gap 1: between methods ① and ② there are two expensive operations. One is the batch model conversion, DO to DTO, at about 79.6 ms. The other is a call to Object#toString, at about 171.6 ms. Together they account for 251.2 ms, which matches the number reported by the full-link tracing system. You are probably wondering why toString costs so much time — the answer is below:

public void put(Object key, Object value) {
    // other checks omitted

    // convert key and value to strings first, then check whether they are blank
    if (StringUtils.isBlank(key.toString()) || StringUtils.isBlank(value.toString())) {
        return;
    }
    CacheClient.set(key, value);
}

This method is on the path between ① and ②. The code looks perfectly harmless; the problem lies in the blank check (it is legacy code — let's not debate whether it should exist). When value is a very large collection or array, toString takes a long time, and the huge string it produces quietly consumes memory as well. A seemingly insignificant piece of code turns out to be a performance black hole, and it reminds us to watch both the time and the space cost when handling large volumes of data.
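One possible way to keep the intent of the check without paying the toString() cost — a sketch only, not the fix that was actually applied — is to test emptiness type by type instead of stringifying the value:

// Sketch of a cheaper guard: never stringify a potentially huge value just to test for "blank".
// StringUtils is assumed to be org.apache.commons.lang3.StringUtils.
public void put(Object key, Object value) {
    if (key == null || value == null) {
        return;
    }
    if (key instanceof CharSequence && StringUtils.isBlank((CharSequence) key)) {
        return;                                             // blank string key
    }
    if (value instanceof java.util.Collection && ((java.util.Collection<?>) value).isEmpty()) {
        return;                                             // nothing worth caching
    }
    CacheClient.set(key, value);
}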

Finally, the gap between methods ② and ③. The reason is obvious: serializing value takes a long time. In addition, the serialized byte array is held temporarily in the heap, which also consumes a considerable amount of memory.

That concludes the timing analysis. As we have seen, one simple query can lead to a surprising number of problems, and code that looks plain and ordinary can occasionally turn into a performance killer. In day-to-day work, keep an eye on application performance and guard the application's stability.

5.2 Memory consumption measurement

When the problem originally occurred, the JVM was only doing full GCs and did not run out of memory, so no heap dump was captured at the time. Fortunately, the problem reproduces reliably, so with a little JVM configuration we can have the JVM dump the heap automatically before a full GC. Here jinfo is used to set the flag on the running JVM process:

jinfo -flag +HeapDumpBeforeFullGC $JVM_PID
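If I remember correctly, HeapDumpPath is also a manageable flag, so the dump location can be pointed at a disk with enough free space in the same way — treat this as a suggestion to verify rather than a guarantee:

jinfo -flag HeapDumpPath=/data/dump $JVM_PID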

With the heap dump in hand, the next step is to analyze it with MAT. First, the leak suspects:

Figure 16: MAT leak suspects report

In terms of memory share there is indeed something wrong, and it is mainly tied to the dubbo threads. Pick any of those threads and use the dominator tree to look at the objects it retains:

Figure 17: Dominator tree of a dubbo thread

As the figure shows, a dubbo thread retains roughly 66 MB, consisting mainly of two large objects: an ArrayList and a StringBuilder. The ArrayList holds DTOs, with a single DTO's retained heap at about 5.1 KB. The StringBuilder comes mostly from the toString call and accounts for nearly 26 MB of memory. No DO list shows up in the dump; it was presumably already reclaimed by a young GC.

That is as far as the memory analysis goes — knowing where the memory went is enough, and I won't dig any deeper.

6. Summary

Having relatively little troubleshooting experience, I spent quite a lot of time on this problem on and off. There was the frustration of finding no cause and the delight of seeing conjectures confirmed. Either way, the cause was found in the end, and while helping others I learned a great deal myself — on the whole, worth it. To close, a brief recap of the process.

At the beginning I went straight at the specific symptom, expecting to trace it back to its cause. I made various conjectures but could not reach a reasonable conclusion. I then broadened the scope of information collected, still without result. After piecing the information together and rethinking it, I looked in a different direction and found the cause. Finally I verified it, explained the remaining doubtful points, and wrapped up.

Finally, the problems with my own approach. The first was not treating the reported information carefully enough: I did not verify it quickly, but dived straight in, and spent a lot of time on conjectures that led nowhere. Information reported by others is usually fragmentary and one-sided, sometimes even wrong; verify it quickly — with luck it points straight at the cause. But if that road turns out to be blocked, do not tunnel in on it, because the reported symptom may be only one of many. In this case, the long interface RT looked like druid's fault but was actually the result of GC-induced stop-the-world pauses; continuing to dig in the druid direction would only have led further from the answer. The other issues were minor and not worth going into. One more tip: during an investigation, save the relevant logs and data promptly so they are not lost to retention limits — Alibaba Cloud's RDS SQL Insight, for example, only keeps data for a limited period.

That's the end of the article — thanks for reading!

This article is published under a Creative Commons 4.0 license; please credit the source when reposting.
Author: Tian Xiaobo
Original articles are published first on my personal website: https://www.tianxiaobo.com

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
