Skywalking memory leak troubleshooting

2021-09-15 04:25:08 roshilikang

This article is also included in https://github.com/lkxiaolou/lkxiaolou. Stars are welcome.

Background

My recent write-up of a dubbo memory leak was fairly complicated, and many readers said they could not follow it. That reminded me of a simpler memory leak I ran into earlier, which is easier to get started with, so I am sharing it here.

To add circuit breaking, degradation and rate limiting to our microservices, we introduced the sentinel component. Before rolling it out for internal use we customized it lightly: persistent rule configuration, collection and display of monitoring data, integration with our back-office login permissions, and so on.

After functional verification passed, we also ran load tests, and the performance met the requirements, so it went to production. Everything was fine at first, until one day a degradation rule was configured online and got triggered, and the monitoring alarms went off all at once.

First the service showed a large number of slow requests, then it died completely. The monitoring showed many slow requests, soaring CPU, frequent full GCs and exhausted memory, and java.lang.OutOfMemoryError appeared in the logs, so it was clearly a memory problem.

Troubleshooting

The only operation performed on the system at that time was enabling the degradation rule, so we deleted the rule and restarted immediately. The system recovered, but we had not saved a memory dump file. I assumed the investigation would be easy as long as the problem could be reproduced, so I tried on the pre-release machine used earlier, but it did not reproduce. After thinking it over, I wondered whether it depended on the machine. So I took one production machine out of rotation, configured the rule and applied a little load, and sure enough the problem appeared.

Once it could be reproduced, the next step was to dump the memory right away. Many people do not know how to dump the memory of a Java process: you can use the jmap command that ships with the JDK. Be careful when using it online, because dumping pauses the JVM, and with the live option jmap first triggers a full GC so that every object in the dump is alive (cannot be released).

jmap -dump:format=b,file=dump.bin ${pid}
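
When attaching jmap to the process is inconvenient, a dump can also be requested from inside the JVM. The following is only a minimal sketch and was not part of the original investigation. It assumes a HotSpot-based JDK, where com.sun.management.HotSpotDiagnosticMXBean is available; the class name HeapDumper and the file name dump.hprof are made up for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes a heap dump of the JVM it runs in.
final class HeapDumper {

    /**
     * Dumps the heap to the given file and returns its path.
     * With liveOnly=true only reachable objects are written.
     */
    static Path dump(String file, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Path path = Paths.get(file);
        try {
            // dumpHeap fails if the target already exists, so remove a stale file first.
            Files.deleteIfExists(path);
            // Recent JDKs expect the file name to end in ".hprof".
            bean.dumpHeap(path.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return path;
    }

    public static void main(String[] args) {
        Path path = dump("dump.hprof", true);
        System.out.println("Wrote " + path + " (" + path.toFile().length() + " bytes)");
    }
}
```

Like jmap, this pauses the application while the dump is written, so the same caution applies in production.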

If the dump file is too large, compress it with the tar command before downloading it for local analysis. The analysis tool used here is MAT, the Eclipse Memory Analyzer. Its official address is:

https://www.eclipse.org/mat/

The dump file shows that each dubbo thread holds about 2% of the memory. The application is configured with 200 threads, so in theory that is more than enough to exhaust the memory. Expanding one of the threads shows the following.

The dubbo thread holds one huge StringBuilder object. I copied out its value and found that the string is about 200MB. Only the first line is different; the rest is the same line repeated.

com.alibaba.csp.sentinel.slots.block.SentinelRpcException: com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
com.alibaba.csp.sentinel.slots.block.flow.FlowException
...

No such field can be found in the dubbo source code. All we know is that it is related to sentinel.

A single clue is rarely enough to find the truth. Quite often, analysing the memory itself yields only one lead, which is not enough to find the root cause unless the problem is very simple.

In my experience, memory leaks come with elevated CPU: insufficient memory triggers a full GC, the full GC cannot free anything, and a vicious cycle follows. That is why I did not look into the CPU at first. With a let's-try-it attitude I reproduced the problem once more and printed the thread stacks with the jstack command, to see whether any threads other than the GC threads were consuming CPU.

jstack ${pid} > jstack.txt
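
The same question, namely which non-GC threads are burning CPU, can also be asked from inside the process through the standard java.lang.management API. The sketch below is an illustration only and is not what was done in this investigation; the class name BusyThreads and the output format are made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists Java threads ordered by consumed CPU time.
final class BusyThreads {

    private static final class Row {
        final String name;
        final Thread.State state;
        final long cpuNanos;

        Row(String name, Thread.State state, long cpuNanos) {
            this.name = name;
            this.state = state;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns one "name state cpu-ms" line per live Java thread, busiest first. */
    static List<String> report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuAvailable = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();

        List<Row> rows = new ArrayList<>();
        // Only Java threads are listed; JVM-internal GC threads normally do not appear here.
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            long cpuNanos = cpuAvailable ? mx.getThreadCpuTime(id) : -1L;
            rows.add(new Row(info.getThreadName(), info.getThreadState(), cpuNanos));
        }
        rows.sort((a, b) -> Long.compare(b.cpuNanos, a.cpuNanos));

        List<String> lines = new ArrayList<>();
        for (Row row : rows) {
            lines.add(String.format("%-40s %-13s %d ms",
                    row.name, row.state, row.cpuNanos / 1_000_000));
        }
        return lines;
    }

    public static void main(String[] args) {
        report().forEach(System.out::println);
    }
}
```

A thread that stays near the top of this list across several samples is a good candidate to look up in the jstack output.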

The thread dump revealed the problem:

"DubboServerHandler-127.0.0.1:20880-thread-200" #532 daemon prio=5 os_prio=0 tid=0x00007f264c1f8000 nid=0x581a waiting for monitor entry [0x00007f25bae09000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at org.apache.skywalking.apm.agent.core.context.util.ThrowableTransformer.printExceptionInfo(ThrowableTransformer.java:57)
    at org.apache.skywalking.apm.agent.core.context.util.ThrowableTransformer.convert2String(ThrowableTransformer.java:34)
    at org.apache.skywalking.apm.agent.core.context.trace.AbstractTracingSpan.log(AbstractTracingSpan.java:152)
    at org.apache.skywalking.apm.agent.core.context.trace.ExitSpan.log(ExitSpan.java:112)
    at org.apache.skywalking.apm.agent.core.context.trace.ExitSpan.log(ExitSpan.java:38)
    at org.apache.skywalking.apm.plugin.dubbo.DubboInterceptor.dealException(DubboInterceptor.java:124)
    at org.apache.skywalking.apm.plugin.dubbo.DubboInterceptor.handleMethodException(DubboInterceptor.java:115)
    at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:97)
    at ...

The stack shows that skywalking keeps executing Arrays.copyOf. In short, skywalking is a component that collects distributed call chains. It works by applying bytecode enhancement at the "call" sites in Java code, which lets it capture call information with zero intrusion into business code. Its GitHub address is:

https://github.com/apache/skywalking

This explains why load testing did not catch the problem: skywalking was not deployed on the load-test machine.

Finding this stack essentially solved the problem. All that remained was to read the code.

When skywalking reports an exception, it writes the stack into a StringBuilder. This is where the bug is: when the stackTrace is empty, it appends in a loop until memory runs out.

I also found a fix for this bug on GitHub. We hit it because the skywalking version we used was too old.

https://github.com/apache/skywalking/pull/2931
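
To make this class of bug concrete, here is a simplified sketch. It is not the SkyWalking source. It assumes that the loop over the cause chain keeps re-reading the cause of the top-level exception, and that the length limit is only checked while stack frames are being appended. Under those assumptions an exception without frames never reaches the limit, which would be consistent with the dump above, where only the first line differs. All names (CauseChainDemo, NoStackException, MAX_LENGTH, SAFETY_CAP) are hypothetical, and SAFETY_CAP exists only so that the demo terminates.

```java
// Hypothetical reconstruction of the bug pattern, not the actual SkyWalking code.
final class CauseChainDemo {
    static final int MAX_LENGTH = 4000; // assumed size limit for the rendered text
    static final int SAFETY_CAP = 1000; // demo-only guard so the buggy loop terminates

    /** Exception without a stack trace, in the style of sentinel's BlockException. */
    static class NoStackException extends Exception {
        NoStackException(String message, Throwable cause) {
            super(message, cause);
        }

        @Override
        public Throwable fillInStackTrace() {
            return this;
        }
    }

    /** Renders the cause chain and returns how many loop iterations it took. */
    static int render(Throwable top, boolean buggy) {
        StringBuilder text = new StringBuilder();
        Throwable current = top;
        int iterations = 0;
        while (current != null && iterations < SAFETY_CAP) {
            iterations++;
            text.append(current).append('\n');
            boolean overLimit = false;
            // The length check only runs while stack frames are being appended.
            for (StackTraceElement frame : current.getStackTrace()) {
                text.append("    at ").append(frame).append('\n');
                if (text.length() > MAX_LENGTH) {
                    overLimit = true;
                    break;
                }
            }
            if (overLimit) {
                break;
            }
            // Buggy variant: always re-reads the cause of the top-level exception,
            // so the loop never walks past the first cause.
            current = buggy ? top.getCause() : current.getCause();
        }
        return iterations;
    }

    public static void main(String[] args) {
        Throwable flow = new NoStackException("flow", null);
        Throwable rpc = new NoStackException("rpc", flow);
        System.out.println("fixed iterations: " + render(rpc, false)); // prints 2
        System.out.println("buggy iterations: " + render(rpc, true));  // prints 1000 (the cap)
    }
}
```

With ordinary exceptions the buggy variant still stops, because appending frames eventually exceeds MAX_LENGTH. Without frames nothing stops it except the demo-only cap, and in a real agent that means appending until the heap is exhausted.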

With this bug, an exception whose stackTrace is not empty gets only two levels of exceptions recorded, while one whose stackTrace is empty leads straight to OOM. In other words, the exceptions that sentinel throws for rate limiting and degradation have an empty stackTrace. Below is part of the code of sentinel's BlockException. It overrides the fillInStackTrace method and simply returns this.

public abstract class BlockException extends Exception {
  @Override
  public Throwable fillInStackTrace() {
      return this;
  }
  ...
}

The default implementation of this method captures the stack through a native method. The override here returns this directly, so no stack information is recorded.

/**
 * Fills in the execution stack trace. This method records within this
 * {@code Throwable} object information about the current state of
 * the stack frames for the current thread.
 *
 * <p>If the stack trace of this {@code Throwable} {@linkplain
 * Throwable#Throwable(String, Throwable, boolean, boolean) is not
 * writable}, calling this method has no effect.
 *
 * @return  a reference to this {@code Throwable} instance.
 * @see     java.lang.Throwable#printStackTrace()
 */

public synchronized Throwable fillInStackTrace() {
    if (stackTrace != null ||
        backtrace != null /* Out of protocol state */ ) {
        fillInStackTrace(0);
        stackTrace = UNASSIGNED_STACK;
    }
    return this;
}

private native Throwable fillInStackTrace(int dummy);
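
The effect of such an override is easy to observe. The sketch below is an illustration with made-up class names (StacklessExceptions, OverrideStyle, ConstructorStyle). It compares the override used by BlockException with the writableStackTrace constructor flag that Throwable has offered since Java 7; the latter is shown only as an alternative and is not something the sentinel excerpt uses.

```java
// Hypothetical demo classes; only the JDK's Throwable/Exception API is real here.
final class StacklessExceptions {

    /** Same trick as the BlockException excerpt: skip the native stack walk. */
    static class OverrideStyle extends Exception {
        OverrideStyle(String message) {
            super(message);
        }

        @Override
        public Throwable fillInStackTrace() {
            return this;
        }
    }

    /** Alternative: tell Throwable not to record a stack trace at all. */
    static class ConstructorStyle extends Exception {
        ConstructorStyle(String message) {
            // arguments: message, cause, enableSuppression, writableStackTrace
            super(message, null, false, false);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Exception("plain").getStackTrace().length > 0);  // true
        System.out.println(new OverrideStyle("limited").getStackTrace().length);    // 0
        System.out.println(new ConstructorStyle("limited").getStackTrace().length); // 0
    }
}
```

Either way, any code that formats exceptions has to cope with a zero-length getStackTrace() result.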

We also know that a very deep exception stack hurts performance. For a component with very high performance requirements such as sentinel, dropping the exception stack information entirely is a "black magic" way to optimize performance. It is also a reminder that load tests should cover not only the normal case but also the exceptional one. If a system can sustain 5000 QPS of normal requests under load testing but only 2000 QPS when exceptions are thrown, then the 5000 QPS measured under normal conditions may not be achievable in real production.

Summary

  • Memory leaks usually come with other symptoms, such as high CPU, a rising error rate and frequent GC

  • The most important step when handling a memory leak is to capture a memory dump file on site, then analyse it with tools alongside the source code

  • If the second point does not solve the problem, look for a new angle, such as jstack


WeChat official account "Master bug catcher": back-end technology sharing, covering architecture design, performance optimization, source code reading, troubleshooting and lessons learned from pitfalls.

Copyright notice
This article was written by [roshilikang]. Please include the original link when reposting. Thank you.
https://chowdera.com/2021/09/20210909112309733d.html
