当前位置:网站首页>How difficult is it to implement a counter with higher performance than longadder

How difficult is it to implement a counter with higher performance than longadder

2021-09-15 04:25:00 roshilikang

This article has been included https://github.com/lkxiaolou/lkxiaolou welcome star.

Tough LongAdder

LongAdder yes jdk8 Introduced thread safe counters for statistical scenarios .

Before that , Implementation of a thread safe counter or lock , Or use AtomicLong, Locking performance must be very poor ,AtomicLong Performance is much better , But in high concurrency 、 Multithreading , And it's hard . So there was LongAdder,LongAdder There are two important ways :add and sum,add Is thread safe plus ,sum It's the return result , The name sum Because LongAdder A set of variables are maintained by the idea of segmentation , Multithreaded concurrent updates are hashed to different variables for execution , Reduce conflict , So the final way to get the return value is to sum these variables . From this we can see that sum The results obtained are inaccurate , So it only applies to statistical scenarios , If you want to get the exact return value , Still have to use AtomicLong, You can't have both performance and accuracy .

adopt JMH test LongAdder、AtomicLong And the performance of the locked counter , Feel the LongAdder A powerful .( Unless otherwise specified , Later in this article JMH All tests are based on this standard :fork1 process ,4 Threads , preheating 2 Time , Official measurement 2 Time , Testing machine 4 nucleus , Full code Uploaded github, There is an address at the end of the article )

private final AtomicLong atomicLong = new AtomicLong();
    private final LongAdder longAdder = new LongAdder();
    private long counter = 0;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(LongAdderTest.class.getSimpleName())
                .forks(1)
                .threads(4)
                .warmupIterations(2)
                .measurementIterations(2)
                .mode(Mode.Throughput)
                .syncIterations(false)
                .build()
;

        new Runner(opt).run();
    }

    @Benchmark
    public void testAtomic() {
        atomicLong.incrementAndGet();
    }

    @Benchmark
    public void testLongAdder() {
        longAdder.increment();
    }

    @Benchmark
    public synchronized void testLockAdder() {
        counter++;
    }

After operation

Benchmark                          Mode  Cnt          Score   Error  Units
LongAdderTest.testAtomic          thrpt    2   73520672.658          ops/s
LongAdderTest.testLockAdder       thrpt    2   23456856.867          ops/s
LongAdderTest.testLongAdder       thrpt    2  300013067.345          ops/s

You can see LongAdder And the other two implementations are not on the same order of magnitude at all , Performance and horror . Since I know LongAdder The general principle of , Can we achieve a MyLongAdder, While keeping the writing thread safe , Performance is comparable or even better than LongAdder Well ?

AtomicLong piecewise (V0)

A lot of performance optimization depends on LongAdder This way of segmentation , Such as ConcurrentHashMap It's a segmented lock , So we also realized a V0 Version of MyLongAdder

public class MyLongAdderV0 {

    private final int coreSize;
    private final AtomicLong[] counts;

    public MyLongAdderV0(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLong[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLong();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() % coreSize);
        counts[index].incrementAndGet();
    }
}

Use one AtomicLong Array , When a thread is executing , By thread id Hash out ,coreSize The expectation here is cpu Check the number , and LongAdder、AtomicLong Compare it ( The test code is omitted , Same after )

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testAtomic         thrpt    2   73391661.579          ops/s
LongAdderTest.testLongAdder      thrpt    2  309539056.885          ops/s
LongAdderTest.testMyLongAdderV0  thrpt    2   83737867.380          ops/s

emmm,V0 Performance is only better than AtomicLong A little bit better , Follow LongAdder Still not on the same order of magnitude , Is the array not big enough ? take coreSize As a parameter , Test it 4, 8, 16, 32 The situation of , I tested it several times , Every time the result is different, but it's about the same order of magnitude ( Every once in a while, hundreds of millions ), It's impossible to summarize the results and coreSize The relationship between , Here's a set of

Benchmark                        (coreSize)   Mode  Cnt          Score   Error  Units
LongAdderTest.testMyLongAdderV0           4  thrpt    2   62328997.667          ops/s
LongAdderTest.testMyLongAdderV0           8  thrpt    2  124725716.902          ops/s
LongAdderTest.testMyLongAdderV0          16  thrpt    2   84718415.566          ops/s
LongAdderTest.testMyLongAdderV0          32  thrpt    2   85321816.442          ops/s

Guess it's because it's thread dependent id, The dispersion is not uniform enough to cause , And there's an interesting situation , occasionally V0 It's better than AtomicLong The performance of the system is still low .

Take the mold optimization (V1)

be aware V0 There's a mold taking operation inside , This operation may be time consuming , May lead to V0 It's not even as good as a single AtomicLong, It can be replaced by a shift operation , But the premise of replacement is coreSize It has to be for 2 Of n Power , Such as 2,4,8,16( We assume that in the future coreSize Take only 2 Of n Power ),V1 The code of the version is as follows :

public class MyLongAdderV1 {

    private final int coreSize;
    private final AtomicLong[] counts;

    public MyLongAdderV1(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLong[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLong();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
        counts[index].incrementAndGet();
    }

}

Test the performance

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  312683635.190          ops/s
LongAdderTest.testMyLongAdderV0  thrpt    2   60641758.648          ops/s
LongAdderTest.testMyLongAdderV1  thrpt    2  100887869.829          ops/s

The performance is a little better , But with LongAdder It's a long way from it

Eliminate pseudo sharing (V2)

stay cpu The memory in front of me is too slow , therefore cpu There are three levels of cache L3,L2,L1.L1 Nearest cpu, And the fastest ,cpu The order of search is first L1, Again L2, Again L3, Finally, if you can't get it, you go to the memory to get it . In general, each cache consists of many cache rows , Cache lines are usually 64 Bytes ,java Of long yes 8 byte , So a cache line can cache 8 individual long Variable . If the threads of multiple cores operate on different variable data in the same cache row , Then there will be frequent cache failures , Even at the code level, there is no relationship between the data operated by these two threads . This kind of unreasonable resource competition situation pseudonym sharing (False Sharing), It will seriously affect the concurrent execution efficiency of the machine .

stay V1 in ,AtomicLong There is one of them. value, Every time incrementAndGet It will change this value, meanwhile AtomicLong Is an array , The memory address of the array is also continuous , This will lead to adjacent AtomicLong Of value Cache invalidation , Other threads read this value It's going to be slow . The way to optimize is to fill in AtomicLong, Let each AtomicLong Of value Mutual isolation , Don't influence each other .

In general, there are several ways to fill cache rows :

  • (1)java8 You can use... On class properties @sun.misc.Contended,jvm Parameters need to be specified -XX:-RestrictContended
  • (2) Use inheritance to insert variables before and after the properties of the class , Here's an example of inheritance , If you don't have to inherit , These filled useless variables will be optimized by the compiler , Of course, you can also use arrays to construct padding , I won't go into that .
abstract class RingBufferPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

abstract class RingBufferFields<Eextends RingBufferPad {
    protected long value;
}

public final class RingBuffer<Eextends RingBufferFields<E{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

We just use java8 Of @sun.misc.Contended Come on V1 To optimize

public class MyLongAdderV2 {

    private static class AtomicLongWrap {
        @Contended
        private final AtomicLong value = new AtomicLong();
    }

    private final int coreSize;
    private final AtomicLongWrap[] counts;

    public MyLongAdderV2(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLongWrap[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLongWrap();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
        counts[index].value.incrementAndGet();
    }

}

After the execution, something magical happened

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  272733686.330          ops/s
LongAdderTest.testMyLongAdderV2  thrpt    2  307754425.667          ops/s

incredibly V2 Version than LongAdder faster ! But is it true ? So , I tested a few more groups , The number of lines is 8 when

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  260909722.754          ops/s
LongAdderTest.testMyLongAdderV2  thrpt    2  215785206.276          ops/s

The number of threads is 16 when :

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  307269737.067          ops/s
LongAdderTest.testMyLongAdderV2  thrpt    2  185774540.302          ops/s

It is found that as the number of threads increases ,V2 It's getting worse and worse , but LongAdder not to turn a hair , I have to admire writing jdk Big guy .

improvement hash Algorithm

V0 To V2 All versions use threads id As hash Value to hash to different slot points , Threads id It doesn't change after it's generated , This will result in different test results each time , If it's more focused , Performance is bound to be poor , When the number of threads increases, it will inevitably cause more conflicts , Is there a better hash Algorithm ?

  • Try hashCode java Each object of the has a hashCode, We use the thread object hashCode Try hashing , edition V3 The key changes are as follows
public void increment() {
    int index = Thread.currentThread().hashCode() & (coreSize - 1);
    counts[index].incrementAndGet();
}

give the result as follows

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  277084413.669          ops/s
LongAdderTest.testMyLongAdderV3  thrpt    2  103351246.650          ops/s

The performance doesn't seem to be satisfactory .

  • Try random numbers

Of course use Random Of course not , With better performance ThreadLocalRandom,V4 The key changes to the version are as follows

public void increment() {
      counts[ThreadLocalRandom.current().nextInt(coreSize)].value.incrementAndGet();
  }

give the result as follows

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  292807355.101          ops/s
LongAdderTest.testMyLongAdderV4  thrpt    2   95200307.226          ops/s

The performance is not good , Guess because it's time consuming to generate random numbers .

  • Recalculate in case of conflict hash

In order to optimize the V4 edition , Refer to the LongAdder, It's a black technology , Generating a random number exists Thread In the object , You can look at it Thread class , It happens to have this variable

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@sun.misc.Contended("tlr")
int threadLocalRandomProbe;

But this variable is not open to the outside world , Only by reflection ( The poor performance ) perhaps UNSAFE Come and get it , It's in ThreadLocalRandomSeed Is initialized in , Regenerate and modify it when a conflict occurs ( The generation method can refer to ThreadLocalRandomSeed), through UNSAFE It can be done . Since it's time to start over in conflict hash, It has to be able to detect conflict ,AtomicLong You can't use incrementAndGet 了 , Use AtomicLong Of compareAndSet Method , return false It means there is a conflict , In case of conflict, re hash, And use incrementAndGet The bottom line , Guaranteed to be successful . In this way , It can be hashed out evenly , It can also ensure the efficiency of random number generation .V5 The version code is as follows

public class MyLongAdderV5 {

    private static sun.misc.Unsafe UNSAFE = null;
    private static final long PROBE;
    static {

        try {
            //  Reflection acquisition unsafe
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (Exception e) {

        }

        try {
            Class<?> tk = Thread.class;
            PROBE = UNSAFE.objectFieldOffset
                    (tk.getDeclaredField("threadLocalRandomProbe"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    static final int getProbe() {
        //  obtain thread Of threadLocalRandomProbe Property value
        return UNSAFE.getInt(Thread.currentThread(), PROBE);
    }

    static final int advanceProbe(int probe) {
        //  Regenerate the random number and write thread object
        probe ^= probe << 13;   // xorshift
        probe ^= probe >>> 17;
        probe ^= probe << 5;
        UNSAFE.putInt(Thread.currentThread(), PROBE, probe);
        return probe;
    }

    private static class AtomicLongWrap {
        @Contended
        private final AtomicLong value = new AtomicLong();
    }

    private final int coreSize;
    private final AtomicLongWrap[] counts;

    public MyLongAdderV5(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLongWrap[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLongWrap();
        }
    }

    public void increment() {

        int h = getProbe();

        int index = getProbe() & (coreSize - 1);
        long r;
        if (!counts[index].value.compareAndSet(r = counts[index].value.get(), r + 1)) {
            if (h == 0) {
                //  Initialize random number
                ThreadLocalRandom.current();
                h = getProbe();
            }
            //  Regenerate random number after conflict
            advanceProbe(h);
            //  use getAndIncrement Come to the bottom
            counts[index].value.getAndIncrement();
        }
    }

}

give the result as follows

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  274131797.300          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  298402832.456          ops/s

The effect is OK , try 8 Threads :

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  324982482.774          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  290476796.289          ops/s

16 Threads :

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  291180444.998          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  282745610.470          ops/s

32 Threads :

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  294237473.396          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  301187346.873          ops/s

Sure enough, this method is very arrogant , No matter how many threads, it can be stable as .

summary

Achieve a transcendence LongAdder Multithreading counters for performance are very difficult , Tossed for two days, but also to achieve and LongAdder Quite good performance , One of the biggest performance changes is

  • piecewise : Based optimization , You can think of it
  • Take the mold optimization : It's more basic
  • Eliminate pseudo sharing : This optimization has been greatly improved
  • hash Algorithm : This one guarantees stability , No matter how many threads there are, it's the highest throughput

The first three are more conventional , The fourth can be regarded as Black science and technology


All the test code has been uploaded https://github.com/lkxiaolou/all-in-one/tree/master/src/main/java/org/newboo/longadder


WeChat official account " Master bug catcher ", Back end technology sharing , Architecture design 、 performance optimization 、 Source code reading 、 Troubleshoot problems 、 Step on the pit practice .

版权声明
本文为[roshilikang]所创,转载请带上原文链接,感谢
https://chowdera.com/2021/09/20210909112309716m.html

随机推荐