当前位置:网站首页>How difficult is it to implement a counter with higher performance than longadder
How difficult is it to implement a counter with higher performance than longadder
2021-09-15 04:25:00 【roshilikang】
This article has been included https://github.com/lkxiaolou/lkxiaolou welcome star.
Tough LongAdder
LongAdder yes jdk8 Introduced thread safe counters for statistical scenarios .
Before that , Implementation of a thread safe counter or lock , Or use AtomicLong, Locking performance must be very poor ,AtomicLong Performance is much better , But in high concurrency 、 Multithreading , And it's hard . So there was LongAdder,LongAdder There are two important ways :add and sum,add Is thread safe plus ,sum It's the return result , The name sum Because LongAdder A set of variables are maintained by the idea of segmentation , Multithreaded concurrent updates are hashed to different variables for execution , Reduce conflict , So the final way to get the return value is to sum these variables . From this we can see that sum The results obtained are inaccurate , So it only applies to statistical scenarios , If you want to get the exact return value , Still have to use AtomicLong, You can't have both performance and accuracy .
adopt JMH test LongAdder、AtomicLong And the performance of the locked counter , Feel the LongAdder A powerful .( Unless otherwise specified , Later in this article JMH All tests are based on this standard :fork1 process ,4 Threads , preheating 2 Time , Official measurement 2 Time , Testing machine 4 nucleus , Full code Uploaded github, There is an address at the end of the article )
private final AtomicLong atomicLong = new AtomicLong();
private final LongAdder longAdder = new LongAdder();
private long counter = 0;
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(LongAdderTest.class.getSimpleName())
.forks(1)
.threads(4)
.warmupIterations(2)
.measurementIterations(2)
.mode(Mode.Throughput)
.syncIterations(false)
.build();
new Runner(opt).run();
}
@Benchmark
public void testAtomic() {
atomicLong.incrementAndGet();
}
@Benchmark
public void testLongAdder() {
longAdder.increment();
}
@Benchmark
public synchronized void testLockAdder() {
counter++;
}
After operation
Benchmark Mode Cnt Score Error Units
LongAdderTest.testAtomic thrpt 2 73520672.658 ops/s
LongAdderTest.testLockAdder thrpt 2 23456856.867 ops/s
LongAdderTest.testLongAdder thrpt 2 300013067.345 ops/s
You can see LongAdder And the other two implementations are not on the same order of magnitude at all , Performance and horror . Since I know LongAdder The general principle of , Can we achieve a MyLongAdder, While keeping the writing thread safe , Performance is comparable or even better than LongAdder Well ?
AtomicLong piecewise (V0)
A lot of performance optimization depends on LongAdder This way of segmentation , Such as ConcurrentHashMap It's a segmented lock , So we also realized a V0 Version of MyLongAdder
public class MyLongAdderV0 {
private final int coreSize;
private final AtomicLong[] counts;
public MyLongAdderV0(int coreSize) {
this.coreSize = coreSize;
this.counts = new AtomicLong[coreSize];
for (int i = 0; i < coreSize; i++) {
this.counts[i] = new AtomicLong();
}
}
public void increment() {
int index = (int) (Thread.currentThread().getId() % coreSize);
counts[index].incrementAndGet();
}
}
Use one AtomicLong Array , When a thread is executing , By thread id Hash out ,coreSize The expectation here is cpu Check the number , and LongAdder、AtomicLong Compare it ( The test code is omitted , Same after )
Benchmark Mode Cnt Score Error Units
LongAdderTest.testAtomic thrpt 2 73391661.579 ops/s
LongAdderTest.testLongAdder thrpt 2 309539056.885 ops/s
LongAdderTest.testMyLongAdderV0 thrpt 2 83737867.380 ops/s
emmm,V0 Performance is only better than AtomicLong A little bit better , Follow LongAdder Still not on the same order of magnitude , Is the array not big enough ? take coreSize As a parameter , Test it 4, 8, 16, 32 The situation of , I tested it several times , Every time the result is different, but it's about the same order of magnitude ( Every once in a while, hundreds of millions ), It's impossible to summarize the results and coreSize The relationship between , Here's a set of
Benchmark (coreSize) Mode Cnt Score Error Units
LongAdderTest.testMyLongAdderV0 4 thrpt 2 62328997.667 ops/s
LongAdderTest.testMyLongAdderV0 8 thrpt 2 124725716.902 ops/s
LongAdderTest.testMyLongAdderV0 16 thrpt 2 84718415.566 ops/s
LongAdderTest.testMyLongAdderV0 32 thrpt 2 85321816.442 ops/s
Guess it's because it's thread dependent id, The dispersion is not uniform enough to cause , And there's an interesting situation , occasionally V0 It's better than AtomicLong The performance of the system is still low .
Take the mold optimization (V1)
be aware V0 There's a mold taking operation inside , This operation may be time consuming , May lead to V0 It's not even as good as a single AtomicLong, It can be replaced by a shift operation , But the premise of replacement is coreSize It has to be for 2 Of n Power , Such as 2,4,8,16( We assume that in the future coreSize Take only 2 Of n Power ),V1 The code of the version is as follows :
public class MyLongAdderV1 {
private final int coreSize;
private final AtomicLong[] counts;
public MyLongAdderV1(int coreSize) {
this.coreSize = coreSize;
this.counts = new AtomicLong[coreSize];
for (int i = 0; i < coreSize; i++) {
this.counts[i] = new AtomicLong();
}
}
public void increment() {
int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
counts[index].incrementAndGet();
}
}
Test the performance
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 312683635.190 ops/s
LongAdderTest.testMyLongAdderV0 thrpt 2 60641758.648 ops/s
LongAdderTest.testMyLongAdderV1 thrpt 2 100887869.829 ops/s
The performance is a little better , But with LongAdder It's a long way from it
Eliminate pseudo sharing (V2)
stay cpu The memory in front of me is too slow , therefore cpu There are three levels of cache L3,L2,L1.L1 Nearest cpu, And the fastest ,cpu The order of search is first L1, Again L2, Again L3, Finally, if you can't get it, you go to the memory to get it . In general, each cache consists of many cache rows , Cache lines are usually 64 Bytes ,java Of long yes 8 byte , So a cache line can cache 8 individual long Variable . If the threads of multiple cores operate on different variable data in the same cache row , Then there will be frequent cache failures , Even at the code level, there is no relationship between the data operated by these two threads . This kind of unreasonable resource competition situation pseudonym sharing (False Sharing), It will seriously affect the concurrent execution efficiency of the machine .
stay V1 in ,AtomicLong There is one of them. value, Every time incrementAndGet It will change this value, meanwhile AtomicLong Is an array , The memory address of the array is also continuous , This will lead to adjacent AtomicLong Of value Cache invalidation , Other threads read this value It's going to be slow . The way to optimize is to fill in AtomicLong, Let each AtomicLong Of value Mutual isolation , Don't influence each other .
In general, there are several ways to fill cache rows :
-
(1)java8 You can use... On class properties @sun.misc.Contended,jvm Parameters need to be specified -XX:-RestrictContended -
(2) Use inheritance to insert variables before and after the properties of the class , Here's an example of inheritance , If you don't have to inherit , These filled useless variables will be optimized by the compiler , Of course, you can also use arrays to construct padding , I won't go into that .
abstract class RingBufferPad {
protected long p1, p2, p3, p4, p5, p6, p7;
}
abstract class RingBufferFields<E> extends RingBufferPad {
protected long value;
}
public final class RingBuffer<E> extends RingBufferFields<E> {
protected long p1, p2, p3, p4, p5, p6, p7;
}
We just use java8 Of @sun.misc.Contended
Come on V1 To optimize
public class MyLongAdderV2 {
private static class AtomicLongWrap {
@Contended
private final AtomicLong value = new AtomicLong();
}
private final int coreSize;
private final AtomicLongWrap[] counts;
public MyLongAdderV2(int coreSize) {
this.coreSize = coreSize;
this.counts = new AtomicLongWrap[coreSize];
for (int i = 0; i < coreSize; i++) {
this.counts[i] = new AtomicLongWrap();
}
}
public void increment() {
int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
counts[index].value.incrementAndGet();
}
}
After the execution, something magical happened
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 272733686.330 ops/s
LongAdderTest.testMyLongAdderV2 thrpt 2 307754425.667 ops/s
incredibly V2 Version than LongAdder faster ! But is it true ? So , I tested a few more groups , The number of lines is 8 when
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 260909722.754 ops/s
LongAdderTest.testMyLongAdderV2 thrpt 2 215785206.276 ops/s
The number of threads is 16 when :
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 307269737.067 ops/s
LongAdderTest.testMyLongAdderV2 thrpt 2 185774540.302 ops/s
It is found that as the number of threads increases ,V2 It's getting worse and worse , but LongAdder not to turn a hair , I have to admire writing jdk Big guy .
improvement hash Algorithm
V0 To V2 All versions use threads id As hash Value to hash to different slot points , Threads id It doesn't change after it's generated , This will result in different test results each time , If it's more focused , Performance is bound to be poor , When the number of threads increases, it will inevitably cause more conflicts , Is there a better hash Algorithm ?
-
Try hashCode java Each object of the has a hashCode, We use the thread object hashCode Try hashing , edition V3 The key changes are as follows
public void increment() {
int index = Thread.currentThread().hashCode() & (coreSize - 1);
counts[index].incrementAndGet();
}
give the result as follows
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 277084413.669 ops/s
LongAdderTest.testMyLongAdderV3 thrpt 2 103351246.650 ops/s
The performance doesn't seem to be satisfactory .
-
Try random numbers
Of course use Random Of course not , With better performance ThreadLocalRandom,V4 The key changes to the version are as follows
public void increment() {
counts[ThreadLocalRandom.current().nextInt(coreSize)].value.incrementAndGet();
}
give the result as follows
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 292807355.101 ops/s
LongAdderTest.testMyLongAdderV4 thrpt 2 95200307.226 ops/s
The performance is not good , Guess because it's time consuming to generate random numbers .
-
Recalculate in case of conflict hash
In order to optimize the V4 edition , Refer to the LongAdder, It's a black technology , Generating a random number exists Thread In the object , You can look at it Thread class , It happens to have this variable
/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@sun.misc.Contended("tlr")
int threadLocalRandomProbe;
But this variable is not open to the outside world , Only by reflection ( The poor performance ) perhaps UNSAFE Come and get it , It's in ThreadLocalRandomSeed Is initialized in , Regenerate and modify it when a conflict occurs ( The generation method can refer to ThreadLocalRandomSeed), through UNSAFE It can be done . Since it's time to start over in conflict hash, It has to be able to detect conflict ,AtomicLong You can't use incrementAndGet 了 , Use AtomicLong Of compareAndSet Method , return false It means there is a conflict , In case of conflict, re hash, And use incrementAndGet The bottom line , Guaranteed to be successful . In this way , It can be hashed out evenly , It can also ensure the efficiency of random number generation .V5 The version code is as follows
public class MyLongAdderV5 {
private static sun.misc.Unsafe UNSAFE = null;
private static final long PROBE;
static {
try {
// Reflection acquisition unsafe
Field f = Unsafe.class.getDeclaredField("theUnsafe");
f.setAccessible(true);
UNSAFE = (Unsafe) f.get(null);
} catch (Exception e) {
}
try {
Class<?> tk = Thread.class;
PROBE = UNSAFE.objectFieldOffset
(tk.getDeclaredField("threadLocalRandomProbe"));
} catch (Exception e) {
throw new Error(e);
}
}
static final int getProbe() {
// obtain thread Of threadLocalRandomProbe Property value
return UNSAFE.getInt(Thread.currentThread(), PROBE);
}
static final int advanceProbe(int probe) {
// Regenerate the random number and write thread object
probe ^= probe << 13; // xorshift
probe ^= probe >>> 17;
probe ^= probe << 5;
UNSAFE.putInt(Thread.currentThread(), PROBE, probe);
return probe;
}
private static class AtomicLongWrap {
@Contended
private final AtomicLong value = new AtomicLong();
}
private final int coreSize;
private final AtomicLongWrap[] counts;
public MyLongAdderV5(int coreSize) {
this.coreSize = coreSize;
this.counts = new AtomicLongWrap[coreSize];
for (int i = 0; i < coreSize; i++) {
this.counts[i] = new AtomicLongWrap();
}
}
public void increment() {
int h = getProbe();
int index = getProbe() & (coreSize - 1);
long r;
if (!counts[index].value.compareAndSet(r = counts[index].value.get(), r + 1)) {
if (h == 0) {
// Initialize random number
ThreadLocalRandom.current();
h = getProbe();
}
// Regenerate random number after conflict
advanceProbe(h);
// use getAndIncrement Come to the bottom
counts[index].value.getAndIncrement();
}
}
}
give the result as follows
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 274131797.300 ops/s
LongAdderTest.testMyLongAdderV5 thrpt 2 298402832.456 ops/s
The effect is OK , try 8 Threads :
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 324982482.774 ops/s
LongAdderTest.testMyLongAdderV5 thrpt 2 290476796.289 ops/s
16 Threads :
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 291180444.998 ops/s
LongAdderTest.testMyLongAdderV5 thrpt 2 282745610.470 ops/s
32 Threads :
Benchmark Mode Cnt Score Error Units
LongAdderTest.testLongAdder thrpt 2 294237473.396 ops/s
LongAdderTest.testMyLongAdderV5 thrpt 2 301187346.873 ops/s
Sure enough, this method is very arrogant , No matter how many threads, it can be stable as .
summary
Achieve a transcendence LongAdder Multithreading counters for performance are very difficult , Tossed for two days, but also to achieve and LongAdder Quite good performance , One of the biggest performance changes is
-
piecewise : Based optimization , You can think of it -
Take the mold optimization : It's more basic -
Eliminate pseudo sharing : This optimization has been greatly improved -
hash Algorithm : This one guarantees stability , No matter how many threads there are, it's the highest throughput
The first three are more conventional , The fourth can be regarded as Black science and technology
All the test code has been uploaded https://github.com/lkxiaolou/all-in-one/tree/master/src/main/java/org/newboo/longadder
WeChat official account " Master bug catcher ", Back end technology sharing , Architecture design 、 performance optimization 、 Source code reading 、 Troubleshoot problems 、 Step on the pit practice .

版权声明
本文为[roshilikang]所创,转载请带上原文链接,感谢
https://chowdera.com/2021/09/20210909112309716m.html
边栏推荐
- Garde la distance.
- 90 lines of code to implement the module packer
- When the OrCAD schematic is opened, suddenly the schematic file cannot be found
- Purpose and difference between Maitreya clamp and active clamp of IGBT
- [PHP source code] z-blogphp pirate navigation theme template
- Architecture and data science behind hundreds of thousands of experiments a year
- 大廠Android面試總結 詳細解答,Android技術圖譜
- Résumé de l'entrevue Android de Dachang, carte technique Android
- 大厂程序员35岁后的职业出路在哪,京东最新Android面试真题解析
- 万字长文,面试官老爱问适配器模式与外观模式,
猜你喜欢
-
大厂Offer拿到手软啊,程序员中年危机
-
大厂Offer拿到手软啊,不可思议
-
Le résumé de l'entrevue Android de Dachang est en retard
-
Agent de cache SQUID
-
6 - year New Bird Development interview Byte Jumping Android R & D Post, Dry goods collation
-
Notes d'étude de HongMeng, 714 pages PDF, or, neuf, argent et dix
-
大廠Offer拿到手軟啊,不可思議
-
L'offre de la grande usine est douce. Incroyable.
-
L'offre d'une grande usine est douce.
-
L'intervieweur demande toujours le mode adaptateur et le mode d'apparence.
随机推荐
- Où est la sortie professionnelle pour les programmeurs d'usine après 35 ans?
- 大廠Offer拿到手軟啊,程序員中年危機
- Stockage de produits Cloud packages, illimité et gratuit
- Opencv4 machine learning (VI): principle and implementation of k-means
- [yolop interpretation] you only look once for panoramic driving perception
- Tencent Cloud et d'autres systèmes de protection préférentiels
- Nombre maximum de points en ligne droite
- [interpretation of pointpillars] fast encoder for point cloud target detection
- [rangenet + + interpretation] fast and accurate lidar semantic segmentation
- [yolof interpretation] you only look one level feature (CVPR 2021)
- [SNE roadseg interpretation] pavement segmentation network combined with surface normal vector (eccv2020)
- [yolox interpretation] anchor free target detector comparable to yolov5!
- Opencv4 machine learning (4): geometric transformation and affine transformation of images
- Detailed explanation of fast SCNN semantic segmentation network
- 萬字長文,面試官老愛問適配器模式與外觀模式,
- 大牛耗时一年最佳总结,让你的app体验更丝滑,BAT大厂面试总结
- 不敢跟面试官对线,通过五轮面试斩获offer阿里实习生亲述,
- Comparison of location loss functions in target detection network: IOU, giou, ciou, Diou, L1, L2, smooth L1
- Blue Bridge Cup software provincial competition in April 2021: title + analysis (full version)
- [C + + cultivation plan] don't talk about learning, just talk about dry goods (Day1)
- Tf2.0 deep learning practice (1): handwritten numeral recognition for classification problem
- 大廠程序員35歲後的職業出路在哪,京東最新Android面試真題解析
- Invitation | réunion Apache Pulsar 2021 - Shenzhen ce samedi
- The Dot Net Application Domain Study
- Trigger study
- Universal code, achieving with action -- Safety code scanning Professional Committee
- N'osez pas vous opposer à l'intervieweur et obtenir des commentaires personnels des stagiaires d'offer Ali après cinq rondes d'entrevue.
- Daniel prend le meilleur résumé de l'année pour rendre votre expérience d'application plus soyeuse.
- Easy to use and convenient development team management tool -- apipost
- 如何试出一个Android开发者真正的水平,【面试总结】
- 如何才能通过一线互联网公司面试,Android经典入门教程
- Comment passer une entrevue avec une entreprise Internet de première ligne, Android Classic Getting started tutoriel
- Comment essayer un développeur Android vraiment niveau, 【 résumé de l'entrevue 】
- Ad redefines PCB size
- [wonderful learning journey of cloud computing] phase I: getting to know cloud computing for the first time
- Sf58-asemi high voltage fast recovery diode in-line package
- Asp.net quick build application page main framework
- Soul painter: cartoon illustration SSH
- Special live broadcast of the first anniversary celebration of Hongmeng community · invitation letter
- Mathematics Interview Questions (X)