
Deep understanding of the nature of atomic operations

2021-01-23 23:24:10 itread01

Original address: [https://blog.fanscore.cn/p/34/](https://blog.fanscore.cn/p/34/)

# Introduction

Using the atomic operations of go1.14 on darwin/amd64 as an example, this article explores their assembly implementation and from it draws out the **`LOCK` instruction prefix**, **visibility**, the **MESI protocol**, the **Store Buffer**, the **Invalid Queue**, and **memory barriers**. By exploring the CPU architecture we can understand these concepts; some conclusions are given at the end.

# Atomic operations in Go

Take `atomic.CompareAndSwapInt32` as an example. Its function prototype is:

```
func CompareAndSwapInt32(addr *int32, old, new int32) (swapped bool)
```

The corresponding assembly is:

```
// sync/atomic/asm.s, line 24
TEXT ·CompareAndSwapInt32(SB),NOSPLIT,$0
	JMP	runtime∕internal∕atomic·Cas(SB)
```

The JMP instruction jumps to `runtime∕internal∕atomic·Cas(SB)`. The assembly differs per architecture; the amd64 version is:

```
// runtime/internal/atomic/asm_amd64.s, line 17
TEXT runtime∕internal∕atomic·Cas(SB),NOSPLIT,$0-17
	MOVQ	ptr+0(FP), BX    // load the first argument (addr) into the BX register
	MOVL	old+8(FP), AX    // load the second argument (old) into the AX register
	MOVL	new+12(FP), CX   // load the third argument (new) into the CX register
	LOCK                     // the key instruction of this article, detailed below
	CMPXCHGL	CX, 0(BX)    // compare AX (old) with the data at the address in BX (addr); if they are equal, store the first operand, CX (new), into the second operand
	SETEQ	ret+16(FP)       // SETEQ works together with CMPXCHGL: if the comparison was equal, set the return value of the function to 1, otherwise to 0 (16(FP) is the address of the return value, i.e. swapped)
	RET                      // function returns
```

The code above reveals the key of this article: `LOCK`. It is actually an instruction prefix, and it must be followed by a `read-modify-write` instruction, such as `ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, XCHG`.

# How LOCK is implemented

On early CPUs the LOCK prefix locked the bus, so that no other core could talk to memory through it and the locking core had exclusive access to memory. This solved the problem, but far too inefficiently, so Intel's P6 CPUs (P6 is an architecture, not a specific CPU) introduced an optimization: if the data is already in the CPU cache, lock the cache instead; otherwise, lock the bus.

# Cache coherency

[CPU Cache and False Sharing](https://blog.fanscore.cn/p/25/) introduces the structure of the CPU cache in detail. CPU caching brings consistency issues. A simple example:

```
var a int = 0

// executed on CPU0
go func fnInCpu0() {
	time.Sleep(1 * time.Second)
	a = 1 // 2. after CPU1 has loaded a, CPU0 modifies only its own core's cache without synchronizing to CPU1
}()

// executed on CPU1
go func fnInCpu1() {
	fmt.Println(a) // 1. CPU1 loads a into its own cache; at this point a=0
	time.Sleep(3 * time.Second)
	fmt.Println(a) // 3. CPU1 reads a=0 from its cache, but a has already been changed to 1 by CPU0
}()
```

In the example above, because the CPU does not guarantee cache consistency, the same data is not visible across the two cores and the program misbehaves. The CPU must therefore guarantee cache consistency; the following introduces how it does so through the **MESI protocol**.

MESI is short for the four cacheline states:

* M (Modified): the cacheline has been modified by this core, and it is guaranteed that no other core holds the line in its cache.
* E (Exclusive): the cacheline is exclusive to this core; no other core has a copy of the line. The core can modify the line directly without notifying other cores.
* S (Shared): the cacheline exists on multiple cores but has not been modified. The current core cannot modify it directly; modifications must be negotiated with the other cores.
* I (Invalid): the cacheline is invalid. This is the initial state of a cacheline, the state when the line is not in the cache, or the state when its content is out of date.
Negotiation and communication between cores requires the following message mechanism:

* Read: a CPU initiates a read request for data; the request contains the address of the data.
* Read Response: the response to a Read message. It may come from memory, or from another core (i.e. the address exists in another core's cacheline in the Modified state, in which case the latest data must be returned).
* Invalidate: a core notifies the other cores to set their corresponding cacheline to Invalid.
* Invalidate ACK: the other cores' response to an Invalidate notification; each sets the corresponding cacheline to Invalid and then sends the acknowledgement.
* Read Invalidate: equivalent to a Read message plus an Invalidate message, i.e. the current core wants to read and then modify the data.
* Write Back: write-back, i.e. writing Modified data back to the lower-level memory. Write-back delays memory updates as long as possible; the updated block is only written back to the lower level when the replacement algorithm wants to evict it.

### Hand-drawn state transition diagram

![image.png](//blog.fanscore.cn/static/img/blog/2021/01/21/2499322772995198327.png)

> There is a point of doubt here: when a CPU reads data from memory, should the I state transition to S or to E? The sources I consulted give both versions. I think it should be E, because then when another core wants to load a copy it only needs to fetch it from the current core without reading memory, which is more efficient. If you have a different opinion, please share it in the comments.

### Some rules

1. When a CPU modifies a cacheline, it asks the other cores holding copies of that cacheline to invalidate them, and receives their feedback via `Invalidate ACK`.
2. A cacheline in state M means the data in memory is stale; the latest data is in that cacheline.
3. When the data is in the cacheline: if the state is E, modify it directly; if the state is S, broadcast an `Invalidate` message, and after receiving the `Invalidate ACK`s modify it and change the state to M; if the state is I (including a cache miss), send a `Read Invalidate`.

# Store Buffer

When a CPU wants to modify data in the S state, it must send an Invalidate message and wait for the ACKs before it can write the data. This is obviously a synchronous process, which is unacceptable for a CPU that computes so fast, so it had to be optimized. The idea is to add a buffer between the CPU and the cache: the CPU can first write the data into this buffer and send the message, then go on doing other work, and write the data from the buffer into the cache once the responses arrive. But there is an obvious logical hole. Consider this code:

```
a = 1
b = a + 1
```

Suppose the initial value of a is 0. The CPU executes `a=1`; the data is written into the Store Buffer and has not yet landed in the cache. Then `b=a+1` is executed: because a has not actually been modified yet, the CPU still reads 0 and finally computes b=1. To fix this obvious logical hole, **Store Forwarding** was proposed: the CPU can read the data back out of the buffer and forward it to subsequent read operations, without having to read it from the cache.

![image.png](//blog.fanscore.cn/static/img/blog/2021/01/21/16079219243711812.png)

This fixes the hole above, but another problem remains. Look at the following code:

```
a = 0
flag = false

func runInCpu0() {
	a = 1
	flag = true
}

func runInCpu1() {
	while (!flag) {
		continue
	}
	print(a)
}
```

For the code above, assume the following steps:

1. Suppose a currently exists in cpu1's cache and flag exists in cpu0's cache, both in state E.
2. cpu1 executes `while(!flag)` first; because flag is not in its cache, it sends a Read flag message.
3. cpu0 executes `a=1`; a is not in its cache, so it writes a=1 into the Store Buffer and sends an Invalidate a message.
4. cpu0 executes `flag=true`; because flag is in its cache in state E, it writes flag=true directly into the cache and changes the state to M.
5. cpu0 receives the Read flag message and sends the cached flag=true back to cpu1, changing the state to S.
6. cpu1 receives cpu0's Read Response (flag=true) and exits the `while(!flag)` loop.
7. cpu1 prints a; at this point its cache still holds a=0, so it prints 0.
8. cpu1 only now processes the Invalidate a message and sets the cacheline state to I, but it is too late.
9. cpu0 receives the Invalidate ACK and flushes a=1 from the Store Buffer into the cache.

From the code's point of view, it is as if our code had become:

```
func runInCpu0() {
	flag = true
	a = 1
}
```

It appears to have been reordered. This is in fact a kind of **pseudo-reordering**, and a new mechanism had to be proposed to solve it.

# Write barriers

At the software-facing level the CPU provides a **write memory barrier** instruction to solve the problem above; Linux wraps it as the smp_wmb() function. It solves the problem by flushing the current contents of the Store Buffer into the cache before performing the writes that follow the barrier.

> SMP: Symmetric Multi-Processing.

You may wonder why, since the problem above is a hardware one, the CPU asks software developers to avoid it with instructions instead of solving it itself. The answer is simple: the CPU cannot abandon the huge performance improvement of the Store Buffer just for this, in the same way that it cannot abandon branch prediction just because mispredictions waste work and increase power consumption.
Taking the code above as an example, with the same premise, we add the write barrier:

```
a = 0
flag = false

func runInCpu0() {
	a = 1
	smp_wmb()
	flag = true
}

func runInCpu1() {
	while (!flag) {
		continue
	}
	print(a)
}
```

When cpu0 executes `flag=true`, because a=1 is still sitting in the Store Buffer and not yet in the cache, it first flushes a=1 into the cache and only then executes `flag=true`. When cpu1 reads flag=true, a is already 1.

> One article points out that CPUs also have another way of implementing the write barrier: the CPU marks the current entries in the store buffer, writes the stores after the barrier into the Store Buffer as well, and keeps doing other work; only once all the marked entries have been flushed into the cache does it flush the later ones.

# Invalid Queue

With pseudo-reordering solved by the write barrier, there is another problem to consider: the Store Buffer's size is limited, and when the Store Buffer is full the CPU still has to wait for Invalidate ACKs. The main reason an Invalidate ACK takes time is that a CPU must first set its own cacheline state to I and only then respond with the ACK; if a CPU is very busy, or copies in the S state are widespread, all the other CPUs may end up waiting for its ACK. The CPU's optimization for this problem is to add an Invalid Queue: the CPU first puts Invalidate messages into this queue and then responds with the Invalidate ACK right away. But this brings a new problem. Take the code above as an example:

```
a = 0
flag = false

func runInCpu0() {
	a = 1
	smp_wmb()
	flag = true
}

func runInCpu1() {
	while (!flag) {
		continue
	}
	print(a)
}
```

Assume a is in both CPU0 and CPU1 in state S, and flag is exclusive to CPU0:

1. CPU0 executes `a=1`; because a is in state S, it writes a=1 into the Store Buffer and sends an Invalidate a message.
2. CPU1 executes `while(!flag)`; because flag is not in its cache, it sends a Read flag message.
3. CPU1 receives CPU0's Invalidate a message, writes it into its Invalid Queue, and then responds with the Invalidate ACK.
4. CPU0 receives CPU1's Invalidate ACK, flushes a=1 into the cache, and changes its state to M.
5. CPU0 reaches smp_wmb(); because the Store Buffer is empty at this point, it continues on.
6. CPU0 executes `flag=true`; because flag is in state E, it writes flag=true directly into the cache and changes the state to M.
7. CPU0 receives the Read flag message; because flag is in its cache, it sends the Read Response and changes the state to S.
8. CPU1 receives the Read Response with flag=true and exits the while loop.
9. CPU1 prints a; because a is in its cache in state S, it prints the cached value directly. At this point a=0, which is clearly wrong.
10. Only now does CPU1 process the message in its Invalid Queue and set a's state to I, but it is too late.

To solve the problem above, CPUs provide a read barrier instruction, which Linux wraps as the smp_rmb() function. We place it in the code like this:

```
...
func runInCpu1() {
	while (!flag) {
		continue
	}
	smp_rmb()
	print(a)
}
```

When the CPU reaches smp_rmb(), it finishes processing the Invalid Queue before performing the reads after the barrier, which solves the problem above.

Besides the read and write barriers mentioned above, there is also a full barrier, which is effectively the combination of the two; on Linux it is the smp_mb() function. The LOCK prefix at the beginning of the article also acts as a memory barrier.

# A few questions

### Question 1: the CPU uses the MESI protocol to keep caches consistent, so why is LOCK still needed?

Answer:

1. The MESI protocol only maintains cache consistency; it is about visibility, not atomicity. A non-atomic instruction needs the lock prefix to guarantee atomicity.

### Question 2: is a single assembly instruction atomic?

1. `read-modify-write memory` instructions are not atomic. Take `INC mem_addr` as an example: assuming the data is cached, executing the instruction requires reading the data from the cache into the execution unit, performing the +1, and writing it back to the cache.
2. Reads of misaligned memory may require multiple reads and are therefore not atomic. (Some CPUs do not allow reading misaligned memory at all.)
3. Other unknown reasons...

### Question 3: atomic reads in Go

Let's look at an example of reading 8 bytes of data, the assembly of golang's `atomic.LoadUint64()` (this is the 32-bit x86 version, where a 64-bit load needs special handling):

```
// uint64 atomicload64(uint64 volatile* addr);
1. TEXT runtime∕internal∕atomic·Load64(SB), NOSPLIT, $0-12
2. 	MOVL	ptr+0(FP), AX  // load the first argument into the AX register
3. 	TESTL	$7, AX         // check whether the address is aligned
4. 	JZ	2(PC)              // jump two instructions ahead of this one, i.e. to line 6
5. 	MOVL	0, AX          // crash with nil ptr deref: dereferencing address 0x0 triggers a fault
6. 	MOVQ	(AX), M0       // load the data the address points to into the M0 register
7. 	MOVQ	M0, ret+4(FP)  // assign the data in M0 (what the memory points to) to the return value
8. 	EMMS                   // clear the M0 register
9. 	RET
```

On line 3, the TESTL instruction performs a bitwise AND of its two operands; if the result is 0 it sets ZF to 1, otherwise to 0. So this line actually checks whether the address passed in is a multiple of 8. On line 4, the JZ instruction jumps to the position specified by its operand if ZF (the zero flag) is 1. Combined with line 3, this means: if the address passed in is a multiple of 8, i.e. the memory is aligned, jump to line 6; otherwise continue on (and crash on line 5).
For more on memory alignment, see my article [Understanding memory alignment](https://blog.fanscore.cn/p/24/).

Although the MOV instruction is atomic, there seems to be no memory barrier in this assembly, so how does Golang achieve visibility? I don't fully understand this, but it probably means that Golang's atomic operations maintain sequential consistency.

### Question 4: atomic writes in Go

Again taking an operation on 8 bytes of data as an example, let's look directly at the assembly of golang's `atomic.StoreUint64()`:

```
TEXT runtime∕internal∕atomic·Store64(SB), NOSPLIT, $0-16
	MOVQ	ptr+0(FP), BX
	MOVQ	val+8(FP), AX
	XCHGQ	AX, 0(BX)
	RET
```

Although there is no LOCK instruction, the XCHGQ instruction has the effect of LOCK, so the store is still atomic and visible.

# Conclusion

This article took me a lot of time and energy to put together. Atomicity seemed like a small problem at first, but as I dug deeper and read countless materials, I discovered countless hidden pitfalls beneath it. Facing the vast computer world, I feel very small and ignorant.

[![s70KdH.png](https://s3.ax1x.com/2021/01/23/s70KdH.png)](https://imgchr.com/i/s70KdH)

For reasons of energy, there are still some very important points this article does not cover, such as acquire/release semantics. Objectively speaking, this article also contains problems of its own; if they cause you any trouble, I apologize, and I suggest you take this article as a starting point for studying the underlying computer architecture and research the topic yourself.

# References

* [The origin of the memory barrier](https://zhuanlan.zhihu.com/p/125549632)
* [MESI protocol - Wikipedia, the free encyclopedia](https://zh.wikipedia.org/wiki/MESI%E5%8D%8F%E8%AE%AE)
* [c++ - Does golang atomic.Load have a acquire semantics? - Stack Overflow](https://stackoverflow.com/questions/55787091/does-golang-atomic-load-have-a-acquire-semantics)
* [golang-notes/memory_barrier.md at master · cch123/golang-notes · GitHub](https://github.com/cch123/golang-notes/blob/master/memory_barrier.md)

Copyright notice

This article was created by [itread01]. When reposting, please include a link to the original. Thank you.
https://chowdera.com/2021/01/20210123232234480v.html
