# I've been using these tips to improve the efficiency of code!

2021-05-04 16:57:00

The purpose of our program is to make it work stably in any situation . A program that runs very fast but gets the wrong result is useless .

In the process of program development and optimization , We have to think about how the code is used , And the key factors that affect it .

Usually , We have to make a trade-off between the simplicity of the program and its speed . Today we will talk about how to optimize the performance of the program .

#### One 、 Reduce the amount of program calculation

1. Sample code 2. Analysis of the code

Code as above , Every time the outer loop is executed , We're going to do a multiplication calculation .i = 0,ni = 0;i = 1,ni = n;i = 2,ni = 2n. therefore , We can replace multiplication with addition , With n Step length , This reduces the amount of code in the outer loop .

3. Improve the code The multiplication instruction in a computer is much slower than the addition instruction .

#### Two 、 Extract the common parts of the code

1. Sample code

Imagine , We have an image , We represent the image as a two-dimensional array , Array elements represent pixels . We want to get the east of a given pixel 、 south 、 In the west 、 The sum of the four neighbors in the North . And find their average or their sum . The code is as follows . 2. Analysis of the code

After compiling the above code, the assembly code is as follows , Pay attention to 3,4,5 That's ok , There are three times n The multiplication of . Let's take the top one up and down After expansion, you will find that all four lattice expressions have i*n + j. therefore , You can extract the public part , And then through addition and subtraction, we can get up、down Equal value . 3. Improve the code The assembly of the improved code is as follows . There's only one multiplication after compilation . Less 6 Clock cycles （ A multiplication period is about 3 Clock cycles ）. about GCC For compilers , Compiler can be optimized according to different levels , There are different ways to optimize , Will automatically complete the above optimization operation . Let's introduce , Those have to be optimized manually .

#### 3、 ... and 、 Eliminate inefficient code in loops

1. Sample code

There seems to be nothing wrong with the program , A very common case conversion code , But why is it that as the input length of the string gets longer , The execution time of the code will increase exponentially ？ 2. Analysis of the code

So let's test the code , Enter a series of strings . lower1 Code performance testing

When the length of the input string is less than 100000 when , There is little difference in the running time of programs . however , As the length of the string increases , The running time of the program increases exponentially .

We convert the code into goto Look at the form . The above code is divided into initialization （ The first 3 That's ok ）, test （ The first 4 That's ok ）, to update （ The first 9,10 That's ok ） In the third part of . Initialization will only be performed once . But tests and updates are executed every time . Every time you do a cycle , Will be right. strlen Call once .

So let's see strlen Function source code is how to calculate the string length . strlen Function to calculate the length of a string ： Traversal string , Until I met ‘\0’ Will stop . therefore ,strlen The time complexity of the function is O（N）.lower1 in , For length is N In terms of string ,strlen The number of calls to is N,N-1,N-2 ... 1. For a linear time function call N Time , Its time complexity is close to O（N2）.

3. Improve the code

For such redundant calls in loops , We can move it out of the loop . Apply the results to the cycle . The improved code is as follows . Compare the two functions , As shown in the figure below .lower2 The execution time of the function is significantly improved . lower1 and lower2 Code efficiency

#### Four 、 Eliminate unnecessary memory references

1. Sample code

The following code functions as , Calculation a The sum of all elements in each row of the array exists b[i] in . 2. Analysis of the code

The assembly code is as follows ： This means that each loop needs to be read from memory b[i], Then take it. b[i] Write back to memory .b[i] +=  b[i] + a[i*n + j]; Actually, at the beginning of each cycle ,b[i] It's the last value . Why read it from memory and write it back every time ？

3. Improve the code The assembly looks like this ： The improved code introduces temporary variables to save intermediate results , Only when the final value is calculated , To store the result in an array or global variable .

#### 5、 ... and 、 Reduce unnecessary calls

1. Sample code

For the convenience of example , We define a structure that contains an array and its length , The main purpose is to prevent array access from crossing the bounds ,data_t It can be int,long Other types . The details are as follows .  vec Vector diagram

get_vec_element Function is used to traverse data Array and stored in val in . We'll take the following code as an example to start the step-by-step optimization process . 2. Analysis of the code

get_vec_element The function gets the next element , stay get_vec_element Function , Every cycle has to do with v->len The comparison , To prevent cross-border . It's a good habit to do border checks , But every time it's done, it's inefficient .

3. Improve the code

We can move the code for the length of the vector to extracorporeal circulation , At the same time, the abstract data type adds a function get_vec_start. This function returns the starting address of the array . So there are no function calls in the loop body , It's direct access to arrays . #### 6、 ... and 、  Loop unrolling

1. Sample code

We are combine2 Improve on the code of .

2. Analysis of the code

Loop unrolling is done by increasing the number of elements calculated for each iteration , Reduce the number of iterations of the loop .

3. Improve the code In the improved code , The first loop processes two elements of the array at a time . That is, every iteration , Loop index i Add 2, In an iteration , For array elements i and i+1 Use the merge operation . In general, we call it 2×1 Loop unrolling , This transformation can reduce the impact of loop overhead .

Pay attention not to cross the border , Set correctly limit,n Elements , Generally set the boundaries n-1

#### 7、 ... and 、 Cumulative variable , Multi channel parallel

1. Sample code

We are combine3 Improve on the code of .

2. Analysis of the code

For a combinable and exchangeable merge operation , For example, integer addition or multiplication , We can divide a combination into two or more parts by combining and calculating , And merge the results at the end to improve performance .

Particular attention ： Don't combine floating point numbers easily . The encoding format of floating-point numbers is different from other integer numbers .

3. Improve the code The above code is expanded in two loops , To merge more elements per iteration , Two parallel channels are also used , Accumulate elements with even index values in variables acc0 in , And elements with odd index values accumulate in variables acc1 in .

therefore , Let's call this ”2×2 Loop unrolling ”. Application 2×2 Loop unrolling . By maintaining multiple cumulative variables , This approach takes advantage of multiple functional units and their pipelining capabilities

#### 8、 ... and 、 Recombine transformation

1. Sample code

We are combine3 Improve on the code of .

2. Analysis of the code

In fact, the performance of the code is close to the limit , Even if we do more loop deployment, the performance improvement is not obvious .

We need to think differently , Pay attention to combine3 In the code 12 Lines of code , We can change the order of merging the next vector elements （ Floating point numbers don't apply ）. Before reconnection combine3 The critical path of the code is shown in the figure below . combine3 The critical path to code

3. Improve the code Re combining transformation can reduce the number of operations on the critical path in computation , This method increases the number of operations that can be performed in parallel , Make better use of the pipeline capability of functional units to get better performance . The critical path is as follows . combine3 The critical path after recombining

#### Nine 、 Conditional delivery style code

1. Sample code 2. Analysis of the code

The pipeline performance of modern processor makes the processor work far ahead of the current instructions .

Branch prediction in the processor predicts where to jump next when it encounters a comparison instruction . If the prediction is wrong , We're going to go back to the branch jump . Branch prediction error will seriously affect the execution efficiency of the program .

therefore , We should write code to improve the prediction accuracy of the processor , That is to use conditional transfer instruction . We use conditional operations to calculate values , Then use these values to update the program state , As shown in the improved code .

3. Improve the code In the first part of the original code 4 In line , Need to be right a[i] and b[i] Compare , Go on to the next step , The consequence is to make predictions every time .

The improved code implements this function to calculate each location i Maximum and minimum of , Then assign these values to a[i] and b[i], Instead of branch prediction .

#### * summary

We introduced several techniques to improve code efficiency , Some are automatically optimized by the compiler , Some of them need to be realized by ourselves . It is summarized as follows ：

Eliminate continuous function calls . It is possible that , Move the calculation out of the loop . Consider selectively compromising the modularity of the program to achieve greater efficiency .

Eliminate unnecessary memory references . Introduce temporary variables to save intermediate results . Only when the final value is calculated , To store the result in an array or global variable .

Expand the cycle , Reduce overhead , And make further optimization possible .

By using techniques such as multiple cumulative variables and recombination , Find ways to improve instruction level parallelism .

Rewrite conditional operations in a functional style , Make compilation conditional data transfer . Whether you change careers or not , It's better to be a beginner , It's OK to be advanced , If you want to learn programming , Advanced programmers ~

【 Worthy of attention 】 my Programming learning exchange Club ！【 Click to enter 】

C Introduction to language ： C Language must read books ： https://chowdera.com/2021/05/20210504165533161B.html