当前位置:网站首页>The underlying principle of left join in SQL (Exploring the complexity of various joins)

The underlying principle of left join in SQL (Exploring the complexity of various joins)

2021-10-14 06:23:37 Old fellow iron has dried up the code.

01. Preface

Have written or learned SQL We should all know left join, know left join The effect of the realization of , That is to keep all the information in the left table , Then put the right watch on the left one , If you can't spell it, it's null. except left join outside , also inner join、outer join、right join, These are different join What kind of effect can be achieved , We should all know , If you don't know, you can look at the online post or any one of them SQL It's all about books . We're not going to talk about that today join What can be achieved , We mainly talk about these join How is the underlying principle of the implementation of , That is, how the concrete effect is presented .

join There are mainly Nested Loop、Hash Join、Merge Join These three ways , We're just going to talk about the most common , It's also the best understood Nested Loop,Nested Loop Nested loop means nested loop , What is nested loop ? Nesting should be understandable , It's a layer by layer ; What about the cycle , You can understand it as for loop .

Nested Loop There are three kinds of connection ways of subdivision , Namely Simple Nested-Loop Join、Index Nested-Loop Join、Block Nested-Loop Join, Next, we will take a look at the connection of these three subdivisions .

Before the official start , Let's start with two concepts : The driver table ( It's also called appearance ) And the driven table ( It's also called a non driver table , It can also be called a matching table , It can also be called internal watch ), Simply speaking , The driver table is the main table ,left join The left table in is the driver table ,right join The right table in is the driver table . One is the driver table , Then the other one can only be a non driver table , stay join In the process of , In fact, it's from the driver list ( Pay attention to understand the order in this ) Take out every value , Then go to the non driver table for matching , How exactly does it match ? These are the three kinds of connections that we're going to talk about next .

02.Simple Nested-Loop Join

Simple Nested-Loop Join It's the simplest of the three , It's better to understand , It's also a connection that most conforms to people's cognition , Now there are two tables table A and table B, We let table A left join table B, If it's the first way to connect , How to match it ? Directly above :
 Insert picture description here

above left join From the driver table table A Take out each value in turn , Then go to the non driver table table B From top to bottom , Then return the matched value , Finally, merge all the returned values , So we find table A left join table B Result . Is it the same as your perception ? Use this method , If table A Yes 10 That's ok ,table B Yes 10 That's ok , In total, it needs to be carried out 10 x 10 = 100 Queries .

This method of violent matching is generally not used in the database .


03.Index Nested-Loop Join

Index Nested-Loop Join In this way , We see that Index, As you all know, this is what index means , This Index It requires the index on the non driven table , With an index, you can reduce the number of matches , Reducing the number of matching can improve the efficiency of the query .

Why can we reduce the number of queries after the index is available ? This actually involves some knowledge in the data structure , Just give you an example .

 Insert picture description here

On the left side of the figure above is the storage method of ordinary columns , On the right is a tree structure index , What is the structure of a tree ? It's data distribution layer by layer like tree , A feature of tree structure is that the data on the left is less than the number of vertices , The number on the right is greater than the number of vertices , Look at the picture on the right , The number on the left 3 Is it less than the vertex 6, The number on the right 7 Is it greater than the vertex 6; The number on the left 1 Is it less than the vertex 3, The number on the right 4 Is it greater than the vertex 3.

If we want to match numbers now 9, If this is the data storage mode on the left , We need to match from the first line to the last line in order to find the value 9, All in all, you need to match 7 Time ; But if we use the tree structure index on the right , Let's take it first. 9 And the top vertex 6 To match , Find out 9 Than 6 Big , We'll go to the right of the apex to find , Go back and 7 matching , Find out 9 Still more than 7 Big , Go again 7 To the right of , I found it 9, So we only match 3 Take what we want 9 eureka . Is it a match 7 It saves a lot of time .

The index in the database is usually used B+ Trees , In order to better understand , The picture I drew above is just one of the simplest tree structures , Not the real B+ Trees , But the principle is the same .

If the index is a primary key , More efficient , Because the primary key must be unique , So if the driven table is joined with a primary key , There will only be many to one or one-to-one situations , There won't be many to many and one to many .

04.Block Nested-Loop Join

Ideally , Index matching is the most efficient way , But in real work , Not all columns are indexed columns , It's needed at this time Block Nested-Loop Join The method , This method is similar to the first one , The only difference is that it will put the driver table left join All the columns involved ( It's not just for on The column of , also select Columns of parts ) Take it out and put it in a cache area , And then match it with the non driver table , This method requires the same number of matches as the first method , The difference lies in the number of columns that drive the table , That is, the amount of data is different . So although the number of matches doesn't decrease , But the overall query performance has been improved .

Join Operation is a common database operation , adopt Join You can associate multiple tables , Provide data together according to the user's conditions . General situation , There are many built-in in databases Join Algorithm , The optimizer will optimize according to SQL Select the appropriate algorithm based on the statistics of statements and tables .

Hash Join

 Insert picture description here

In execution Hash Join when ,1. Will be based on Join The condition is a table Hash The operation is loaded into memory Hash In the table .Hash Table similar to Java Medium HashTable;2. Traverse another table , Conduct Hash After the operation, find the records that meet the conditions in memory .

select * from t1 join t2 on t1.a = t2.b; In the execution of this SQL When , Load the table first t1 The data of , Then according to the table t1 Of a Field as key structure Hash surface . after , From the table t2 Take out the records one by one , Calculated field b Of Hash value , Go to Hash Find out whether there are records that meet the conditions in the table .

Hash Join High performance , But the precondition is that one of the tables can be stored in memory Hash surface . Therefore, it is generally applicable to size tables Join. In some big data analysis data query engines , When the memory can't hold this Hash Table time , The small table will be partitioned and saved to disk , We'll do it later Join.

Nested loop Join

 Insert picture description here

Nested loop Join in , At least one table has an index , And Join The condition is the comparison of index columns . A table with an index is used as the retrieved table , Traverse the smaller of the two tables without or with indexes . This algorithm makes full use of the advantages of index , Give Way Join The time complexity of is from O(m*n) Turned into O(n), among m Is the number of rows in the retrieved table ,n Is the number of rows traversing the table .


Merge Hash

Compared with the above two algorithms , The performance of this algorithm is worse , But it can be used more widely . In this algorithm , Sort the data in the two tables , Then take a paragraph separately for Join.


Semi Join

Half a connection , For the table on the left, output the records that meet the conditions , The table on the right will not be output regardless of whether the conditions are met , That is to say , The final result is a subset of the data records in the left table , Be similar to in、exists.Semi Join Itself is Join A kind of . In the query of big data across data sources ,Semi Join It's right inner join、left join、right join An optimization of . When querying across data sources , Minimizing the amount of data from each data source is a very effective optimization method , After all, network transmission takes time . take Join Turn it into Semi Join Is an effective way to reduce the amount of data .

about :select * from t1 join t2 where t1.a = t2.b,Semi Join The process is as follows :

1. Will table t1 The data is loaded into memory ;

2. according to t1 The data of , Overwrite load table t2 Conditions , the SQL Change the sentence to in、exists etc. . Hypothesis table t1 in , All recorded a Field has only two values :aa and bb, that SQL Will be rewritten as select * from t2 in (‘aa’,‘bb’);

3. Pair slave table t1 and t2 Load the data to do Join;

The first 2 Load... In step t2 Data SQL Rewriting of , Make it necessary to load the whole t2 Change the table to load only t2 Data that meet the conditions in .

版权声明
本文为[Old fellow iron has dried up the code.]所创,转载请带上原文链接,感谢
https://chowdera.com/2021/10/20211002145845309l.html

随机推荐