当前位置:网站首页>Programmer programming art chapter 36 ~ 37, search intelligent tips suggestion, nearby point search

Programmer programming art chapter 36 ~ 37, search intelligent tips suggestion, nearby point search

2020-12-07 13:18:51 osc_ o494ayqf

        Thirty six ~ Chapter 37 、 Search for smart tips suggestion, Search for nearby places


author :July. thank :caopengcs、 Hu Guoguo .
Time : September 7, 2013 .


.

    Nearly three years of blogging , I have arranged too many written interview questions , Such as Microsoft Interview 100 Question series , And now this Programmer programming art series , I really think the topic changes year by year , But the solutions are always those , After careful preparation , I will find that there are traces to follow .

    So to better help people find jobs , We are going to hold a series of interview & Algorithm Lecture . It's a weekend , One morning or afternoon at a time , The target audience is friends who want to find or change jobs or are interested in algorithms , I'd like to pay the expenses in advance , How much you pay is up to you . Speaker : Me and the present zoj Number one caopengcs Doctor .9 month 15 Day is the first 1 Seminars http://blog.csdn.net/v_july_v/article/details/7237351#t22.

    OK, Cut to the chase . It says that I have arranged many written interview questions , But good written interview questions are really hard to ask , Including topics in each chapter of the programming art series , It's more difficult to pick from behind , But this article writes two questions which link with the reality , They come from this article http://blog.csdn.net/v_july_v/article/details/7974418  Of The first 3.6 topic , And the 87 topic , namely

  • Chapter 36 、 Intelligent key words in search engines suggestion;
  • Chapter 37 、 Search of nearby places ;
    Both topics of this paper are relatively open , There is no fixed standard answer . If readers have any comments , Or found any problems , Please feel free to leave a message or make corrections under this review , thank .




Chapter 36 、 Search key words intelligent prompt suggestion

Topic details : Baidu search box , Input “ Beijing ”, The search box will be prefixed with Beijing , Exhibition “ Beijing love story ”、“ Beijing bus ”、“ Beijing Hospital ” Wait for the search words , Input “ Structure ”, Will prompt “ The method of structure ”,“ The method of structure The way of algorithm ” Wait for search words .
Excuse me, , How to design this system , Make space and time complexity as low as possible .

Topic analysis : This question comes from last year 2012 Baidu in a set of trainee written examination questions in the system design questions ( For the sake of respect , This chapter mainly uses Baidu search engine to discuss , instead of google And other search engines , But the principle is not too bad . But away from the subject , Usually when searching , Encourage the use of ...), The topic is more open , The purpose of the investigation is to see if the candidate's thinking of solving the problem is clear , The second is to see how many details can be considered .

    When I sorted out this question last year , I have simply analyzed , The method proposed is :
  • Go straight up Trie Trees Trie See you for the introduction of the tree : from Trie Trees ( Dictionary tree ) When it comes to suffix trees 」 +  TOP Khashmap+ Pile up ,hashmap+ Pile up The statistics are as follows 10 A similar hot word , in other words , Only the ones that are similar to the key words, such as 10 A hot word
    That's how it works :Trie Trees +TOP K Algorithm , But in practice , Really as long as Trie Trees + TOP K Algorithm is enough , What are the details to consider ?OK, Please take a look at the following .

Solution 1 、Trie Trees + TOP K

  Step one 、trie The tree stores the Prefix suffix

    If you read this introduction in the blog Trie Tree and suffix tree articles http://blog.csdn.net/v_july_v/article/details/6897097 Words , Should be able to be right trie Trees have a general understanding of , To show the integrity of this article , Quote the original , as follows :
1.1、 What is? Trie Trees

    Trie Trees , Dictionary tree , Also known as word search tree or key tree , It's a tree structure , It's a variation of the hash tree . A typical application is to count and sort a large number of strings ( But it's not just about strings ), So it is often used in text word frequency statistics by search engine system . Its advantages are : Minimize unnecessary string comparisons , Query efficiency is higher than hash table .

    Trie The core idea is space for time . The common prefix of string is used to reduce the cost of query time to improve the efficiency .
It has 3 A basic property :

  1. The root node contains no characters , Each node contains only one character except the root node .
  2. From the root to a node , The characters passing through the path are concatenated , Is the string corresponding to the node .
  3. All children of each node contain different characters .

1.2、 The construction of trees

For example, it is widely spread on the Internet , as follows :
    subject : Here you are. 100000 No longer than 10 's words . For every word , We have to judge whether he's been around , If it does , Find the first place to appear .
    analysis : Of course, this question can be used hash To solve , But this article focuses on trie Trees , Because in some ways it's more useful . For example, for a certain word , We're going to ask if its prefix has ever appeared . such hash It's not easy , While using trie It's still very simple .
    Now back to the example , If we use the stupidest method , For every word , We're all going to find out if there's a word in front of it . So the complexity of this algorithm is O(n^2). Obviously for 100000 It's hard to accept the scope of . Now let's change our thinking . Let's say the word I want to query is abcd, So in the words before him , With b,c,d,f I obviously don't have to think about something like that at the beginning . And just look for a Is there... In the beginning abcd That's all right. . alike , In order to a In the beginning of the word , We just have to think about b As the second letter , Narrow the scope and improve the pertinence , The model of such a tree is gradually clear .
    It's like assuming that there is b,abc,abd,bcd,abcd,efg,hii this 6 Word , The tree we built is like this :


  When I first saw this picture , He immediately felt the extraordinary structure of this tree . Just from the above picture, we can see one or two , It's like searching people in the sea , We can immediately determine which direction in the southeast, northwest and middle , So quickly reduce the scope of search and improve the pertinence of search , It's an initiative .
    ok, As shown in the figure above , For each node , The process from root to root is a word , If the node is marked as red color , It means that the word exists , Otherwise, it doesn't exist .
    that , For a word , I just follow him from the root to the corresponding node , See if this node is marked in red to see if it has appeared . Mark this node in red , It's like inserting the word .

    Borrow the picture above , When the user enters a prefix a When , The search box may be displayed with a Prefixed “abcd”,“abd” Other keywords , When the user enters the prefix b When , The search box may prompt with b Prefixed “bcd” Other keywords , such , Realize search engine intelligent prompt suggestion The first step is clear , The box trie Trees store a lot of strings , When the current patch is fixed , Store relatively hot suffixes . So how to count hot words ? Please see Step 2 below 、TOP K Algorithm statistics hot words .

  Step two 、TOP K Algorithm statistics hot words

    When each search engine enters a prefix , It will only show 0~10 A candidate word , But if there are many candidates , How to choose , Which are shown in the front , Which are shown in the back ? This is a question of search popularity .

    As described in this title , At this time last year , When I search in the search box “ Beijing ” when , It will prompt with “ Beijing ” Prefixed such as “ Beijing love story ”,“ Beijing bus ”,“ Beijing Hospital ”, And “ Beijing love story ” Show in the first :

    Why input “ Beijing ”, Will first prompt “ Beijing love story ” Well ? Because this time last year , It is 《 Beijing love story 》 When the TV play was on the air ( Its release date is 2012 year 1 month 8 Japan , Fire for at least a year ), At that time, we all searched the relevant information of the TV play , When 10 Personal input “ Beijing ” after , Among them is 8 Individuals will continue to type in “ Love story ”( All in all “ Beijing love story ”) When , Search engines are certainly not indifferent to this .

    in other words , Search engines know this time period , Everyone is frantically searching for Beijing love story , Therefore, when the user input with “ Beijing ” When prefixed , Search engines guess that users have 80% The probability is to find “ Beijing love story ”, Therefore, the “ Beijing love story ” Here's a hint , And put it in the first place .

    But why search again this time of year “ Beijing ” When , It shows different words ?

    

    The reason is that over time , People are right. 《 Beijing love story 》 The attention of this TV play is decreasing , meanwhile , There are new hot words , Or a new movie , So now it's the same input “ Beijing ”, The words prompted later also changed accordingly . What's the way to solve this problem ? As I said at the beginning : Regularly analyze the keywords people search for in a certain period of time , Count the hot words with more searches , Then when the user enters a prefix , Give priority to hot words .

    So to put it bluntly , The second step of this problem is to count hot words , We call the method of counting hot words TOP K Algorithm , This is the application scenario of this algorithm http://blog.csdn.net/v_july_v/article/details/7382693 No 2 A question , Again, I quote :

Looking for hot queries ,300 The most popular query string among ten thousand 10 individual Inquire about

     The original title is : The search engine will record all the search strings used by the user every time through the log file , The length of each query string is 1-255 byte . Let's say there are 10 million records ( These query strings have high repeatability , Although the total is 1 Ten million , But if the duplicate is removed , No more than 3 Million . The higher the repetition of a query string , It means that the more users query it , That is, the more popular ), Please count the hottest 10 Query string , It is required that the memory used should not exceed 1G.

     answer : By the above 1 topic , We know , Big data is small , Like 100 million Ip seek Top 10, May first %1000 take ip be assigned to 1000 Go to... In a little file , And guarantee a kind of ip It only appears in one file , Then for each small file ip Conduct hashmap Count and sort by quantity , The last merge or minimum heap processes each small file in turn top10 To get the final result .

    But if the data size itself is smaller , It can be loaded into memory at one time ? For example, No 2 topic , Although there are ten million Query, But because of the high degree of repetition , So in fact, there is only 300 Ten thousand Query, Every Query255Byte, So we can think about putting them all in memory (300 Ten thousand strings assume no repetition , It's the maximum length , So it takes up the most memory 3M*1K/4=0.75G. So all strings can be stored in memory for processing ), Now it just needs a proper data structure , ad locum ,HashTable It's definitely our priority .

     So we give up divide and rule /hash Mapping steps , Go straight up hash Statistics , Then sort .So, For such Typical TOP K problem , The response is often :hashmap + Pile up . As shown below :

  1. hashmap Statistics : First, preprocess the massive data . The way to do it is : Maintain a Key by Query String ,Value For the sake of Query The number of times HashTable, namely hash_map(Query,Value), Read one at a time Query, If the string is not in Table in , Then add the string , And will Value Value to 1; If the string is in Table in , Then add one to the count of the string . In the end, we are O(N) Of time complexity Hash The table completes the statistics ;
  2. Pile up Sort : The second step 、 With the data structure of heap , find Top K, The time complexity is N‘logK. With the help of the reactor structure , We can do it in log Search and adjust in time of magnitude / Move . therefore , Maintain a K( The topic is 10) A small pile of roots , Then traverse 300 Ten thousand Query, Compare with root element respectively . therefore , Our ultimate time complexity is :O(N) + N' * O(logK),(N by 1000 ten thousand ,N’ by 300 ten thousand ).

    Don't forget the idea of heap sorting described in this article :‘ maintain k The smallest heap of elements , That is, the capacity used is k The smallest heap store of the first to traverse k Number , And assume that they are the biggest k Number , It takes time to build a pile O(k), And adjust the pile ( Time consuming O(logk)) after , Yes k1>k2>...kmin(kmin Set as the minimum element in the small top heap ). Continue traversing the sequence , One element at a time x, Compared with the top element , if x>kmin, Then update the heap (x Into the heap , when logk), Otherwise, the heap will not be updated . So down , It takes time O(k*logk+(n-k)*logk)=O(n*logk). This method benefits from being in the pile , The time complexity of search and other operations is logk.’-- The third chapter continued 、Top K The realization of algorithm problem .
    Of course , You can also use trie Trees , The number of times the query string appears in the keyword field , There is no such thing as 0. Last use 10 The minimum push of elements to sort the frequency of occurrence .

    Believe in , such , It is not difficult to understand the method proposed at the beginning :Trie Trees +  TOP K「hashmap+ Pile up ,hashmap+ Pile up The statistics are as follows 10 A similar hot word , in other words , Only the ones that are similar to the key words, such as 10 A hot word 」.

    And you can tell your friends later , Why input “ Structure ”, There will be a bunch of hints to “ Structure ” A word with a prefix :

    

    The method seems to have taken shape , But what are the details that need attention ? Such as @ Jiangshen _Johnson said :“ In practice , For example, when the current patch is very short , When there are many candidates , There may be problems with query and sorting performance , Maybe we can add a layer of index trie( This layer index can only index words whose frequency is higher than a certain threshold , Check this in a short time . If the number is not enough, check and index all the words trie Trees ); And sometimes it can't be based on query Frequency , To guide users to input more comprehensive information query, Or maybe it's not just prefix matching .”

Extended reading

    In addition to the above trie Trees , Trident tree may also be a good solution :http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/. Besides ,StackOverflow There are also two discussion posts on , You can see that :①http://stackoverflow.com/questions/2901831/algorithm-for-autocomplete,②http://stackoverflow.com/questions/1783652/what-is-the-best-autocomplete-suggest-algorithm-datastructure-c-c.


Chapter 37 、 Search for nearby places

Topic details : Find a point set closest to a given point , meanwhile , Given two-dimensional point sets are fixed , There may be many queries , Time complexity O(n) unacceptable , Please design the data structure and the corresponding algorithm .

Topic analysis : This is a three-way question from Microsoft last year , Like a friend @ Chen Li Ren The question that came out : Search for nearby places , It is to search for places near users . With GPS And with GPS The popularity of functional mobile devices , Search for nearby places has become hot . Search for places in a huge geographic database , Index is very important . however , Our need is to search nearby places , for example , coordinate (39.91, 116.37) near 500 What kind of restaurant is there in Minai , So let you design , What to do ?

Solution 1 、R Tree 2D search

    Suppose that only junior high school mathematics is allowed , Then you may build a X-Y Coordinate system , That is to coordinate (39.91, 116.37) For the center of a circle , With 500 The length of is radius , Draw a garden , Then one coordinate point to find . This seems to work , But the complexity can be imagined , Even if you think you're smart enough to divide the whole plane into four quadrants , A quadrant by quadrant search , This is not optimized enough , But it also means that you come up with ideas step by step .

    That is, the search of different coordinate points , It's an area by area search , relatively speaking , Its average search speed and efficiency will be significantly improved . such , Then I naturally thought about whether there is a data structure that locates in an area at a time ?

    If you read the introduction in the blog before R This article of the tree http://blog.csdn.net/v_JULY_v/article/details/6530142#t2  Our readers will immediately realize that ,R Tree is to solve the problem of finding and then reducing the scale of this area . Quote the original directly :

R The data structure of the tree

    R Trees are B The expansion of trees in high dimensional space , It's a balance tree . Every R The leaf node of the tree contains several pointers to different data , The data can be stored in the hard disk , It can also be stored in memory . according to R This data structure of the tree , When we need to do a high dimensional spatial query , We only need to traverse the pointers contained in a few leaf nodes , Check whether the data pointed by these pointers meet the requirements . This way we don't have to traverse all the data to get the answer , Significant improvement in efficiency . The figure below 1 yes R A simple example of a tree :

    We said that above ,R Trees use the idea of space division , How does this idea come true ?R The tree uses a type called MBR(Minimal Bounding Rectangle) Methods , Here I translate it into “ Minimum bounding rectangle ”. Start with a rectangle from the leaf node (rectangle) Frame the space , Node up , The more space you frame , In order to divide the space . I don't understand ? No problem , Keep looking down . I would also like to mention ,R In the tree R It should represent Rectangle( For reference here wikipedia on R Trees Introduction to ), Not what most domestic textbooks say Region( A lot of books R Trees are called area trees , This is wrong ). Let's take two-dimensional space as an example . The picture below is Guttman A picture in the paper :

 

    Let me explain this picture in detail .

  1. Look at the picture first (b), First of all, let's assume that all the data are points in two-dimensional space , The picture just marks R8 Data in the area , That's the one shape of data object. Don't think of that irregular figure as a piece of data , We think of it as a region of data . In order to achieve R Tree structure , We use a minimum bounding rectangle to frame the irregular region , such , We've constructed an area :R8.R8 The characteristics are obvious , Just frame all the data in this area .
  2. Other areas surrounded by solid lines , Such as R9,R10,R12 It's the same thing . thus , We got... Together 12 The most basic smallest rectangle . These rectangles will be stored in the child nodes .
  3. The next step is to do a higher level of processing . We found that R8,R9,R10 The three rectangles are the closest , So you can use a bigger rectangle R3 Just frame this 3 A rectangle .
  4. In the same way ,R15,R16 By R6 Just frame ,R11,R12 By R4 Just frame , wait . After all the most basic minimum bounding rectangles are framed into larger rectangles , Iterate again , Frame these rectangles with larger frames .

    I think everyone should understand the characteristics of this data structure . Use the example of a map to explain , All the data is the location of the restaurant , First, divide the adjacent restaurants into the same area , After dividing all the restaurants , Then divide the adjacent areas into larger areas , After the division is completed, a higher level division will be carried out again , Until it's divided into only two largest areas . It's convenient to search .

    Now we can store these large and small rectangles in our R In the tree . The root node holds the two largest rectangles , These two largest rectangles frame all the remaining rectangles , Of course, it frames all the data . The nodes of the next layer store the next largest rectangle , These rectangles narrow the range . Each leaf node is the smallest rectangle in storage , These rectangles may contain n Data .

Examples of map searches

    After the basic data structure , Let's talk about an example , How to query specific data . Take the restaurant as an example , Suppose I want to find out the address of all the restaurants one kilometer away from Tianhe City, Tianhe District, Guangzhou ?

  1. Open the map ( That is the whole R Trees ), Choose domestic or foreign first ( That is, the root node );
  2. Then choose South China ( Corresponding to the node of the first layer ), Choose Guangzhou ( Corresponding to the second layer node ),
  3. Then choose Tianhe District ( Corresponding to the third layer nodes );
  4. Finally, choose the area where Tianhe City is located ( Corresponding leaf node , Store the smallest rectangle );

    Traverse all nodes in this region , To see if we can meet our requirements . What about? , Actually R The search rules of trees are very similar to those of maps ? Corresponding to the figure below :

        

A tree R Trees satisfy the following properties :

  1. Unless it's outside the root , All leaf nodes contain m to M A record index ( entry ). The number of records of leaf node as root node can be less than m. Usually ,m=M/2.
  2. For all records stored in leaves ( entry ),I Is the smallest rectangle that can completely cover the points represented by these records in space ( Be careful : Here say “ rectangular ” It can be extended to high-dimensional space ).
  3. Every non leaf node has m to M A child node , Unless it's the root .
  4. For each entry on a non leaf node ,i It is the smallest rectangle that can completely cover the store represented by these items in space ( Of the same nature 2).
  5. All leaf nodes are on the same layer , therefore R A tree is a balance tree .

The structure of leaf nodes

Let's first explore the structure of leaf nodes . The data saved by the leaf node is in the form of :(I, tuple-identifier).

       among ,tuple-identifier Represents a... Stored in a database tuple, It's a record , It is n Dimensional .I It's a n The rectangles of dimensional space , And you can just frame all the records in this leaf node n Points in dimensional space .I=(I0,I1,…,In-1). The structure shown below :

    The following figure describes the information to be stored by leaf nodes in two-dimensional space .

 

    In this picture ,I It represents the rectangle in the figure , Its scope is a<=I0<=b,c<=I1<=d. There are two tuple-identifier, These two points are shown in the figure . This form can be extended to high dimensional space . Just think about the three-dimensional space . such , The structure of leaf node is introduced .

Non leaf node

       The structure of non leaf node is very similar to leaf node . Imagine B The tree knows ,B The leaf node of the tree stores the real data , Instead of leaf nodes, they store the data “ The border ”, Or an index ( Readers with questions can review the first section above B Part of the tree ).

       In the same way ,R The data structure of the non leaf node of the tree is :(I, child-pointer).

       among ,child-pointer It's a pointer to a child's node ,I It's a rectangle covering all child nodes . It's a bit of a mouthpiece here , But I don't think it's hard to understand ? Give me a picture :

 

    D,E,F,G Rectangle corresponding to child node .A For larger rectangles that can cover these rectangles . This A This is the rectangle corresponding to the non leaf node . At this time, you should realize ? Whether leaf node or non leaf node , They all correspond to a rectangle . The rectangle corresponding to the node at the top of the tree structure can completely cover the rectangle corresponding to its child node . The root node also uniquely corresponds to a rectangle , And this rectangle can cover all the points represented by the data information we have in space .

    I personally feel that this picture is not so accurate , It should be a rectangle A To cover exactly D,E,F,G, Instead of leaving so much useless space . But in order to respect the painter of the original , There is no modification .

    but R What's wrong with trees ? Such as @ Song Xiao _CD said :“ Simple to use R Trees to index , Search nearby places , It's possible to traverse many branches of the tree . And when maps of the whole country or the whole province , There are many leaf nodes in the tree , The depth of the tree can also be a problem . In general, the location of the nearby nodes ( Two dimensional map point line surface ) Pretreated to page( The size is 4K Multiple ), In these page On the establishment of R The index of the tree .”

Solution 2 、GeoHash Algorithms index geographic location information

    When I discussed the problem of searching nearby points with some friends on Weibo , Apart from talking about R Trees , Several friends pointed out that GeoHash The algorithm can solve , That's why I got to know GeoHash Algorithm , This article http://blog.nosqlfan.com/html/1811.html  Articulate MongoDB With the help of GeoHash Algorithm to achieve the principle of geographic location index , It's explained by quoting its contents , as follows :

Support for geolocation indexing is MongoDB A highlight of , This is also the most popular LBS service foursquare  choice MongoDB One of the reasons . We know , The usual database index structure is B+ Tree, How to turn a geographic location into a buildable B+Tree In the form of . Let's first assume that we divide the whole map that we need to index into 16×16 Of the lattice , Here's the picture ( The lower left corner is the coordinate 0,0  The upper right corner is the coordinate 16,16):

     pure [x,y] The data of cannot be indexed , therefore MongoDB When building an index , Will be calculated according to the coordinates of the corresponding field can be used to index hash value , This value is called geohash, Now we take the coordinates on the map as [4,6] The point of ( The red fork in the picture ) For example . The first step is to divide the whole map into four equal sized pieces , Here's the picture :

    We can define the values of these four blocks after dividing them into four blocks , as follows ( The bottom left is 00, On the top left is 01, At the bottom right is 10, On the top right is 11):

    
    such [4,6] Dot geohash At present, the value is  00 Then cut each of the four pieces , as follows :
    At this time [4,6] The point is in the upper right area , The value on the top right is 11, such [4,6] Dot geohash Value to :0011 Continue to do the next two shards :


    The resulting [4,6] Dot geohash The value is :00110100
    So we use this value to index , Then the similar points on the map can be converted into the points with the same prefix geohash The value of .
    We can see , This geohash The accuracy of the value is proportional to the number of times the map is divided , The above example divides the map four times . and MongoDB The default is to do 26 Sub division , This value is controllable when indexing . The specific commands for building a two-dimensional geographic location index are as follows :
db.map.ensureIndex({point : "2d"}, {min : 0, max : 16, bits : 4})
    Among them bits The parameter is to divide it several times , The default is 26 Time . 

    Readers comment on @yuotulck: First of all, thank you for your article , But if it's new ( For example, I ) notice geohash There may be misunderstandings : Adjacency can be compared by prefixes ? In fact, this is wrong , For example, the prefix of the adjacent area code of the boundary block is different from the first one , That is to say geohash Close points in hash Values are not necessarily close .
    The knowledge above is from :http://www.cnblogs.com/step1/archive/2009/04/22/1441689.html, and geohash We can learn more about :
http://tech.idv2.com/2011/07/05/geohash-intro/.

    End of this chapter .


Reference links and recommended readings

  1. 2012 September, October, 2010 written interview 80 questions :http://blog.csdn.net/v_july_v/article/details/7974418;
  2. from Trie Trees ( Dictionary tree ) When it comes to suffix trees :http://blog.csdn.net/v_july_v/article/details/6897097;
  3. Teach you how to kill quickly :99% Mass data processing interview questions :http://blog.csdn.net/v_july_v/article/details/7382693;
  4. from B Trees 、B+ Trees 、B* The tree said R Trees :http://blog.csdn.net/v_july_v/article/details/6530142;
  5. The illustration MongoDB The principle of geographic location index :http://blog.nosqlfan.com/html/1811.html;
  6. 《Hbase actual combat 》 The first 8 Chapter 、 stay HBase Query GIS on ;


版权声明
本文为[osc_ o494ayqf]所创,转载请带上原文链接,感谢
https://chowdera.com/2020/12/20201207131536565o.html