当前位置：网站首页>Programmer programming art chapter 36 ~ 37, search intelligent tips suggestion, nearby point search
Programmer programming art chapter 36 ~ 37, search intelligent tips suggestion, nearby point search
20201207 13:18:51 【osc_ o494ayqf】
Thirty six ~ Chapter 37 、 Search for smart tips suggestion, Search for nearby places
author ：July. thank ：caopengcs、 Hu Guoguo .
Time ： September 7, 2013 .
.
Nearly three years of blogging , I have arranged too many written interview questions , Such as Microsoft Interview 100 Question series , And now this Programmer programming art series , I really think the topic changes year by year , But the solutions are always those , After careful preparation , I will find that there are traces to follow .
So to better help people find jobs , We are going to hold a series of interview & Algorithm Lecture . It's a weekend , One morning or afternoon at a time , The target audience is friends who want to find or change jobs or are interested in algorithms , I'd like to pay the expenses in advance , How much you pay is up to you . Speaker ： Me and the present zoj Number one caopengcs Doctor .9 month 15 Day is the first 1 Seminars ：http://blog.csdn.net/v_july_v/article/details/7237351#t22.
OK, Cut to the chase . It says that I have arranged many written interview questions , But good written interview questions are really hard to ask , Including topics in each chapter of the programming art series , It's more difficult to pick from behind , But this article writes two questions which link with the reality , They come from this article http://blog.csdn.net/v_july_v/article/details/7974418 Of The first 3.6 topic , And the 87 topic , namely
 Chapter 36 、 Intelligent key words in search engines suggestion;
 Chapter 37 、 Search of nearby places ;
Chapter 36 、 Search key words intelligent prompt suggestion
Excuse me, , How to design this system , Make space and time complexity as low as possible .
Topic analysis ： This question comes from last year 2012 Baidu in a set of trainee written examination questions in the system design questions （ For the sake of respect , This chapter mainly uses Baidu search engine to discuss , instead of google And other search engines , But the principle is not too bad . But away from the subject , Usually when searching , Encourage the use of ...）, The topic is more open , The purpose of the investigation is to see if the candidate's thinking of solving the problem is clear , The second is to see how many details can be considered .
 Go straight up Trie Trees 「Trie See you for the introduction of the tree ： from Trie Trees （ Dictionary tree ） When it comes to suffix trees 」 + TOP K「hashmap+ Pile up ,hashmap+ Pile up The statistics are as follows 10 A similar hot word , in other words , Only the ones that are similar to the key words, such as 10 A hot word 」
Solution 1 、Trie Trees + TOP K
Step one 、trie The tree stores the Prefix suffix
Trie Trees , Dictionary tree , Also known as word search tree or key tree , It's a tree structure , It's a variation of the hash tree . A typical application is to count and sort a large number of strings （ But it's not just about strings ）, So it is often used in text word frequency statistics by search engine system . Its advantages are ： Minimize unnecessary string comparisons , Query efficiency is higher than hash table .
Trie The core idea is space for time . The common prefix of string is used to reduce the cost of query time to improve the efficiency .
It has 3 A basic property ：
 The root node contains no characters , Each node contains only one character except the root node .
 From the root to a node , The characters passing through the path are concatenated , Is the string corresponding to the node .
 All children of each node contain different characters .
1.2、 The construction of trees
analysis ： Of course, this question can be used hash To solve , But this article focuses on trie Trees , Because in some ways it's more useful . For example, for a certain word , We're going to ask if its prefix has ever appeared . such hash It's not easy , While using trie It's still very simple .
Now back to the example , If we use the stupidest method , For every word , We're all going to find out if there's a word in front of it . So the complexity of this algorithm is O(n^2). Obviously for 100000 It's hard to accept the scope of . Now let's change our thinking . Let's say the word I want to query is abcd, So in the words before him , With b,c,d,f I obviously don't have to think about something like that at the beginning . And just look for a Is there... In the beginning abcd That's all right. . alike , In order to a In the beginning of the word , We just have to think about b As the second letter , Narrow the scope and improve the pertinence , The model of such a tree is gradually clear .
It's like assuming that there is b,abc,abd,bcd,abcd,efg,hii this 6 Word , The tree we built is like this ：
ok, As shown in the figure above , For each node , The process from root to root is a word , If the node is marked as red color , It means that the word exists , Otherwise, it doesn't exist .
that , For a word , I just follow him from the root to the corresponding node , See if this node is marked in red to see if it has appeared . Mark this node in red , It's like inserting the word .
”
Step two 、TOP K Algorithm statistics hot words
When each search engine enters a prefix , It will only show 0~10 A candidate word , But if there are many candidates , How to choose , Which are shown in the front , Which are shown in the back ？ This is a question of search popularity .
As described in this title , At this time last year , When I search in the search box “ Beijing ” when , It will prompt with “ Beijing ” Prefixed such as “ Beijing love story ”,“ Beijing bus ”,“ Beijing Hospital ”, And “ Beijing love story ” Show in the first ：
Why input “ Beijing ”, Will first prompt “ Beijing love story ” Well ？ Because this time last year , It is 《 Beijing love story 》 When the TV play was on the air （ Its release date is 2012 year 1 month 8 Japan , Fire for at least a year ）, At that time, we all searched the relevant information of the TV play , When 10 Personal input “ Beijing ” after , Among them is 8 Individuals will continue to type in “ Love story ”（ All in all “ Beijing love story ”） When , Search engines are certainly not indifferent to this .
in other words , Search engines know this time period , Everyone is frantically searching for Beijing love story , Therefore, when the user input with “ Beijing ” When prefixed , Search engines guess that users have 80% The probability is to find “ Beijing love story ”, Therefore, the “ Beijing love story ” Here's a hint , And put it in the first place .
But why search again this time of year “ Beijing ” When , It shows different words ？
The reason is that over time , People are right. 《 Beijing love story 》 The attention of this TV play is decreasing , meanwhile , There are new hot words , Or a new movie , So now it's the same input “ Beijing ”, The words prompted later also changed accordingly . What's the way to solve this problem ？ As I said at the beginning ： Regularly analyze the keywords people search for in a certain period of time , Count the hot words with more searches , Then when the user enters a prefix , Give priority to hot words .
So to put it bluntly , The second step of this problem is to count hot words , We call the method of counting hot words TOP K Algorithm , This is the application scenario of this algorithm http://blog.csdn.net/v_july_v/article/details/7382693 No 2 A question , Again, I quote ：
“ Looking for hot queries ,300 The most popular query string among ten thousand 10 individual Inquire about
The original title is ： The search engine will record all the search strings used by the user every time through the log file , The length of each query string is 1255 byte . Let's say there are 10 million records （ These query strings have high repeatability , Although the total is 1 Ten million , But if the duplicate is removed , No more than 3 Million . The higher the repetition of a query string , It means that the more users query it , That is, the more popular ）, Please count the hottest 10 Query string , It is required that the memory used should not exceed 1G.
answer ： By the above 1 topic , We know , Big data is small , Like 100 million Ip seek Top 10, May first %1000 take ip be assigned to 1000 Go to... In a little file , And guarantee a kind of ip It only appears in one file , Then for each small file ip Conduct hashmap Count and sort by quantity , The last merge or minimum heap processes each small file in turn top10 To get the final result .
But if the data size itself is smaller , It can be loaded into memory at one time ？ For example, No 2 topic , Although there are ten million Query, But because of the high degree of repetition , So in fact, there is only 300 Ten thousand Query, Every Query255Byte, So we can think about putting them all in memory （300 Ten thousand strings assume no repetition , It's the maximum length , So it takes up the most memory 3M*1K/4=0.75G. So all strings can be stored in memory for processing ）, Now it just needs a proper data structure , ad locum ,HashTable It's definitely our priority .
So we give up divide and rule /hash Mapping steps , Go straight up hash Statistics , Then sort .So, For such Typical TOP K problem , The response is often ：hashmap + Pile up . As shown below ：
 hashmap Statistics ： First, preprocess the massive data . The way to do it is ： Maintain a Key by Query String ,Value For the sake of Query The number of times HashTable, namely hash_map(Query,Value), Read one at a time Query, If the string is not in Table in , Then add the string , And will Value Value to 1; If the string is in Table in , Then add one to the count of the string . In the end, we are O(N) Of time complexity Hash The table completes the statistics ;
 Pile up Sort ： The second step 、 With the data structure of heap , find Top K, The time complexity is N‘logK. With the help of the reactor structure , We can do it in log Search and adjust in time of magnitude / Move . therefore , Maintain a K( The topic is 10) A small pile of roots , Then traverse 300 Ten thousand Query, Compare with root element respectively . therefore , Our ultimate time complexity is ：O（N） + N' * O（logK）,（N by 1000 ten thousand ,N’ by 300 ten thousand ）.
Don't forget the idea of heap sorting described in this article ：‘ maintain k The smallest heap of elements , That is, the capacity used is k The smallest heap store of the first to traverse k Number , And assume that they are the biggest k Number , It takes time to build a pile O（k）, And adjust the pile ( Time consuming O（logk）) after , Yes k1>k2>...kmin（kmin Set as the minimum element in the small top heap ）. Continue traversing the sequence , One element at a time x, Compared with the top element , if x>kmin, Then update the heap （x Into the heap , when logk）, Otherwise, the heap will not be updated . So down , It takes time O（k*logk+（nk）*logk）=O（n*logk）. This method benefits from being in the pile , The time complexity of search and other operations is logk.’ The third chapter continued 、Top K The realization of algorithm problem .
Of course , You can also use trie Trees , The number of times the query string appears in the keyword field , There is no such thing as 0. Last use 10 The minimum push of elements to sort the frequency of occurrence .
”
Believe in , such , It is not difficult to understand the method proposed at the beginning ：Trie Trees + TOP K「hashmap+ Pile up ,hashmap+ Pile up The statistics are as follows 10 A similar hot word , in other words , Only the ones that are similar to the key words, such as 10 A hot word 」.
And you can tell your friends later , Why input “ Structure ”, There will be a bunch of hints to “ Structure ” A word with a prefix ：
The method seems to have taken shape , But what are the details that need attention ？ Such as @ Jiangshen _Johnson said ：“ In practice , For example, when the current patch is very short , When there are many candidates , There may be problems with query and sorting performance , Maybe we can add a layer of index trie（ This layer index can only index words whose frequency is higher than a certain threshold , Check this in a short time . If the number is not enough, check and index all the words trie Trees ）; And sometimes it can't be based on query Frequency , To guide users to input more comprehensive information query, Or maybe it's not just prefix matching .”
Extended reading
In addition to the above trie Trees , Trident tree may also be a good solution ：http://igoro.com/archive/efficientautocompletewithaternarysearchtree/. Besides ,StackOverflow There are also two discussion posts on , You can see that ：①http://stackoverflow.com/questions/2901831/algorithmforautocomplete,②http://stackoverflow.com/questions/1783652/whatisthebestautocompletesuggestalgorithmdatastructurecc.
Chapter 37 、 Search for nearby places
Topic details ： Find a point set closest to a given point , meanwhile , Given twodimensional point sets are fixed , There may be many queries , Time complexity O(n) unacceptable , Please design the data structure and the corresponding algorithm .
Topic analysis ： This is a threeway question from Microsoft last year , Like a friend @ Chen Li Ren The question that came out ： Search for nearby places , It is to search for places near users . With GPS And with GPS The popularity of functional mobile devices , Search for nearby places has become hot . Search for places in a huge geographic database , Index is very important . however , Our need is to search nearby places , for example , coordinate (39.91, 116.37) near 500 What kind of restaurant is there in Minai , So let you design , What to do ？
Solution 1 、R Tree 2D search
Suppose that only junior high school mathematics is allowed , Then you may build a XY Coordinate system , That is to coordinate (39.91, 116.37) For the center of a circle , With 500 The length of is radius , Draw a garden , Then one coordinate point to find . This seems to work , But the complexity can be imagined , Even if you think you're smart enough to divide the whole plane into four quadrants , A quadrant by quadrant search , This is not optimized enough , But it also means that you come up with ideas step by step .
That is, the search of different coordinate points , It's an area by area search , relatively speaking , Its average search speed and efficiency will be significantly improved . such , Then I naturally thought about whether there is a data structure that locates in an area at a time ？
If you read the introduction in the blog before R This article of the tree http://blog.csdn.net/v_JULY_v/article/details/6530142#t2 Our readers will immediately realize that ,R Tree is to solve the problem of finding and then reducing the scale of this area . Quote the original directly ：
“R The data structure of the tree
R Trees are B The expansion of trees in high dimensional space , It's a balance tree . Every R The leaf node of the tree contains several pointers to different data , The data can be stored in the hard disk , It can also be stored in memory . according to R This data structure of the tree , When we need to do a high dimensional spatial query , We only need to traverse the pointers contained in a few leaf nodes , Check whether the data pointed by these pointers meet the requirements . This way we don't have to traverse all the data to get the answer , Significant improvement in efficiency . The figure below 1 yes R A simple example of a tree ：
We said that above ,R Trees use the idea of space division , How does this idea come true ？R The tree uses a type called MBR(Minimal Bounding Rectangle) Methods , Here I translate it into “ Minimum bounding rectangle ”. Start with a rectangle from the leaf node （rectangle） Frame the space , Node up , The more space you frame , In order to divide the space . I don't understand ？ No problem , Keep looking down . I would also like to mention ,R In the tree R It should represent Rectangle（ For reference here wikipedia on R Trees Introduction to ）, Not what most domestic textbooks say Region（ A lot of books R Trees are called area trees , This is wrong ）. Let's take twodimensional space as an example . The picture below is Guttman A picture in the paper ：
Let me explain this picture in detail .
 Look at the picture first （b）, First of all, let's assume that all the data are points in twodimensional space , The picture just marks R8 Data in the area , That's the one shape of data object. Don't think of that irregular figure as a piece of data , We think of it as a region of data . In order to achieve R Tree structure , We use a minimum bounding rectangle to frame the irregular region , such , We've constructed an area ：R8.R8 The characteristics are obvious , Just frame all the data in this area .
 Other areas surrounded by solid lines , Such as R9,R10,R12 It's the same thing . thus , We got... Together 12 The most basic smallest rectangle . These rectangles will be stored in the child nodes .
 The next step is to do a higher level of processing . We found that R8,R9,R10 The three rectangles are the closest , So you can use a bigger rectangle R3 Just frame this 3 A rectangle .
 In the same way ,R15,R16 By R6 Just frame ,R11,R12 By R4 Just frame , wait . After all the most basic minimum bounding rectangles are framed into larger rectangles , Iterate again , Frame these rectangles with larger frames .
I think everyone should understand the characteristics of this data structure . Use the example of a map to explain , All the data is the location of the restaurant , First, divide the adjacent restaurants into the same area , After dividing all the restaurants , Then divide the adjacent areas into larger areas , After the division is completed, a higher level division will be carried out again , Until it's divided into only two largest areas . It's convenient to search .
Now we can store these large and small rectangles in our R In the tree . The root node holds the two largest rectangles , These two largest rectangles frame all the remaining rectangles , Of course, it frames all the data . The nodes of the next layer store the next largest rectangle , These rectangles narrow the range . Each leaf node is the smallest rectangle in storage , These rectangles may contain n Data .
Examples of map searches
After the basic data structure , Let's talk about an example , How to query specific data . Take the restaurant as an example , Suppose I want to find out the address of all the restaurants one kilometer away from Tianhe City, Tianhe District, Guangzhou ？
 Open the map （ That is the whole R Trees ）, Choose domestic or foreign first （ That is, the root node ）;
 Then choose South China （ Corresponding to the node of the first layer ）, Choose Guangzhou （ Corresponding to the second layer node ）,
 Then choose Tianhe District （ Corresponding to the third layer nodes ）;
 Finally, choose the area where Tianhe City is located （ Corresponding leaf node , Store the smallest rectangle ）;
Traverse all nodes in this region , To see if we can meet our requirements . What about? , Actually R The search rules of trees are very similar to those of maps ？ Corresponding to the figure below ：
A tree R Trees satisfy the following properties ：
 Unless it's outside the root , All leaf nodes contain m to M A record index （ entry ）. The number of records of leaf node as root node can be less than m. Usually ,m=M/2.
 For all records stored in leaves （ entry ）,I Is the smallest rectangle that can completely cover the points represented by these records in space （ Be careful ： Here say “ rectangular ” It can be extended to highdimensional space ）.
 Every non leaf node has m to M A child node , Unless it's the root .
 For each entry on a non leaf node ,i It is the smallest rectangle that can completely cover the store represented by these items in space （ Of the same nature 2）.
 All leaf nodes are on the same layer , therefore R A tree is a balance tree .
The structure of leaf nodes
Let's first explore the structure of leaf nodes . The data saved by the leaf node is in the form of ：(I, tupleidentifier).
among ,tupleidentifier Represents a... Stored in a database tuple, It's a record , It is n Dimensional .I It's a n The rectangles of dimensional space , And you can just frame all the records in this leaf node n Points in dimensional space .I=(I_{0},I_{1},…,I_{n1}). The structure shown below ：
The following figure describes the information to be stored by leaf nodes in twodimensional space .
In this picture ,I It represents the rectangle in the figure , Its scope is a<=I_{0}<=b,c<=I_{1}<=d. There are two tupleidentifier, These two points are shown in the figure . This form can be extended to high dimensional space . Just think about the threedimensional space . such , The structure of leaf node is introduced .
Non leaf node
The structure of non leaf node is very similar to leaf node . Imagine B The tree knows ,B The leaf node of the tree stores the real data , Instead of leaf nodes, they store the data “ The border ”, Or an index （ Readers with questions can review the first section above B Part of the tree ）.
In the same way ,R The data structure of the non leaf node of the tree is ：(I, childpointer).
among ,childpointer It's a pointer to a child's node ,I It's a rectangle covering all child nodes . It's a bit of a mouthpiece here , But I don't think it's hard to understand ？ Give me a picture ：
D,E,F,G Rectangle corresponding to child node .A For larger rectangles that can cover these rectangles . This A This is the rectangle corresponding to the non leaf node . At this time, you should realize ？ Whether leaf node or non leaf node , They all correspond to a rectangle . The rectangle corresponding to the node at the top of the tree structure can completely cover the rectangle corresponding to its child node . The root node also uniquely corresponds to a rectangle , And this rectangle can cover all the points represented by the data information we have in space .
I personally feel that this picture is not so accurate , It should be a rectangle A To cover exactly D,E,F,G, Instead of leaving so much useless space . But in order to respect the painter of the original , There is no modification .”
but R What's wrong with trees ？ Such as @ Song Xiao _CD said ：“ Simple to use R Trees to index , Search nearby places , It's possible to traverse many branches of the tree . And when maps of the whole country or the whole province , There are many leaf nodes in the tree , The depth of the tree can also be a problem . In general, the location of the nearby nodes （ Two dimensional map point line surface ） Pretreated to page( The size is 4K Multiple ), In these page On the establishment of R The index of the tree .”
Solution 2 、GeoHash Algorithms index geographic location information
When I discussed the problem of searching nearby points with some friends on Weibo , Apart from talking about R Trees , Several friends pointed out that GeoHash The algorithm can solve , That's why I got to know GeoHash Algorithm , This article http://blog.nosqlfan.com/html/1811.html Articulate MongoDB With the help of GeoHash Algorithm to achieve the principle of geographic location index , It's explained by quoting its contents , as follows ：
“ Support for geolocation indexing is MongoDB A highlight of , This is also the most popular LBS service foursquare choice MongoDB One of the reasons . We know , The usual database index structure is B+ Tree, How to turn a geographic location into a buildable B+Tree In the form of . Let's first assume that we divide the whole map that we need to index into 16×16 Of the lattice , Here's the picture （ The lower left corner is the coordinate 0,0 The upper right corner is the coordinate 16,16）：
pure ［x,y］ The data of cannot be indexed , therefore MongoDB When building an index , Will be calculated according to the coordinates of the corresponding field can be used to index hash value , This value is called geohash, Now we take the coordinates on the map as ［4,6］ The point of （ The red fork in the picture ） For example . The first step is to divide the whole map into four equal sized pieces , Here's the picture ：
We can define the values of these four blocks after dividing them into four blocks , as follows （ The bottom left is 00, On the top left is 01, At the bottom right is 10, On the top right is 11）：
db.map.ensureIndex({point : "2d"}, {min : 0, max : 16, bits : 4})
Readers comment on @yuotulck： First of all, thank you for your article , But if it's new （ For example, I ） notice geohash There may be misunderstandings ： Adjacency can be compared by prefixes ？ In fact, this is wrong , For example, the prefix of the adjacent area code of the boundary block is different from the first one , That is to say geohash Close points in hash Values are not necessarily close .
The knowledge above is from ：http://www.cnblogs.com/step1/archive/2009/04/22/1441689.html, and geohash We can learn more about ：
http://tech.idv2.com/2011/07/05/geohashintro/.
End of this chapter .
Reference links and recommended readings
 2012 September, October, 2010 written interview 80 questions ：http://blog.csdn.net/v_july_v/article/details/7974418;
 from Trie Trees （ Dictionary tree ） When it comes to suffix trees ：http://blog.csdn.net/v_july_v/article/details/6897097;
 Teach you how to kill quickly ：99% Mass data processing interview questions ：http://blog.csdn.net/v_july_v/article/details/7382693;
 from B Trees 、B+ Trees 、B* The tree said R Trees ：http://blog.csdn.net/v_july_v/article/details/6530142;
 The illustration MongoDB The principle of geographic location index ：http://blog.nosqlfan.com/html/1811.html;
 《Hbase actual combat 》 The first 8 Chapter 、 stay HBase Query GIS on ;
版权声明
本文为[osc_ o494ayqf]所创，转载请带上原文链接，感谢
https://chowdera.com/2020/12/20201207131536565o.html
边栏推荐
 C++ 数字、string和char*的转换
 C++学习——centos7上部署C++开发环境
 C++学习——一步步学会写Makefile
 C++学习——临时对象的产生与优化
 C++学习——对象的引用的用法
 C++编程经验（6）：使用C++风格的类型转换
 Won the CKA + CKS certificate with the highest gold content in kubernetes in 31 days!
 C + + number, string and char * conversion
 C + + Learning  capacity() and resize() in C + +
 C + + Learning  about code performance optimization
猜你喜欢

C + + programming experience (6): using C + + style type conversion

Latest party and government work report ppt  Park ppt

在线身份证号码提取生日工具

Online ID number extraction birthday tool

️野指针？悬空指针？️ 一文带你搞懂！

Field pointer? Dangling pointer? This article will help you understand!

HCNA Routing＆Switching之GVRP

GVRP of hcna Routing & Switching

Seq2Seq实现闲聊机器人

【闲聊机器人】seq2seq模型的原理
随机推荐
 LeetCode 91. 解码方法
 Seq2seq implements chat robot
 [chat robot] principle of seq2seq model
 Leetcode 91. Decoding method
 HCNA Routing＆Switching之GVRP
 GVRP of hcna Routing & Switching
 HDU7016 Random Walk 2
 [Code+＃1]Yazid 的新生舞会
 CF1548C The Three Little Pigs
 HDU7033 Typing Contest
 HDU7016 Random Walk 2
 [code + 1] Yazid's freshman ball
 CF1548C The Three Little Pigs
 HDU7033 Typing Contest
 Qt Creator 自动补齐变慢的解决
 HALCON 20.11：如何处理标定助手品质问题
 HALCON 20.11：标定助手使用注意事项
 Solution of QT creator's automatic replenishment slowing down
 Halcon 20.11: how to deal with the quality problem of calibration assistant
 Halcon 20.11: precautions for use of calibration assistant
 “十大科学技术问题”揭晓！青年科学家50²论坛
 "Top ten scientific and technological issues" announced Young scientists 50 ² forum
 求反转链表
 Reverse linked list
 js的数据类型
 JS data type
 记一次文件读写遇到的bug
 Remember the bug encountered in reading and writing a file
 单例模式
 Singleton mode
 在这个 N 多编程语言争霸的世界，C++ 究竟还有没有未来？
 In this world of N programming languages, is there a future for C + +?
 es6模板字符
 js Promise
 js 数组方法 回顾
 ES6 template characters
 js Promise
 JS array method review
 【Golang】️走进 Go 语言️ 第一课 Hello World
 [golang] go into go language lesson 1 Hello World