# Natural language processing - BM25 commonly used in search

2020-11-06 01:22:17

BM25 Algorithm is a common formula used to score correlation degree , The idea is simple , The main thing is to calculate a query The relevance of all the words and documents in it , Then I'm adding up the scores , And the relevance score of each word is mainly affected by tf/idf Influence .

BIM（ Binary hypothesis model ） For word features , Just consider whether the word is in doc There has been , Not considering the relevant characteristics of the word itself ,BM25 stay BIM On the basis of the introduction of words in the query weight , Words in doc Weights in , And some empirical parameters , therefore BM25 In practical application, the effect is much better than BIM Model .

## Concrete bm25

bm25 Algorithms are commonly used to calculate query Similar to the relevance of the article . In fact, the principle of this algorithm is very simple , Is what will need to be calculated query Divide words into w1,w2,…,wn, Then find out the relevance of each word and article , Finally, the correlation degree is accumulated , Finally, we can get the result of text similarity calculation . First Wi It means the first one i The weight of a word , We usually use TF-IDF Algorithm to calculate the weight of words, the second term of the formula R(qi,d) That means we query query Every word and article in d The relevance of , This one involves complex operations , Let's take a slow look at . Generally speaking Wi We usually use the inverse text frequency IDF Calculation formula ： In this formula ,N Represents the total number of documents ,n(qi) Indicates the number of articles containing the word , To avoid that the denominator in the logarithm is equal to 0, We add... To the denominator 0.5, This 0.5 It's called the adjustment coefficient , So when n(qi) When I was younger IDF The more it's worth , The greater the weight of the signifier .
Here's a chestnut ：“bm25” This word only appears in a few articles ,n(qi) It will be very small , that “bm25” Of IDF It's worth a lot ;“ We ”,“ yes ”,“ Of ” Such a word , Basically in every article there will be , that n(qi) It's very close to N, therefore IDF Value is very close to 0,

Then let's look at the second term in the formula R(qi,d), Now let's look at the formula for the second term ： In this formula , Generally speaking ,k1、k2 and b It's all regulatory factors ,k1=1、k2=1、b = 0.75,qfi Express qi In the query query Frequency of occurrence in ,fi Express qi In the document d Frequency of occurrence in , Because in general ,qi In the query query Only once will , So the qfi=1 and k2=1 Put it in the formula above , The latter one is equal to 1, In the end, you can get ： Let's see K, In fact, here K The value of is also an abbreviation of the formula , We put K Expand to see ： stay K In the expansion of dl Represents the length of the document ,avg(dl) Represents the average length of the document ,b Is the regulatory factor mentioned earlier , It can be seen from the formula that when the length of the article is fixed than the average length of the article , Regulatory factors b The bigger it is , The greater the weight of the article's length , On the contrary, the smaller . In regulating factors b Fixed time , When the length of the article is larger than the average length of the article , be K The bigger it is ,R(qi,d) The smaller . We put K The expansion of to bm25 Go to... In the formula ： That's all bm25 The flow of the algorithm .
Here is the implementation process ：