当前位置：网站首页>Natural language processing  BM25 commonly used in search
Natural language processing  BM25 commonly used in search
20201106 01:22:17 【Elementary school students in IT field】
BM25 Algorithm is a common formula used to score correlation degree , The idea is simple , The main thing is to calculate a query The relevance of all the words and documents in it , Then I'm adding up the scores , And the relevance score of each word is mainly affected by tf/idf Influence .
About Bim
BIM（ Binary hypothesis model ） For word features , Just consider whether the word is in doc There has been , Not considering the relevant characteristics of the word itself ,BM25 stay BIM On the basis of the introduction of words in the query weight , Words in doc Weights in , And some empirical parameters , therefore BM25 In practical application, the effect is much better than BIM Model .
Concrete bm25
bm25 Algorithms are commonly used to calculate query Similar to the relevance of the article . In fact, the principle of this algorithm is very simple , Is what will need to be calculated query Divide words into w1,w2,…,wn, Then find out the relevance of each word and article , Finally, the correlation degree is accumulated , Finally, we can get the result of text similarity calculation .
First Wi It means the first one i The weight of a word , We usually use TFIDF Algorithm to calculate the weight of words, the second term of the formula R(qi,d) That means we query query Every word and article in d The relevance of , This one involves complex operations , Let's take a slow look at . Generally speaking Wi We usually use the inverse text frequency IDF Calculation formula ：
In this formula ,N Represents the total number of documents ,n(qi) Indicates the number of articles containing the word , To avoid that the denominator in the logarithm is equal to 0, We add... To the denominator 0.5, This 0.5 It's called the adjustment coefficient , So when n(qi) When I was younger IDF The more it's worth , The greater the weight of the signifier .
Here's a chestnut ：“bm25” This word only appears in a few articles ,n(qi) It will be very small , that “bm25” Of IDF It's worth a lot ;“ We ”,“ yes ”,“ Of ” Such a word , Basically in every article there will be , that n(qi) It's very close to N, therefore IDF Value is very close to 0,
Then let's look at the second term in the formula R(qi,d), Now let's look at the formula for the second term ：
In this formula , Generally speaking ,k1、k2 and b It's all regulatory factors ,k1=1、k2=1、b = 0.75,qfi Express qi In the query query Frequency of occurrence in ,fi Express qi In the document d Frequency of occurrence in , Because in general ,qi In the query query Only once will , So the qfi=1 and k2=1 Put it in the formula above , The latter one is equal to 1, In the end, you can get ：
Let's see K, In fact, here K The value of is also an abbreviation of the formula , We put K Expand to see ：
stay K In the expansion of dl Represents the length of the document ,avg(dl) Represents the average length of the document ,b Is the regulatory factor mentioned earlier , It can be seen from the formula that when the length of the article is fixed than the average length of the article , Regulatory factors b The bigger it is , The greater the weight of the article's length , On the contrary, the smaller . In regulating factors b Fixed time , When the length of the article is larger than the average length of the article , be K The bigger it is ,R(qi,d) The smaller . We put K The expansion of to bm25 Go to... In the formula ：
That's all bm25 The flow of the algorithm .
Here is the implementation process ：
版权声明
本文为[Elementary school students in IT field]所创，转载请带上原文链接，感谢
边栏推荐
 C++ 数字、string和char*的转换
 C++学习——centos7上部署C++开发环境
 C++学习——一步步学会写Makefile
 C++学习——临时对象的产生与优化
 C++学习——对象的引用的用法
 C++编程经验（6）：使用C++风格的类型转换
 Won the CKA + CKS certificate with the highest gold content in kubernetes in 31 days!
 C + + number, string and char * conversion
 C + + Learning  capacity() and resize() in C + +
 C + + Learning  about code performance optimization
猜你喜欢

C + + programming experience (6): using C + + style type conversion

Latest party and government work report ppt  Park ppt

在线身份证号码提取生日工具

Online ID number extraction birthday tool

️野指针？悬空指针？️ 一文带你搞懂！

Field pointer? Dangling pointer? This article will help you understand!

HCNA Routing＆Switching之GVRP

GVRP of hcna Routing & Switching

Seq2Seq实现闲聊机器人

【闲聊机器人】seq2seq模型的原理
随机推荐
 LeetCode 91. 解码方法
 Seq2seq implements chat robot
 [chat robot] principle of seq2seq model
 Leetcode 91. Decoding method
 HCNA Routing＆Switching之GVRP
 GVRP of hcna Routing & Switching
 HDU7016 Random Walk 2
 [Code+＃1]Yazid 的新生舞会
 CF1548C The Three Little Pigs
 HDU7033 Typing Contest
 HDU7016 Random Walk 2
 [code + 1] Yazid's freshman ball
 CF1548C The Three Little Pigs
 HDU7033 Typing Contest
 Qt Creator 自动补齐变慢的解决
 HALCON 20.11：如何处理标定助手品质问题
 HALCON 20.11：标定助手使用注意事项
 Solution of QT creator's automatic replenishment slowing down
 Halcon 20.11: how to deal with the quality problem of calibration assistant
 Halcon 20.11: precautions for use of calibration assistant
 “十大科学技术问题”揭晓！青年科学家50²论坛
 "Top ten scientific and technological issues" announced Young scientists 50 ² forum
 求反转链表
 Reverse linked list
 js的数据类型
 JS data type
 记一次文件读写遇到的bug
 Remember the bug encountered in reading and writing a file
 单例模式
 Singleton mode
 在这个 N 多编程语言争霸的世界，C++ 究竟还有没有未来？
 In this world of N programming languages, is there a future for C + +?
 es6模板字符
 js Promise
 js 数组方法 回顾
 ES6 template characters
 js Promise
 JS array method review
 【Golang】️走进 Go 语言️ 第一课 Hello World
 [golang] go into go language lesson 1 Hello World