# Pattern matching: The gestalt approach, a sequence-based text similarity method

2020-11-06 01:28:06

When reprinting, please credit the original: https://blog.csdn.net/HHTNAN


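Python's `difflib` module compares the similarity of two sequences directly, with no word segmentation required. The strings in the cases below are English renderings of Chinese queries; the scores shown were computed on the original Chinese text, so running the English strings gives different values.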
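
As a rough illustration of why no segmentation step is needed: `SequenceMatcher` accepts any sequences of hashable elements, so the same call works on a raw string (compared character by character) and on a list of words. The English strings below are made up for this sketch and are not taken from the cases in this post.

```python
import difflib

# Hypothetical example strings, chosen only for illustration.
a = "how much does the surgery cost"
b = "what does the surgery cost"

# Character level: each string is treated as a sequence of characters.
char_score = difflib.SequenceMatcher(None, a, b).ratio()

# Word level: the lists of words are compared element by element.
word_score = difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

print(char_score)
print(word_score)  # 4 shared words out of 6 + 5, i.e. 2 * 4 / 11
```

For text without spaces between words, such as Chinese, the character-level call is the one that needs no extra preprocessing.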

Case study 1

```python
import difflib

a = "Do uterine fibroids minimally invasive surgery with how much money"
b = "What does tinea cruris look like? How to treat tinea cruris good?"
print(difflib.SequenceMatcher(None, a, b).ratio())
```

Output:
0.06666666666666667

Case study 2

```python
import difflib

a = "Do uterine fibroids minimally invasive surgery with how much money"
b = "Do uterine fibroids minimally invasive surgery specific costs"
print(difflib.SequenceMatcher(None, a, b).ratio())
```

Output:
0.769230769

Case study 3

```python
import difflib

a = "Do uterine fibroids minimally invasive surgery with how much money"
b = "Specific cost to do uterine fibroids minimally invasive surgery"
print(difflib.SequenceMatcher(None, a, b).ratio())
```

Output:
0.6923076923076923

Case study 4

```python
import difflib

a = "Do uterine fibroids minimally invasive surgery with how much money"
b = "Specific cost of uterine fibroids to do minimally invasive surgery"
print(difflib.SequenceMatcher(None, a, b).ratio())
```

Output:
0.6153846153846154

The cases above show that the algorithm measures sequence similarity only. It ignores the topic and the semantics of the text.
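
A minimal sketch of this point, using two made-up strings that contain the same words in a different order. Only the contiguous block "apple" is matched, so the score drops well below 1 even though the meaning is unchanged.

```python
import difflib

# Hypothetical strings: same words, different order.
a = "red apple"
b = "apple red"

same = difflib.SequenceMatcher(None, a, a).ratio()
reordered = difflib.SequenceMatcher(None, a, b).ratio()

print(same)       # 1.0, identical sequences
print(reordered)  # 2 * 5 / 18: only "apple" (5 characters) is matched
```

The block "red" is not counted because, once "apple" has been chosen as the longest match, nothing remains to its left in `b` or to its right in `a`.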

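The score returned by the algorithm is twice the number of matched characters divided by the total number of characters in the two strings, that is, 2 × M / T, where M is the number of matched characters and T is the combined length of both strings. In the original description the score is returned as an integer reflecting the percentage match; `difflib.SequenceMatcher.ratio()` returns a float between 0 and 1 instead.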
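
The formula can be checked against `difflib` itself. The sketch below recomputes the score from the matching blocks, writing M for the number of matched characters and T for the combined length (labels borrowed from the `difflib` documentation). The helper name `ratio_by_hand` is made up for this example.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute 2 * M / T from the matching blocks reported by difflib."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each block is (start in a, start in b, size). The final block is a
    # zero-length sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

a = "abcd"
b = "bcde"
print(ratio_by_hand(a, b))                          # 0.75, from 2 * 3 / 8
print(difflib.SequenceMatcher(None, a, b).ratio())  # 0.75
```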

My current guess at how the score is calculated is as follows.

When the matching parts do not line up exactly, as in case 3, the score is 9/13, where 9 is the length of the longest common substring and 13 is the string length (equivalently 2 × 9 / 26). Case 4 gives 8/13, which can be read as (4 + 4)/13. So the question is why case 2, whose longest common substring is also 9, scores higher. Presumably one more character that agrees in position adds 1, so the result is (9 + 1)/13.

These conjectures are based on experiments only. They have not been verified and are not authoritative; I will find the paper, read it, and tidy this up later. (Note that the matching process works from the characters of B.)
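
One way to probe this guess is to inspect what `difflib` actually matches. According to the `difflib` documentation, the matcher finds the longest contiguous matching block and then applies the same idea recursively to the pieces on its left and right. The sketch below follows that description; `matched_chars` is a helper written for this example, and the two ASCII strings are hypothetical stand-ins built so that the longest block has length 9 and one further character matches, as conjectured for case 2.

```python
import difflib

def matched_chars(matcher, alo, ahi, blo, bhi):
    """Count matches: longest common block first, then recurse on both sides."""
    m = matcher.find_longest_match(alo, ahi, blo, bhi)
    if m.size == 0:
        return 0
    left = matched_chars(matcher, alo, m.a, blo, m.b)
    right = matched_chars(matcher, m.a + m.size, ahi, m.b + m.size, bhi)
    return left + m.size + right

# Hypothetical 13-character strings: a shared block of 9, then one shared "Q".
a = "ABCDEFGHIxyzQ"
b = "ABCDEFGHIuvwQ"
matcher = difflib.SequenceMatcher(None, a, b)

longest = matcher.find_longest_match(0, len(a), 0, len(b))
total = matched_chars(matcher, 0, len(a), 0, len(b))

print(longest)          # Match(a=0, b=0, size=9)
print(total)            # 10, the block of 9 plus the trailing "Q"
print(matcher.ratio())  # 2 * 10 / 26, about 0.769
```

Under these assumptions the extra 1 comes from a second matching block found to the right of the longest one, not from a separate positional bonus.
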

Case study 5

```python
import difflib

a = "10 Anemia in a month old baby"
b = "10 A month old baby has nosebleed"
print(difflib.SequenceMatcher(None, a, b).ratio())
```

Output:
0.8235294117647058

Here 7 characters match, and the original strings have lengths 8 and 9:

2 × 7 / (len(a) + len(b)) = 14 / (8 + 9) = 0.8235294117647058


References:

[1] https://docs.python.org/2/library/difflib.html
[2] https://pymotw.com/2/difflib/
[3] http://blog.chinaunix.net/uid-20780364-id-538761.html
[4] https://docs.python.org/3.5/library/difflib.html
[5] http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
[6] https://blog.csdn.net/gavin_john/article/details/78951698