当前位置:网站首页>Complete understanding of KMP from beginning to end (August 22, 2014)

Complete understanding of KMP from beginning to end (August 22, 2014)

2020-12-07 13:29:24 osc_ chra83mb

To understand thoroughly from beginning to end KMP

 

author :July
Time : Originally written in 2011 year 12 month ,2014 year 7 month 21 Friday night 10 spot Delete all and rewrite this article , Over the next half month, I kept improving . Later included in new book The method of programming : Interview and algorithm experience 》 The first 4.4 In the festival .

 

1. introduction

    Ben KMP The original text was originally written in 2 More than years ago 2011 year 12 month , Because of the first contact KMP, Confusion in thinking leads to confusion in writing . So I always wanted to find a chance to write down again KMP, But it's been hard for me KMP The understanding of is always not enough , Therefore, we have not revised this article for a long time .

    But recently, because of the opening of Algorithm class , The class is devoted to data structure 、 interview 、 Algorithm , I went back to this again KMP, In the comprehensive understanding of some netizens 、 And Cao Bo, two lecturer friends in the algorithm class 、 After Zou Bo's understanding , Yes 9 Zhang PPT, On Weibo . And then , In for a penny, in for a pound , Simply will PPT The contents of the above are arranged in this paper ( Later, the article became more complete , The contents are no longer nine PPT That's easy ).

    KMP It's not complicated , But most of the articles on the Internet ( Including this article 2011 Year version ) It's confusing . below , Let's start with the violence matching algorithm , And then it goes on to explain KMP The process of step 、next Simple solution of array Recurrence principle Code solving , Then based on the next Array matching , When it comes to finite state automata ,next Optimization of arrays ,KMP Time complexity analysis of , Finally, we will briefly introduce two KMP The extended algorithm of .

    This paper tries to give you the most complete and clear KMP, Hope more people will not be KMP Torment or entanglement , No longer confused by some confusing articles . Any questions , Feel free to leave a comment ,thanks.

 

2. Violent matching algorithm

    Suppose we are faced with such a problem now : There's a string of text S, And a pattern string P, Now look for P stay S Position in , How to find it ?

    If you use the idea of violence matching , And suppose that the text string now S Match to i Location , Pattern string P Match to j Location , Then there are :

  • If the current character matches successfully ( namely S[i] == P[j]), be i++,j++, Continue to match the next character ;
  • If it doesn't match ( namely S[i]! = P[j]), Make i = i - (j - 1),j = 0. It's equivalent to every time a match fails ,i to flash back ,j Be set to 0.

    The flow and internal logic of violence matching algorithm are clarified , We can write violence matching code , as follows :

int ViolentMatch(char* s, char* p)
{
	int sLen = strlen(s);
	int pLen = strlen(p);

	int i = 0;
	int j = 0;
	while (i < sLen && j < pLen)
	{
		if (s[i] == p[j])
		{
			//① If the current character matches successfully ( namely S[i] == P[j]), be i++,j++    
			i++;
			j++;
		}
		else
		{
			//② If it doesn't match ( namely S[i]! = P[j]), Make i = i - (j - 1),j = 0    
			i = i - j + 1;
			j = 0;
		}
	}
	// The match is successful , Return pattern string p In the text string s Position in , Otherwise return to -1
	if (j == pLen)
		return i - j;
	else
		return -1;
}

    for instance , If a text string is given S“BBC ABCDAB ABCDABCDABDE”, And mode string P“ABCDABD”, Now take the pattern string P To follow the text string S matching , The whole process is as follows :

    1. S[0] by B,P[0] by A, Mismatch , Execution section ② Orders :“ If it doesn't match ( namely S[i]! = P[j]), Make i = i - (j - 1),j = 0”,S[1] Follow P[0] matching , This is equivalent to the mode string moving one bit to the right (i=1,j=0)

    2. S[1] Follow P[0] It still doesn't match , Continue with the ② Orders :“ If it doesn't match ( namely S[i]! = P[j]), Make i = i - (j - 1),j = 0”,S[2] Follow P[0] matching (i=2,j=0), So the pattern string moves one bit to the right ( Constantly executing “ Make i = i - (j - 1),j = 0”,i from 2 Change to 4,j Always for 0)

    3. until S[4] Follow P[0] The match is successful (i=4,j=0), At this point, according to the idea of the above violence matching algorithm , Turn to section ① Orders :“ If the current character matches successfully ( namely S[i] == P[j]), be i++,j++”, Available S[i] by S[5],P[j] by P[1], And then S[5] Follow P[1] matching (i=5,j=1)

 

    4. S[5] Follow P[1] The match is successful , Continue with the ① Orders :“ If the current character matches successfully ( namely S[i] == P[j]), be i++,j++”, obtain S[6] Follow P[2] matching (i=6,j=2), Go on like this

 

    5. until S[10] For the space character ,P[6] Character D(i=10,j=6), Because it doesn't match , Reexecution of ② Orders :“ If it doesn't match ( namely S[i]! = P[j]), Make i = i - (j - 1),j = 0”, amount to S[5] Follow P[0] matching (i=5,j=0)

 

    6. thus , We can see , If we follow the idea of violence matching algorithm , Although the text string and pattern string have been matched to S[9]、P[5], But because S[10] Follow P[6] Mismatch , So the text string goes back to S[5], Pattern strings go back to P[0], So that S[5] Follow P[0] matching .

    and S[5] Definitely with P[0] Mismatch . Why? ? Because before the first 4 Step matching , We have learned that S[5] = P[1] = B, and P[0] = A, namely P[1] != P[0], so S[5] It must not be equal to P[0], So looking back on the past is bound to lead to mismatches . Is there an algorithm , Give Way i Don't go back , Just move j Then ?

    The answer is yes . This algorithm is the main idea of this paper KMP Algorithm , It uses the valid information that has been partially matched before , keep i Don't go back , By modifying the j The location of , Let the pattern string move as far as possible to a valid position .

 

3. KMP Algorithm

3.1 Definition

    Knuth-Morris-Pratt String search algorithm , Referred to as “KMP Algorithm ”, Often used in a text string S Find a pattern string inside P Where is it , This algorithm is based on Donald Knuth、Vaughan Pratt、James H. Morris Three people in 1977 Published jointly in , So take this 3 People's surname naming this algorithm .

    Let's start with KMP Algorithm flow of ( If you feel a little uncomfortable , No problem , Hold on to it , There will be specific steps and explanations later , The more you look back, the better you'll see ):

  • Let's say that now the text string S Match to i Location , Pattern string P Match to j Location
    • If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++, Continue to match the next character ;
    • If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order i unchanged ,j = next[j]. This means mismatching , Pattern string P Relative to the text string S Moved right j - next [j] position .
      • In other words , When the match fails , The number of bits that the pattern string moves to the right is : The location of the mismatch character - The mismatch character corresponds to next value (next The solution of the array will be in the following 3.3.3 section In detail ), namely The actual number of digits moved is :j - next[j], And this value is greater than or equal to 1.

    Soon , You will also realize that next The meaning of the array values : Represents the string before the current character , What is the length of the same Prefix suffix . For example, if next [j] = k, representative j The length of the string before is the largest k The same Prefix suffix of .

    This also means that when a character is mismatched , This character corresponds to next The value will tell you that the next matching step is , Where should the pattern string jump to ( Jump to the next [j] The location of ). If next [j] be equal to 0 or -1, Jump to the beginning of the pattern string , if next [j] = k And k > 0, Represents that the next match jumps to j A character before , Instead of jumping to the beginning , And specifically skip k Characters .

    Convert to code to represent , It is :

int KmpSearch(char* s, char* p)
{
	int i = 0;
	int j = 0;
	int sLen = strlen(s);
	int pLen = strlen(p);
	while (i < sLen && j < pLen)
	{
		//① If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++    
		if (j == -1 || s[i] == p[j])
		{
			i++;
			j++;
		}
		else
		{
			//② If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order  i  unchanged ,j = next[j]    
			//next[j] That is to say j The corresponding next value       
			j = next[j];
		}
	}
	if (j == pLen)
		return i - j;
	else
		return -1;
}

    Let's continue with the previous example , When S[10] Follow P[6] When the match fails ,KMP It's not as simple as violence matching to move the pattern string one bit to the right , It's about carrying out the first ② Orders :“ If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order i unchanged ,j = next[j]”, namely j from 6 Change to 2( Later we will seek P[6], That is, the characters D Corresponding next The value is 2), So the number of bits that the pattern string moves to the right is j - next[j](j - next[j] = 6-2 = 4).

    To the right 4 Behind you ,S[10] Follow P[2] Continue matching . Why move to the right 4 Who? , Because mobile 4 Behind you , There's another... In the pattern string “AB” You can keep up with S[8]S[9] Corresponding , So you don't have to let i to flash back . It's equivalent to removing characters D Look for the same prefix and suffix in the substring of the pattern string , And then according to the prefix and suffix next Array , Based on the next Array to match ( No concern next How to get the array , Just want to see what the matching process looks like , You can jump directly to the following 3.3.4 section ).

3.2 step

  • ① Find Prefix suffix longest common element length
    • about P = p0 p1 ...pj-1 pj, Looking for pattern strings P The largest and equal prefix and suffix in . If there is p0 p1 ...pk-1 pk = pj- k pj-k+1...pj-1 pj, So in contain pj The maximum length of the pattern string is k+1 The same Prefix suffix of . for instance , If the given pattern string is “abab”, Then the maximum length of the common element of the Prefix suffix of each substring is shown in the table below :

For example, for Strings aba Come on , It has a length of 1 The same Prefix suffix of a; And for Strings abab Come on , It has a length of 2 The same Prefix suffix of ab( The length of the same Prefix suffix is k + 1,k + 1 = 2).

  • ② seek next Array
    • next Array considers the Longest Prefix suffix except the current character , So through section ① Step to get the maximum length of the common element of each Prefix suffix , Just a little bit of deformation : Will be the first ① The value obtained in the step is shifted to the right as a whole , Then the initial value is assigned as -1, As shown in the table below :

For example, for aba Come on , The first 3 Characters a Previous string ab There is a length of 0 The same Prefix suffix of , So the first 3 Characters a Corresponding next The value is 0; And for abab Come on , The first 4 Characters b Previous string aba There is a length of 1 The same Prefix suffix of a, So the first 4 Characters b Corresponding next The value is 1( The length of the same Prefix suffix is k,k = 1).

  • ③ according to next Array to match
    • Match mismatch ,j = next [j], The number of bits that the pattern string moves to the right is :j - next[j]. In other words , When the suffix of the pattern string pj-k pj-k+1, ..., pj-1 With the text string si-k si-k+1, ..., si-1 The match is successful , but pj Follow si When the match fails , because next[j] = k, Equivalent to the It doesn't contain pj The maximum length of the pattern string is k The same Prefix suffix of , namely p0 p1 ...pk-1 = pj-k pj-k+1...pj-1, Order j = next[j], To move the pattern string to the right j - next[j] position , Make the prefix of the pattern string p0 p1, ..., pk-1 Corresponding to the text string si-k si-k+1, ..., si-1, And then let pk Follow si Continue matching . As shown in the figure below :

    Sum up ,KMP Of next An array is equivalent to telling us : When a character in a pattern string matches a character in a text string , Which position should the pattern string jump to next . As in the pattern string j The character at and the text string at i When the character matching mismatch at , The next step is to use next [j] The character at continues to follow the text string i Where the characters match , This is equivalent to the pattern string moving to the right j - next[j] position .

    Next , Explain the above in detail 3 A step .

 

3.3 explain

3.3.1 Find the Longest Prefix suffix

    If the given pattern string is :“ABCDABD”, Traverse the entire pattern string from left to right , The prefixes and suffixes of each substring are shown in the table below :

    in other words , The maximum length table of the common elements of each prefix and suffix corresponding to the substring of the original pattern string is ( The abbreviation 《 Maximum length table 》):

 

3.3.2 be based on 《 Maximum length table 》 matching

    Because there may be repeated characters at the beginning and end of the pattern string , Therefore, the following conclusions can be drawn :

When it's mismatched , The number of bits that the pattern string moves to the right is : Number of matched characters - The maximum length of the last character of the mismatch character

    below , Let's combine the previous 《 Maximum length table 》 And the above conclusion , Match strings . If a text string is given “BBC ABCDAB ABCDABCDABDE”, And mode string “ABCDABD”, Now we're going to match pattern strings with text strings , As shown in the figure below :

 

  • 1. Because the characters in the pattern string A With the characters in the text string B、B、C、 The spaces don't match at first , So don't think about the conclusion , Just move the pattern string to the right one bit continuously , Until the character in the pattern string A With the first of the text string 5 Characters A The match is successful :

  • 2. Continue to match back , When the last character of the pattern string D Mismatch when matching with text string , Obvious , The pattern string needs to move to the right . But how many bits to the right ? Because the number of characters that have been matched is 6 individual (ABCDAB), And then according to 《 Maximum length table 》 Can get mismatch characters D The last character of B The corresponding length value is 2, So according to the previous conclusion , You need to move right 6 - 2 = 4 position .

  • 3. The mode string moves to the right 4 Behind you , Find out C It's a mismatch again , Because it's already matched 2 Characters (AB), And the last character B The corresponding maximum length is 0, So move to the right :2 - 0 =2 position .

  • 4. A Mismatch with spaces , To the right 1 position .

  • 5. Continue to compare , Find out D And C Mismatch , So the number of digits moving to the right is : Number of characters matched 6 Subtract the last character B The corresponding maximum length 2, That is to move right 6 - 2 = 4 position .

  • 6. Experience No 5 Step after , The match was found to be successful , End of the process .

    From the above matching process, we can see that , The key to the problem is to find the same prefix and suffix with the largest length in the pattern string , After finding the maximum length of the common part of the prefix and suffix before each character in the pattern string , You can match based on this . And this maximum length is just next The meaning of the array .

3.3.3 according to 《 Maximum length table 》 seek next Array

    From above , We already know , character string “ABCDABD” The maximum common element length of each Prefix suffix is :

    and , According to this table, the following conclusions can be drawn

  • When it's mismatched , The number of bits that the pattern string moves to the right is : Matched Number of characters - The maximum length of the last character of the mismatch character

    When using this table to match the conclusion above , We found that , When matching to a character mismatch , In fact, there is no need to consider the current mismatched characters , What's more, every time we're mismatched , It is the maximum length corresponding to the last character of the mismatch character . such , It leads to next Array .

    Given string “ABCDABD”, We can get it next The array is as follows :

    hold next After the array is compared with the maximum length table obtained before , It's not hard to find out ,next An array is equivalent to “ Maximum length value ” Move one bit to the right as a whole , Then the initial value is assigned to -1. Aware of this , You will exclaim at the original next The solution of array is so simple : Is to find the Prefix suffix with the maximum symmetric length , Then move the whole right one bit , The initial value is assigned to -1( Of course , You can also directly Calculate the corresponding of a character next value , It is to see how long the same Prefix suffix is in the string before this character ).

    In other words , For a given pattern string :ABCDABD, And its maximum length next The arrays are as follows :

    According to the maximum length table next After array , Thus there are

When it's mismatched , The number of bits that the pattern string moves to the right is : Where is the mismatch character Location  - The mismatch character corresponds to next value

    Then , You'll find that , Whether it's based on 《 Maximum length table 》 The matching of , Or based on next Array matching , The number of digits to move to the right is the same . Why? ? because :

  • according to 《 Maximum length table 》, When it's mismatched , The number of bits the pattern string moves to the right = The number of characters that have been matched - The maximum length of the last character of the mismatch character
  • According to the 《next Array 》, When it's mismatched , The number of bits the pattern string moves to the right = The position of the mismatch character - The mismatch character corresponds to next value
    • among , from 0 When you start counting , The position of the mismatch character = The number of characters that have been matched ( Mismatch characters are not counted ), And the mismatch character corresponds to next value =  The maximum length of the last character of the mismatch character , Two phase comparison , The result must be exactly the same .

    therefore , You can take 《 Maximum length table 》 As a next The prototype of an array , Even think of it as next Arrays are OK, too , The difference is just how to use it .

3.3.4 Recursively by code next Array

    Next , Let's write the code and ask for it next Array .

    Based on previous understanding , We can see the calculation next The method of array can be recursion :

  • 1. If For value k, existing p0 p1, ..., pk-1 = pj-k pj-k+1, ..., pj-1, amount to next[j] = k.
    • What does this mean ? Its essence ,next[j] = k representative p[j] In the previous pattern string substring , There is a length of k The same prefix and suffix of . With this next Array , stay KMP Matching , When the pattern string j When the character mismatch at , The next step is to use next[j] The characters at continue to match the text string , This is equivalent to the pattern string moving to the right j - next[j] position .

for instance , Here's the picture , According to the pattern string “ABCDABD” Of next Array to know the mismatched characters D Corresponding next The value is 2, For the character D The length of the front is 2 The same prefix and suffix of ( The same Prefix suffix is “AB”), After mismatch , The pattern string needs to move to the right j - next [j] = 6 - 2 =4 position .

To the right 4 Behind you , The character in the pattern string C Continue to match the text string .

  • 2. The following question is : It is known that next [0, ..., j], How to find out next [j + 1] Well ?

    about P Before j+1 Sequence characters :

  • if p[k] == p[j], be next[j + 1 ] = next [j] + 1 = k + 1;
  • if p[k ] ≠ p[j], If at this time p[ next[k] ] == p[j ], be next[ j + 1 ] =  next[k] + 1, Otherwise continue to recurse prefix index k = next[k], And then repeat the process .  It's equivalent to the character p[j+1] There was no such thing as k+1 The prefix of "p0 p1, …, pk-1 pk" With the suffix “pj-k pj-k+1, …, pj-1 pj" equal , So is there another value t+1 < k+1, Prefix that makes the length smaller “p0 p1, …, pt-1 pt” Is equal to the suffix of smaller length “pj-t pj-t+1, …, pj-1 pj” Well ? If there is , So this t+1 That is next[ j+1] Value , This is equivalent to using what has been obtained next Array (next [0, ..., k, ..., j]) Conduct P String prefix followed by P String suffix matching .

    A general article or textbook may have mentioned this , But most beginners may not be able to understand the above solution well next The principle of arrays , So next , Let me highlight again .

    As shown in the figure below , Suppose a pattern string is given ABCDABCE, And known next [j] = k( amount to “p0 pk-1” = “pj-k pj-1” = AB, It can be seen that k by 2), Current requirements next [j + 1] How much ? because pk = pj = C, therefore next[j + 1] = next[j] + 1 = k + 1( It can be seen that next[j + 1] = 3). For the character E In the previous pattern string , There's a length k+1 The same Prefix suffix of .

    but If pk != pj Well ? explain “p0 pk-1 pk”  ≠ “pj-k pj-1 pj”. In other words , When pk != pj after , character E What is the length of the same Prefix suffix before ? Obviously , because C differ D, therefore ABC Follow ABD inequality , That is, the characters E The previous pattern string has no length of k+1 The same Prefix suffix of , You can't simply make :next[j + 1] = next[j] + 1 . therefore , We can only look for a shorter Prefix suffix .

    Combined with the picture above , If you can In prefix “ p0 pk-1 pk ” The recursive prefix index in the k = next [k], Find a character pk’ Also for the D, representative pk’ = pj, And meet p0 pk'-1 pk' = pj-k' pj-1 pj, Then the maximum length of the same Prefix suffix is k' + 1, thus next [j + 1] = k’ + 1 = next [k' ] + 1. Otherwise, there is no in the prefix D, It means that there is no same Prefix suffix ,next [j + 1] = 0.

    that Why recursive prefix index k = next[k], You can find the same Prefix suffix with shorter length ? It comes down to next The meaning of array . Let's take the prefix p0 pk-1 pk Go with the suffix pj-k pj-1 pj matching , If pk Follow pj Mismatch , The next step is to use p[next[k]] Go with pj Continue matching , If p[ next[k] ] Follow pj It still doesn't match , You need to Look for the same Prefix suffix with shorter length , The next step is to use p[ next[ next[k] ] ] Go with pj matching . This process is equivalent to self matching of pattern strings , So it's recursive k = next[k], Until you either find the same Prefix suffix with a shorter length , Either there is no shorter Prefix suffix . As shown in the figure below :

    About k = next [k] This problem , I'd like to add some comments from two readers of this article / Add :

  1. readers wudehua55555 Leave a message in the comments of this article , To help you understand from another angle :“  I always thought that bloggers were using recursion to find next The array is not clear , Why use k = next[k], After a careful look at the red, yellow and blue zoning map, I suddenly realized , Is to find p[k] Corresponding next[k], According to symmetry , Just judge again p[next[k]] And p[j] Equal or not , So k = next[k], The idea of recursion is just used here . In fact, I don't think we should fall into the recursive method in the beginning , Another way of thinking , Start with symmetry , It can be concluded directly that k = next[k], And that's just recursion . These are some personal views , Thank you very much for your analysis , Non computer students can also understand , Although from last night 9 Point to see. Now . happy .”
  2. Another reader OnlyotDN Some notes are specially made on the basis of the above figure , For your reference :

    therefore , Because eventually in the prefix ABC Not found in D, so E Of next The value is 0:

Suffix of pattern string :AB DE
Prefix of pattern string :AB C
The prefix is shifted two bits to the right :      ABC

    Read this , Some readers may have questions again , Can you name one character that can be found in the prefix D What about the example of ?OK, Let's look at a character that can be found in a prefix D Example , As shown in the figure below :

    Given the pattern string DABCDABDE, We got the characters very smoothly D Previous “DABCDAB” The maximum length of the same prefix and suffix is 0 0 0 0 1 2 3, But when traversing to the character D, The requirements include D Inside “DABCDABD” When the Longest Prefix suffix is the same , We found that pj The character at D Follow pk The character at C Dissimilarity , In other words , Prefix DABC Last character of C With the suffix DABD Last character of D inequality , So there's no such thing as a length of 4 The same Prefix suffix of .

    What shall I do? ? Since there is no length of 4 The same Prefix suffix of , We can look for the same Prefix suffix with shorter length , Final , Because of p0 There's also a character D,p0 = pj, therefore p[j] The corresponding length value is 1, amount to E Corresponding next The value is 1( That is, the characters E Previous string “DABCDABD” There is a length of 1 The same prefix and suffix of ).

    Sum up , It can be obtained by recursion next Array , The code is as follows :

void GetNext(char* p,int next[])
{
	int pLen = strlen(p);
	next[0] = -1;
	int k = -1;
	int j = 0;
	while (j < pLen - 1)
	{
		//p[k] Indicates prefix ,p[j] The suffix 
		if (k == -1 || p[j] == p[k]) 
		{
			++k;
			++j;
			next[j] = k;
		}
		else 
		{
			k = next[k];
		}
	}
}

    Recalculate with code “ABCDABD” Of next Array , In order to verify before passing “ The longest suffix length of the same prefix is shifted to the right by one bit , Then the initial value is assigned as -1” Got next Whether the array is correct , The calculation results are shown in the table below :

    It can be seen from the above table that , Whether it was passed before “ The longest suffix length of the same prefix is shifted to the right by one bit , Then the initial value is assigned as -1” Got next Array , Or through the code Recursive Calculation of next Array , The results are exactly the same .

3.3.5 be based on 《next Array 》 matching

    below , We are based on next Array to match .

 

    Or given Text string “BBC ABCDAB ABCDABCDABDE”, And mode string “ABCDABD”, Now we're going to match pattern strings with text strings , As shown in the figure below :

    Before the official match , Let's review the above again 2.1 Section KMP The matching process of the algorithm :

  • Let's say that now the text string S Match to i Location , Pattern string P Match to j Location
    • If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++, Continue to match the next character ;
    • If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order i unchanged ,j = next[j]. This means mismatching , Pattern string P Relative to the text string S Moved right j - next [j] position .
      • In other words , When the match fails , The number of bits that the pattern string moves to the right is : The location of the mismatch character - The mismatch character corresponds to next value , namely The actual number of digits moved is :j - next[j], And this value is greater than or equal to 1.
  • 1. At the beginning of the match
    • P[0] Follow S[0] Matching failure
      • So execute “ If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order i unchanged ,j = next[j]”, therefore j = -1, So it turns to execution “ If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++”, obtain i = 1,j = 0, namely P[0] Keep following S[1] matching .
    • P[0] Follow S[1] And it doesn't match ,j Again, it's equal to -1,i、j Keep increasing , thus P[0] Follow S[2] matching .
    • P[0] Follow S[2] After mismatch ,P[0] Follow again S[3] matching .
    • P[0] Follow S[3] And then the mismatch , until P[0] Follow S[4] The match is successful , Start executing the second half of this instruction :“ If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++”.
  • 2. P[1] Follow S[5] The match is successful ,P[2] Follow S[6] Match success, too , ..., Until it matches P[6] The character at D Time mismatch ( namely S[10] != P[6]), because P[6] Situated D Corresponding next The value is 2, So the next step is to use P[2] The character at C Keep following S[10] matching , It's like moving to the right :j - next[j] = 6 - 2 =4 position .

  • 3. To the right 4 Behind you ,P[2] Situated C Again, it's mismatched , because C Corresponding next The value is 0, So the next step is to use P[0] The character at continues to follow S[10] matching , It's like moving to the right :j - next[j] = 2 - 0 = 2 position .

  • 4. After moving two bits ,A It doesn't match the space , The pattern string moves back 1 position .

  • 5. P[6] Situated D Again, it's mismatched , because P[6] Corresponding next The value is 2, So the next step is to use P[2] Continue to match the text string , This is equivalent to the pattern string moving to the right j - next[j] = 6 - 2 = 4 position .
  • 6. The match is successful , End of the process .

    The matching process is as like as two peas . It also proves from the side ,next The array does simply shift the length value of the common element of each maximum Prefix suffix by one bit to the right , And assign the initial value to -1 that will do .

3.3.6 be based on 《 Maximum length table 》 And based on 《next Array 》 Equivalent

    We already know , utilize next Array matching mismatch , The mode string moves to the right j - next [ j ] position , Equivalent to the number of matched characters  - The maximum length of the last character of the mismatch character . as a result of :

  1. j from 0 Start counting , So when you count to the mismatch character ,j Is the number of characters that have been matched ;
  2. because next The array is moved one bit to the right by the maximum length table ( And the initial value is assigned as -1) Got , Then the maximum length corresponding to the last character of the mismatch character , Is the current mismatch character next value .

    But why doesn't this article use next Array matching ? because next Array is not easy to find , The maximum length of the common element of the Prefix suffix of a string is easy to find . For example, if a pattern string is given “ababa”, I want you to figure it out quickly next Array , At first glance , Each time the corresponding character of next When the value of , You have to exclude the character , Then look at the maximum length of the same Prefix suffix in the string before the character , This process is not direct enough . And if you ask you to find the maximum length of the Prefix suffix public element , It's easy to get a direct result :0 0 1 2 3, As shown in the table below :

    Then this 5 A digital All of them move one bit to the right , And the initial value is assigned as -1, That is to say, to get it next Array :-1 0 0 1 2.

3.3.7 Next Arrays and finite state automata

    next Responsible for moving the pattern string forward , And when it comes to j When bits don't match , Use the first next[j] Bit and main string matching , It's like a punch “ surface ”. Besides ,next It can also be regarded as the state of finite state automata , In the case of how many characters have been read , After mismatch , The first few characters are useful .

3.3.8 Next Optimization of arrays

   Writing at this point , We have a comprehensive understanding of the idea of violence matching 、KMP Principle of algorithm 、 technological process 、 The internal logical connection between processes , as well as next Simple solution of array (《 Maximum length table 》 Move the whole right one bit , Then the initial value is assigned as -1) And code solving , Based on the 《next Array 》 The matching of , It looks like a sea of water , Clear and clear , But the above ignores a small problem .

    such as , If you use the previous next Array method to find pattern string “abab” Of next Array , You can get it next The array is -1 0 0 1(0 0 1 2 Move the whole right one bit , The initial value is assigned to -1), When it matches the text string in the figure below , Find out b Follow c Mismatch , So the pattern string shifts to the right j - next[j] = 3 - 1 =2 position .

    Move right 2 Behind you ,b Follow again c Mismatch . in fact , Because in the matching of the previous step , We have learned p[3] = b, And s[3] = c Mismatch , And after moving two bits to the right , Give Way p[ next[3] ] = p[1] = b Follow again s[3] When the match , There must be a mismatch . What's the problem ?

   

    The problem is that it shouldn't have happened p[j] = p[ next[j] ]. Why? ? The reason is that : When p[j] != s[i] when , The next match must be p[ next [j]] Follow s[i] matching , If p[j] = p[ next[j] ], It will inevitably lead to the failure of the next matching step ( because p[j] Already following s[i] Mismatch , And then you have to follow p[j] Equivalent value p[next[j]] Go with s[i] matching , Obviously , There must be a mismatch ), therefore You can't allow p[j] = p[ next[j ]]. If it does p[j] = p[ next[j] ] Do how? ? If it does , You need to recurse again , But even next[j] = next[ next[j] ].

    therefore , We have to revise our request next Array code .

// Optimized next  Array method 
void GetNextval(char* p, int next[])
{
	int pLen = strlen(p);
	next[0] = -1;
	int k = -1;
	int j = 0;
	while (j < pLen - 1)
	{
		//p[k] Indicates prefix ,p[j] The suffix   
		if (k == -1 || p[j] == p[k])
		{
			++j;
			++k;
			// Than before next Array method , The changes are below 4 That's ok 
			if (p[j] != p[k])
				next[j] = k;   // This was the only line before 
			else
				// Because there can't be p[j] = p[ next[j ]], So when it appears, you need to continue recursion ,k = next[k] = next[next[k]]
				next[j] = next[k];
		}
		else
		{
			k = next[k];
		}
	}
}

    Use optimized next Array method , The pattern string “abab” The new next The array is :-1 0 -1 0. Maybe some readers will ask : original next Array is Prefix suffix longest common element length value shift right one bit , Then the initial value is assigned as -1 And get , So the optimized next How can arrays be quickly worked out by heart ? actually , As long as we find the original next Array , We can base it on the original next Array quickly find the optimized next Array . Or to abab For example , As shown in the table below :

    

    As long as it appears p[next[j]] = p[j] The situation of , Then put next[j] The value of is recursive again . For example, in the pattern string “abab” Of the 2 individual a Of next When the value of , If it's not optimized next If it's worth it , The first 2 individual a Corresponding next The value is 0, It's equivalent to 2 individual a When it's mismatched , The next step is to match the pattern string with p[0] Situated a Match the text string again , There must be a mismatch . So, please 2 individual a Of next When the value of , Need to recurse again :next[2] = next[ next[2] ] = next[0] = -1( thereafter , According to the optimized new next It's worth knowing , The first 2 individual a When it's mismatched , perform “ If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++, Continue to match the next character ”), Empathy , The first 2 individual b Corresponding next The value is 0.

For optimized next Array can find a little bit : If the suffix of the pattern string is the same as the prefix , So their next The values are the same , For example, mode string abcabc, It has prefixes and suffixes abc, Its optimized next The array is :-1 0 0 -1 0 0, Prefix suffix abc Of next Values are -1 0 0.

    And then quote before 3.1 Chaste KMP Code :

int KmpSearch(char* s, char* p)
{
	int i = 0;
	int j = 0;
	int sLen = strlen(s);
	int pLen = strlen(p);
	while (i < sLen && j < pLen)
	{
		//① If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++    
		if (j == -1 || s[i] == p[j])
		{
			i++;
			j++;
		}
		else
		{
			//② If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order  i  unchanged ,j = next[j]    
			//next[j] That is to say j The corresponding next value       
			j = next[j];
		}
	}
	if (j == pLen)
		return i - j;
	else
		return -1;
}

    Next , Let's continue with the previous example , The whole matching process is as follows :

    1. S[3] And P[3] Matching failure .

    2. S[3] remain unchanged ,P The next matching position of is P[next[3]], and next[3]=0, therefore P[next[3]]=P[0] And S[3] matching .

    3.   Because in the last step P[0] And S[3] It still doesn't match . here i=3,j=next [0]=-1, Because the conditions are met j==-1, So execute “++i, ++j”, That is, the main string pointer moves down one position ,P[0] And S[4] Begin to match . Last j==pLen, Out of the loop , Output results i - j = 4( This is where the pattern string first appears in the text string ), The match is successful , The algorithm ends .

   

3.4 KMP Time complexity analysis of

    I believe that most readers read the above , I've found out that I understand KMP Very easy to , It's nothing more than to grasp the following points step by step :

  1. If the same prefix and suffix exist in the pattern string , namely pj-k pj-k+1, ..., pj-1 = p0 p1, ..., pk-1, So in pj Follow si After mismatch , Let the prefix of the pattern string p0 p1...pk-1 Corresponding to the text string si-k si-k+1...si-1, And then let pk Follow si Continue matching .
  2. It should have been pj Follow si matching , The result is a mismatch , After mismatch , Make pk Follow si matching , amount to j Turned into k, The mode string moves to the right j - k position .
  3. because k The value of is variable , So we use next[j] Express j After a character mismatch at , The next match pattern string should jump to . In other words , Before the mismatch, it was j,pj Follow si When it's mismatched , use p[ next[j] ] Keep following si matching , amount to j Turned into next[j], therefore ,j = next[j], It's equivalent to moving the pattern string to the right j - next [j] position .
  4. and next[j] How much should it be equal to ?next[j] The value of is determined by j The length of the same Prefix suffix in the previous pattern string substring is determined by , If j In the previous pattern string substring ( Not included j) There is a maximum length of k The same Prefix suffix of , that next [j] = k.

    As shown in the previous figure :

    Next , Let's analyze KMP Time complexity of . Before the analysis , Let's review KMP The flow of the matching algorithm :

KMP Algorithm flow of :

  • Let's say that now the text string S Match to i Location , Pattern string P Match to j Location
    • If j = -1, Or the current character matches successfully ( namely S[i] == P[j]), Du Ling i++,j++, Continue to match the next character ;
    • If j != -1, And the current character matching failed ( namely S[i] != P[j]), Then order i unchanged ,j = next[j]. This means mismatching , Pattern string P Relative to the text string S Moved right j - next [j] position .”

    We found that if a character matches successfully , The position of the first character of the pattern string remains unchanged , just i++、j++; If the match doesn't match ,i unchanged ( namely i Don't go back ), The pattern string will skip the matched next [j] Characters . The worst-case scenario for the whole algorithm is , When the first character of the pattern string is at i - j Only when the position is matched successfully , The algorithm ends .
    therefore , If the length of the text string is n, The length of the pattern string is m, Then the time complexity of the matching process is O(n), Count it next Of O(m) Time ,KMP The overall time complexity of is O(m + n).

 

4. Expand 1:BM Algorithm

    KMP Matching starts at the beginning of the pattern string , and 1977 year , The University of Texas Robert S. Boyer Professor and J Strother Moore Professor invented a new string matching algorithm :Boyer-Moore Algorithm , abbreviation BM Algorithm . The algorithm starts from the tail of pattern string , And have at worst O(N) Time complexity of . In practice , Than KMP The actual efficiency of the algorithm is high .

    BM The algorithm defines two rules :

  • Bad character rules : When a character in a text string does not match a character in a pattern string , We call this mismatch character in the text string a bad character , At this point, the pattern string needs to move to the right , The number of digits moved = Position of bad character in pattern string - The right most position of a bad character in the pattern string . Besides , If " Bad characters " Not included in the pattern string , Then the position on the far right is -1.
  • Good Suffix : When the characters are mismatched , Move the number backward = The position of the good suffix in the pattern string - The first time a good suffix appears on the pattern string , And if the good suffix doesn't reappear in the pattern string , Then for -1.

    Here's an example BM Algorithm . for example , Given the text string “HERE IS A SIMPLE EXAMPLE”, And mode string “EXAMPLE”, Now we want to find whether the pattern string is in the text string , If there is , Returns the position of the pattern string in a text string .

    1.  First ," Text string " And " Pattern string " Head alignment , Start with the tail ."S" And "E" Mismatch . At this time ,"S" This is called " Bad characters "(bad character), That is, mismatched characters , It corresponds to the first... Of the pattern string 6 position . And "S" Not included in pattern string "EXAMPLE" In ( It's equivalent to the position on the right -1), This means that the pattern string can be moved back 6-(-1)=7 position , And move directly to "S" The last one of them .

    2.  Still starting from the tail , Find out "P" And "E" Mismatch , therefore "P" yes " Bad characters ". however ,"P" Included in the pattern string "EXAMPLE" In . because “P” This “ Bad characters ” Corresponding to the first of the pattern string 6 position ( from 0 Numbered starting ), And the rightmost position in the pattern string is 4, therefore , Move the pattern string back 6-4=2 position , Two "P" alignment .

    3.  Compare in turn , obtain “MPLE” matching , be called " good-suffix "(good suffix), That is, all the strings that match the tails . Be careful ,"MPLE"、"PLE"、"LE"、"E" All good suffixes .

 

    4.  Find out “I” And “A” Mismatch :“I” Bad characters . If it's based on the bad character rule , At this point, the pattern string should be moved back 2-(-1)=3 position . The problem is , Is there a better way to move ?

    5. A better method is to make good use of suffix rules : When the characters are mismatched , Move the number backward = The position of the good suffix in the pattern string - The last time a good suffix appears in the pattern string , And if the good suffix doesn't reappear in the pattern string , Then for -1.
    be-all “ good-suffix ”(MPLE、PLE、LE、E) In , Only “E” stay “EXAMPLE” The head of , So move back 6-0=6 position .
    It can be seen that ,“ Bad character rules ” You can only move 3 position ,“ Good Suffix ” You can move 6 position . Move the larger of the two rules each time . The number of moving bits for these two rules , It's only about pattern strings , It has nothing to do with the original string .

    6.  Continue to compare from the tail ,“P” And “E” Mismatch , therefore “P” yes “ Bad characters ”, according to “ Bad character rules ”, Move backward 6 - 4 = 2 position . Because it's the last one that doesn't match , Not getting a good suffix .

    It can be seen from the above that ,BM The algorithm is not only efficient , And it's ingenious , Easy to understand .

 

5. Expand 2:Sunday Algorithm

    In this paper , We've already introduced it KMP Algorithm and BM Algorithm , Both algorithms have linear search times in the worst case . But actually ,KMP The algorithm is no better than the simplest c Library function strstr() How fast , and BM Although the algorithm is usually better than KMP Fast algorithm , but BM The algorithm is not the fastest algorithm among the existing string search algorithms , At the end of this paper, we introduce a kind of comparison BM Algorithm faster search algorithm, that is Sunday Algorithm .

    Sunday Algorithm by Daniel M.Sunday stay 1990 in , Its thought follows BM The algorithm is very similar :

  • It's just Sunday The algorithm matches from front to back , When matching fails, we focus on the next character of the last character in the string .
    • If the character does not appear in the pattern string, it will skip , That is, the number of moving digits = Match string length + 1;
    • otherwise , The number of moving bits  = The distance from the rightmost character to the end of the pattern string +1.

    Here is an example to illustrate Sunday Algorithm . Suppose you're going to be in the text string now "substring searching algorithm" Find pattern string in "search".

    1. At the beginning , Align the pattern string to the left of the text string :
substring searching algorithm
search
^
    2. It turns out that in the 2 A mismatch was found at characters , When not matching, pay attention to the next character of the last character in the text string , That is, bold characters i, Because pattern strings search Does not exist in i, So the pattern string just jumps over a big piece , Move the number of digits to the right = Match string length + 1 = 6 + 1 = 7, from i The character after that ( That is, the characters n) Start the next match , Here's the picture :



substring searching algorithm
    search
    ^
    3. It turns out that the first character doesn't match , Let's look at the next character of the last character in the text string , yes 'r', It appears at the bottom of the pattern string 3 position , So move the pattern string to the right 3 position (r The distance to the end of the pattern string + 1 = 2 + 1 =3), Make two 'r' alignment , as follows :
substring searching algorithm
      search
       ^





    4. The match is successful .

    Review the whole process , We only move the pattern string twice to find the matching position , Due to Sunday The amount of movement in each step of the algorithm is relatively large , It's very efficient . End .

 

6. reference

  1. 《 Introduction to algorithms 》 Chapter 12 : string matching ;
  2. The pattern string in this article “ABCDABD” Part of the graph is from this article :http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html;
  3. this paper 3.3.7 Section of the finite state automata diagram by micro blog users @ Gong Lu'an draw :http://d.pr/i/NEiz;
  4. Beijing 7 Zou Bo works half an hour in the summer vacation KMP video :http://www.julyedu.com/video/play/id/5;
  5. Beijing 7 In the second class of Zou Bo in the summer vacation in May PPT:http://yun.baidu.com/s/1mgFmw7u;
  6. understand KMP Of 9 Zhang PPT:http://weibo.com/1580904460/BeCCYrKz3#_rnd1405957424876;
  7. Detailed explanation KMP Algorithm ( More pictures ):http://www.cnblogs.com/yjiyjige/p/3263858.html;
  8. This article No 4 Part of the BM The algorithm is referred to in this paper :http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html;
  9. http://youlvconglin.blog.163.com/blog/static/5232042010530101020857;
  10. 《 data structure The second edition 》, YanWeiMin & Edited by WU Weimin ;
  11. http://blog.csdn.net/v_JULY_v/article/details/6545192;
  12. http://blog.csdn.net/v_JULY_v/article/details/6111565;
  13. Sunday The principle and implementation of the algorithm :http://blog.chinaunix.net/uid-22237530-id-1781825.html;
  14. Pattern matching Sunday Algorithm :http://blog.csdn.net/sunnianzhong/article/details/8820123;
  15. A piece of KMP The English introduction of :http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm;
  16. I 2014 year 9 month 3 Interview at Xi'an University of Electronic Science and technology on June & Algorithm lecture video ( The first 36 minute ~ The first 94 In minutes KMP):http://www.julyedu.com/video/play/21;
  17. A picture to understand KMP next How to find an array :http://v.atob.site/kmp-next.html

 

7. Postscript    

    I apologize for the confusion of the previous articles to the majority of readers , I'm glad that the new article will bring clarity to readers . I hope most beginners , Even a small number of non computer professional readers can understand this article . Any questions , You are welcome to criticize and correct at any time ,thanks.

    July、 At 9:00 p.m. on August 22, 2014 .

版权声明
本文为[osc_ chra83mb]所创,转载请带上原文链接,感谢
https://chowdera.com/2020/12/20201207132421109f.html