# Regular expressions explained: understand them once and for all

2020-12-07 09:03:57

Regular expressions are a hurdle that no programmer can avoid, and also one of the most easily overlooked skills. I suspect most programmers use regular expressions at work the way I used to: search Google, copy a pattern, and tweak it. Regular expressions feel like a tool that is hard to master and hard to use well. Recently I used them heavily in a web crawler project, so I went through a lot of material, studied regular expressions systematically, and wrote up this summary to share.

A regular expression is a technique that uses a single string to describe and match a whole set of strings that follow a given syntactic rule. In many text editors, regular expressions are used to find and replace text that matches a pattern. In programming projects, they are also used to search, replace, split, and extract text.

Almost every programming language supports regular expressions, but the rules and definitions differ between languages. The discussion in this article is based on Python.

One. Five categories of metacharacters

Metacharacters are characters that carry a special meaning inside a regular expression, and they are its basic building blocks. A regular expression is built from metacharacters; for example, `\d` stands for any digit from 0 to 9, and `\w` stands for any letter, digit, or underscore. There are many metacharacters, and they fall into the following categories:

1. Special single characters: the dot `.` matches any character except a newline, `\d` matches any single digit, `\w` matches any letter, digit, or underscore, and `\s` matches any whitespace character. The uppercase forms `\D`, `\W`, and `\S` are the negations of the corresponding lowercase forms.

2. Whitespace characters: besides the special single characters, text processing constantly runs into spaces, line breaks, and similar characters. You already use these when writing code: `\n` for a newline and `\t` for a tab.

`\s` matches all of the whitespace characters above, including the space.
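As a small illustration of these character classes, the sketch below runs `re.findall` on a made-up sample string. The string and the variable names are assumptions chosen only for this example.

```python
import re

# Hypothetical sample text containing letters, digits, an underscore,
# a space, a tab, and a newline.
sample = "id_7 ok\t42\n"

digits = re.findall(r"\d", sample)       # every single digit
word_chars = re.findall(r"\w", sample)   # letters, digits, underscore
spaces = re.findall(r"\s", sample)       # space, tab, newline
non_spaces = re.findall(r"\S", sample)   # negation of \s

print(digits)      # ['7', '4', '2']
print(word_chars)  # ['i', 'd', '_', '7', 'o', 'k', '4', '2']
print(spaces)      # [' ', '\t', '\n']
```

Here `\S` returns the same characters as `\w` because the sample happens to contain no punctuation.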

3. Quantifiers: special single characters and whitespace characters each match only one character, but real matching rules often say "exactly so many times", "at least so many times", or "at most so many times". That is what quantifiers are for. There are six quantifier metacharacters: `*` means zero or more times, `+` means one or more times, `?` means zero or one time, `{m}` means exactly m times, `{m,}` means at least m times, and `{m,n}` means from m to n times.
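To make the six quantifiers concrete, here is a minimal sketch that checks a few candidate strings against each one. The helper `matches` and the candidate list are hypothetical and exist only for this illustration; `re.fullmatch` is used so that a candidate counts only when the whole string fits the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["a", "ab", "abb", "abbb", "abbbb"]

star = matches(r"ab*", candidates)              # b zero or more times
plus = matches(r"ab+", candidates)              # b one or more times
optional = matches(r"ab?", candidates)          # b zero or one time
exactly_two = matches(r"ab{2}", candidates)     # b exactly 2 times
at_least_two = matches(r"ab{2,}", candidates)   # b at least 2 times
two_to_three = matches(r"ab{2,3}", candidates)  # b 2 to 3 times

print(star)          # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)          # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)      # ['a', 'ab']
print(exactly_two)   # ['abb']
print(at_least_two)  # ['abb', 'abbb', 'abbbb']
print(two_to_three)  # ['abb', 'abbb']
```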

4. Ranges: all the metacharacters so far match a single character or repetitions of a single character. Sometimes you need to match one of several whole strings (such as `good` or `well`) or one of several different single characters (such as the vowels a, e, i, o, u). This is where range metacharacters come in, and there are four main kinds. The `|` operator means "or": either of the alternatives on its two sides may match, so `ab|bc` matches both `ab` and `bc` in a text. Square brackets `[...]` pick one of several characters: they match any single character listed inside, so any vowel can be written as `[aeiou]`. A hyphen inside the brackets denotes a range, so `[a-z]` matches any letter from a to z. If the first character inside the brackets is a caret (`^`), the set is negated and matches any single character that is not listed.
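The four range forms can be tried out as follows. This is only a sketch; the sample strings are made up for the illustration.

```python
import re

# | : either alternative may match
alternation = re.findall(r"good|well", "well done, good job")

# [...] : any single character listed inside
vowels = re.findall(r"[aeiou]", "regex")

# [a-z] : any single character in the range
lowercase = re.findall(r"[a-z]", "aB3c")

# [^...] : any single character NOT listed inside
not_vowels = re.findall(r"[^aeiou]", "regex")

print(alternation)  # ['well', 'good']
print(vowels)       # ['e', 'e']
print(lowercase)    # ['a', 'c']
print(not_vowels)   # ['r', 'g', 'x']
```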

5. Assertions: assertions, also called anchors, constrain where a match may occur. For example, suppose we want to replace the string `tom` with `jerry`. Without an assertion, the first three characters of `tomorrow` are replaced too, which is clearly wrong. There are three main kinds of assertions.

```
Before replacement: tom asked me if I would go fishing with him tomorrow.
After replacement:  jerry asked me if I would go fishing with him jerryorrow.
```

5.1 Word boundaries: `\b` marks the boundary of a word.

Using this assertion, the replacement above can be done correctly:

```python
import re

text = "tom asked me if I would go fishing with him tomorrow."
# \b keeps "tom" from matching inside "tomorrow"
print(re.sub(r"\btom\b", "jerry", text))
# After replacement: jerry asked me if I would go fishing with him tomorrow.
```

5.2 Start and end of line: there are two pairs of anchors for the start and the end, `^` and `$`, and `\A` and `\Z`. They differ in multiline mode (introduced later): the former pair matches the start and end of every line, while the latter pair always matches the start and end of the entire string.
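The difference between the two pairs of anchors can be seen in a short sketch. The two-line sample string is an assumption made for this illustration, and Python's `re.MULTILINE` flag is used to switch on multiline mode.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ anchors at the start of the string and $ at its end
# (in Python, $ also matches just before a final trailing newline).
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# Multiline mode: ^ and $ anchor at the start and end of every line.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z ignore multiline mode and always refer to the whole string.
starts_absolute = re.findall(r"\A\w+", text, re.MULTILINE)
ends_absolute = re.findall(r"\w+\Z", text, re.MULTILINE)

print(starts_default, ends_default)      # ['first'] ['line']
print(starts_multiline, ends_multiline)  # ['first', 'second'] ['line', 'line']
print(starts_absolute, ends_absolute)    # ['first'] ['line']
```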

5.3 Look-around, also called zero-width assertions: some scenarios require constraints on what appears to the left and right of the text being matched, and this is what look-around is for. For example, when matching 11-digit phone numbers with a plain pattern, the first 11 digits of a 12-digit number also match, and so do the first 11 digits of a 22-digit number. Look-around solves this by requiring that neither the character to the left nor the character to the right is a digit.

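The look-around syntax may look complicated at first, but the essence is simple: a left angle bracket means looking to the left (look-behind), no angle bracket means looking to the right (look-ahead), and an exclamation mark means negation. That gives four forms: `(?<=Y)` and `(?<!Y)` require that the text to the left does or does not match Y, and `(?=Y)` and `(?!Y)` require that the text to the right does or does not match Y.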
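The phone-number case can be sketched as follows. The sample text and both numbers in it are invented for the illustration; the guarded pattern is one possible way to express "no digit on either side", not necessarily the exact pattern the author used.

```python
import re

# Hypothetical text: one real 11-digit number and one longer 15-digit ID.
text = "call 13800138000 or ref 138001380001234"

# Plain pattern: also picks up the first 11 digits of the longer number.
plain = re.findall(r"\d{11}", text)

# Guarded pattern: no digit allowed immediately to the left or right.
guarded = re.findall(r"(?<!\d)\d{11}(?!\d)", text)

print(plain)    # ['13800138000', '13800138000']
print(guarded)  # ['13800138000']
```

Because look-around is zero-width, the neighbouring characters are checked but never become part of the match.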

Two. Two modes of quantifier matching

The previous part introduced six quantifier metacharacters. In fact, the `{m,n}` form can fully replace `*`, `+`, and `?`. Because `+` and `*` can match an unbounded number of times, we need to introduce greedy and non-greedy matching. Let's start with an example.

In the example, the result for `+` is easy to understand, while `*` additionally matches four empty strings, because `*` matches zero or more times. Two modes exist for quantifiers that can match without an upper bound. Greedy mode, the default for regular expressions, matches as much as possible. Non-greedy mode matches as little as possible; to use it, add a `?` after `+` or `*`.
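A minimal sketch of the two modes, using a made-up string with two quoted words (this is not the article's original example):

```python
import re

text = '"alpha" and "beta"'

# Greedy: .+ grabs as much as possible, so the match runs from the
# first quote all the way to the last one.
greedy = re.findall(r'".+"', text)

# Non-greedy: .+? grabs as little as possible, so each quoted word
# is matched separately.
lazy = re.findall(r'".+?"', text)

print(greedy)  # ['"alpha" and "beta"']
print(lazy)    # ['"alpha"', '"beta"']
```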

Three. Four matching modes

A matching mode is a way of changing how the metacharacters themselves behave when matching. There are four kinds:

1. Case-insensitive mode, `(?i)`: for example, `(?i)a` matches both `a` and `A`. Python also provides the `re.IGNORECASE` flag for ignoring case.

2. Dot-matches-all mode, also called single-line mode, `(?s)`: by default the dot matches any character except a newline. In this mode the dot matches any character at all, which makes it equivalent to `[\w\W]`.

3. Multiline mode, `(?m)`: normally `^` matches the start of the entire string and `$` matches the end of the entire string. Multiline mode changes the behavior of `^` and `$`, as mentioned in the section on assertions, so that they match the start and end of every line. This is useful in log analysis, for example to pick out every log line that starts with a timestamp.

4. Comment mode, `(?#comment)`: this lets you add comments inside the regular expression, which makes later maintenance and review easier.
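The four modes can be tried out in one short sketch. All sample strings, including the log lines, are invented for the illustration.

```python
import re

# 1. Case-insensitive mode: inline (?i) or the re.IGNORECASE flag.
ignore_inline = re.findall(r"(?i)cat", "Cat CAT cat")
ignore_flag = re.findall(r"cat", "Cat CAT cat", re.IGNORECASE)

# 2. Dot-matches-all mode: without (?s) the dot stops at a newline.
dot_default = re.findall(r"<.+>", "<a\nb>")
dot_all = re.findall(r"(?s)<.+>", "<a\nb>")

# 3. Multiline mode: ^ matches at the start of every line, which picks
#    out the log lines that begin with a date.
log = "2020-12-07 09:03:57 start\n  detail line\n2020-12-08 10:00:00 stop"
line_dates = re.findall(r"(?m)^\d{4}-\d{2}-\d{2}", log)

# 4. Comment mode: (?#...) is ignored by the regex engine.
commented = re.findall(r"\d{4}(?#year)-\d{2}(?#month)", "released 2020-12")

print(ignore_inline)  # ['Cat', 'CAT', 'cat']
print(dot_default)    # []
print(dot_all)        # ['<a\nb>']
print(line_dates)     # ['2020-12-07', '2020-12-08']
print(commented)      # ['2020-12']
```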

Four. The role of parentheses

Parentheses `()` are arguably the most frequently used and most powerful symbols in regular expressions, and how well a regex works often depends on how they are used. What exactly do they do?

Everything except group references has already been covered above, so this part focuses on group references.

1. Grouping and numbering

Parentheses group parts of a regular expression, and the subexpression enclosed in each pair is saved as a subgroup. What are the numbering rules? Simply put, the n-th pair of parentheses is group n. An example makes this clearer. Take the timestamp 2020-05-10 20:23:05 and suppose we want a regular expression that extracts the date and the time from it.
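One way to write this extraction is sketched below. The pattern is an assumption, since the article's original pattern did not survive; it simply puts one pair of parentheses around the date and one around the time.

```python
import re

stamp = "2020-05-10 20:23:05"

# Group 1 is the date, group 2 is the time.
match = re.search(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})", stamp)

whole = match.group(0)      # group 0 is always the entire match
date_part = match.group(1)
time_part = match.group(2)

print(whole)           # 2020-05-10 20:23:05
print(date_part)       # 2020-05-10
print(time_part)       # 20:23:05
print(match.groups())  # ('2020-05-10', '20:23:05')
```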

2. Non-capturing groups

Whatever is enclosed in parentheses is saved as a subgroup. In some cases, though, you only want parentheses to treat a part as a whole and never refer to it later, so there is no need to save the subgroup. Writing `?:` right after the opening parenthesis, as in `(?:...)`, tells the engine not to save it. When a regex contains plain parentheses, the engine assumes the subexpression may be referenced later and stores it, so skipping that can improve performance. With fewer subgroups it is also harder to miscount group numbers. In short, a non-capturing group uses parentheses only for grouping: the part is treated as a single element, gets no number, and cannot be referenced later.
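The effect on group numbering can be seen in a small sketch; the date string and patterns are chosen only for illustration.

```python
import re

stamp = "2020-05-10"

# Three capturing groups: year, month, day are all saved.
capturing = re.search(r"(\d{4})-(\d{2})-(\d{2})", stamp).groups()

# Year and month are grouped but not saved, so the day becomes group 1.
non_capturing = re.search(r"(?:\d{4})-(?:\d{2})-(\d{2})", stamp).groups()

print(capturing)      # ('2020', '05', '10')
print(non_capturing)  # ('10',)
```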

3. Nested parentheses

That covers subgroups and numbering, but some cases are more complicated. With nested parentheses, how do we tell which group number a given pair has? The method is simple: count the opening (left) parentheses from left to right. The position of a pair's opening parenthesis in that count is its group number.
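Applied to a timestamp, counting opening parentheses works out as in this sketch. The nested pattern is an assumption written for the illustration.

```python
import re

stamp = "2020-05-10 20:23:05"

# Opening parentheses, counted left to right:
#   1: whole date   2: year   3: month   4: day
#   5: whole time   6: hour   7: minute  8: second
pattern = r"((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))"
match = re.search(pattern, stamp)

print(match.group(1))  # 2020-05-10
print(match.group(2))  # 2020
print(match.group(5))  # 20:23:05
print(match.group(8))  # 05
```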

4. Named groups

Group numbers depend on the position of the parentheses, so if we later fix the regex and change the number of parentheses, the numbers change as well. Some programming languages therefore provide named groups, which are easier to recognize than numbers and less error-prone. In Python the format is `(?P<name>regex)`.
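A short sketch of named groups in Python; the group names `date` and `time` are chosen here for illustration.

```python
import re

stamp = "2020-05-10 20:23:05"

pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})"
match = re.search(pattern, stamp)

# Groups can be read by name; the numbers still work as well.
named_date = match.group("date")
named_time = match.group("time")
as_dict = match.groupdict()

print(named_date)  # 2020-05-10
print(named_time)  # 20:23:05
print(as_dict)     # {'date': '2020-05-10', 'time': '20:23:05'}
```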

5. Backreferences

Once you know a group's number, in Python you can refer to it with a backslash followed by the number, that is, `\number` (for example `\1`).
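A common use of backreferences is finding and removing accidentally doubled words. The sentence below is made up for this sketch.

```python
import re

text = "the the cat sat on on the mat"

# \1 must repeat exactly what group 1 matched.
doubled = re.findall(r"\b(\w+) \1\b", text)

# In the replacement string, \1 inserts the text of group 1.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(doubled)  # ['the', 'on']
print(fixed)    # the cat sat on the mat
```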

Five. Four types of regex escaping scenarios

1. The escape character: the backslash is the escape character in Python. It changes the meaning of the character that follows it.

2. String escaping and regex escaping

Probably every programming tutorial mentions that matching a literal backslash with a regular expression takes four backslashes, yet many programmers may not know why. The input string goes through two rounds of escape processing before it becomes the final regular expression: first string escaping, then regex escaping. You can avoid the string escaping step with the `r` prefix, that is, by writing the pattern as a raw string.
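The two rounds of escaping can be observed directly. In this sketch the path string is a made-up example that contains exactly one backslash.

```python
import re

# The text itself is C:\temp (a single backslash, written as \\ in source).
path = "C:\\temp"

# Ordinary string: four backslashes in source become two after string
# escaping, and the regex engine reads those two as one literal backslash.
non_raw = re.findall("\\\\", path)

# Raw string: string escaping is skipped, so two backslashes are enough.
raw = re.findall(r"\\", path)

print(len("\\\\"))  # 2
print(non_raw == raw)  # True
print(len(non_raw))    # 1
```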

3. Escaping metacharacters in a regex

If we want to find a literal asterisk (`*`), plus sign (`+`), or question mark (`?`) rather than use its metacharacter function, we need to escape it by putting a backslash in front of it.
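A brief sketch of escaping metacharacters by hand. It also shows the standard-library helper `re.escape`, which the article does not mention and which is included here only as a convenient alternative when the literal text is not known in advance.

```python
import re

# Escaped by hand: each metacharacter now stands for itself.
stars = re.findall(r"\*", "2*3*4")
pluses = re.findall(r"\+", "1+1=2")
questions = re.findall(r"\?", "why? how?")

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
escaped_pattern = re.escape(literal)
found = re.fullmatch(escaped_pattern, literal) is not None

print(stars)      # ['*', '*']
print(pluses)     # ['+']
print(questions)  # ['?', '?']
print(found)      # True
```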

4. Escaping brackets

For square brackets `[]` and curly braces `{}`, escaping only the opening bracket is enough, but for parentheses `()` both sides must be escaped.

```python
>>> import re
>>> re.findall(r'\(\)\[\]\{\}', '()[]{}')
['()[]{}']
>>> re.findall(r'\(\)\[]\{}', '()[]{}')  # for square and curly brackets, escaping only the opening one is enough
['()[]{}']
```

5. Escaping inside a character group

The above covers escaping metacharacters in general. Inside a character group, three cases need escaping:

5.1 A caret inside the brackets, in the first position, needs to be escaped

```python
>>> import re
>>> re.findall(r'[^ab]', '^ab')  # unescaped, the caret means "not"
['^']
>>> re.findall(r'[\^ab]', '^ab')  # escaped, it is an ordinary character
['^', 'a', 'b']
```

5.2 A hyphen inside the brackets, not in the first or last position, needs to be escaped

```python
>>> import re
>>> re.findall(r'[a-c]', 'abc-')  # hyphen in the middle means "range"
['a', 'b', 'c']
>>> re.findall(r'[a\-c]', 'abc-')  # hyphen in the middle, escaped
['a', 'c', '-']
>>> re.findall(r'[-ac]', 'abc-')  # at the start, no escape needed
['a', 'c', '-']
>>> re.findall(r'[ac-]', 'abc-')  # at the end, no escape needed
['a', 'c', '-']
```

5.3 A right square bracket inside the brackets, not in the first position, needs to be escaped

```python
>>> import re
>>> re.findall(r'[]ab]', ']ab')  # unescaped right bracket in the first position
[']', 'a', 'b']
>>> re.findall(r'[a]b]', ']ab')  # unescaped right bracket, not in the first position
[]  # no match, because the pattern means "a" followed by "b]"
>>> re.findall(r'[a\]b]', ']ab')  # escaped, it is an ordinary character
[']', 'a', 'b']
```

6. Other metacharacters inside a character group

In general, to make metacharacters such as `.`, `+`, `?`, `(`, and `)` stand for themselves, we need to escape them. Inside the brackets of a character group, however, they do not need escaping. This applies to single-character metacharacters such as the dot (`.`), asterisk (`*`), plus sign (`+`), question mark (`?`), and left and right parentheses: inside the brackets they have no special meaning and simply stand for the character itself. Sequences such as `\d` or `\w`, on the other hand, keep their metacharacter meaning inside brackets.

```python
>>> import re
>>> re.findall(r'[.*+?()]', '[.*+?()]')  # single-character metacharacters
['.', '*', '+', '?', '(', ')']
>>> re.findall(r'[\d]', 'd12')  # \w and \d keep their metacharacter function inside brackets
['1', '2']  # matches the digits, not a backslash or the letter d
```

Six. A methodology

A methodology for writing regular expressions: if several different characters may appear at a position, use a character set. If several different strings may appear at a position, use alternation. If the number of occurrences is not fixed, use quantifiers. If there is a requirement on where the match appears, use anchors to pin the position.
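As a rough illustration of the four rules working together, the sketch below builds one pattern step by step. The matching task (phrases like "gray cat" or "silver cats") is invented for this example.

```python
import re

# Character set: the vowel in gray/grey may be a or e      -> gr[ae]y
# Alternation:   the colour may be gray/grey or silver     -> (?:gr[ae]y|silver)
# Quantifier:    the plural s may or may not be present    -> cats?
# Anchors:       nothing else is allowed before or after   -> ^ ... $
pattern = re.compile(r"^(?:gr[ae]y|silver) cats?$")

samples = ["gray cat", "grey cats", "silver cat", "gray catss", "a gray cat"]
results = {s: bool(pattern.search(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

The first three samples match, while the last two are rejected by the quantifier and the anchors respectively.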

https://chowdera.com/2020/12/202012070903290805.html