Notes on Python cookbook 3rd (2.4): string matching and searching

2020-11-10 10:46:26 Giant ship

String matching and searching

Problem

You want to match or search text for a specific pattern.

Solution

If the text you want to match is a simple literal, you can usually just call the basic string methods, such as str.find(), str.endswith(), str.startswith(), or similar:

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> # Exact match
>>> text == 'yeah'
False
>>> # Match at start or end
>>> text.startswith('yeah')
True
>>> text.endswith('no')
False
>>> # Search for the location of the first occurrence
>>> text.find('no')
10
>>>
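
As a small sketch of the same string methods (the variable names here are illustrative, not from the recipe): str.find() returns -1 when the substring is absent, the in operator is enough for a yes/no answer, and startswith()/endswith() accept a tuple of alternatives.

```python
text = 'yeah, but no, but yeah, but no, but yeah'

# find() returns the index of the first occurrence, or -1 if absent
first_no = text.find('no')
missing = text.find('maybe')

# rfind() reports the last occurrence instead of the first
last_no = text.rfind('no')

# 'in' is enough when only a yes/no answer is needed
has_but = 'but' in text

# startswith()/endswith() accept a tuple of candidate strings
ends_ok = text.endswith(('no', 'yeah'))

print(first_no, missing, last_no, has_but, ends_ok)
```

Because find() signals failure with -1 rather than an exception, compare its result against -1 explicitly instead of testing it for truthiness (index 0 is a valid hit).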

For more complicated matching, use regular expressions and the re module. To illustrate the basic mechanics of regular expressions, suppose you want to match dates written as digits, such as 11/27/2012. Here is one way to do it:

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>
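
The if tests above work because re.match() returns a match object on success and None on failure. A minimal sketch that wraps this in a function (looks_like_date is a hypothetical helper name, not part of the recipe):

```python
import re

def looks_like_date(s):
    """Return True if s starts with digits/digits/digits (hypothetical helper)."""
    # re.match() returns a match object on success and None on failure
    return re.match(r'\d+/\d+/\d+', s) is not None

print(looks_like_date('11/27/2012'))    # True
print(looks_like_date('Nov 27, 2012'))  # False
```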

If you are going to perform many matches with the same pattern, it usually pays to precompile the pattern string into a pattern object first. For example:

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if datepat.match(text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>
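
A typical use of a precompiled pattern is to apply it to many strings in a loop. The following sketch assumes a small made-up list of inputs and filters it with the compiled pattern:

```python
import re

# Compile once, outside the loop
datepat = re.compile(r'\d+/\d+/\d+')

# Sample inputs, made up for illustration
lines = ['11/27/2012', 'Nov 27, 2012', '3/13/2013 is the start of PyCon']

# match() only looks at the start of each string
matching = [line for line in lines if datepat.match(line)]
print(matching)
```

The third string is kept because it begins with a date, even though more text follows it.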

match() always tries to find the match at the start of a string. If you want to search the text for every occurrence of the pattern, wherever it appears, use the findall() method instead. For example:

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']
>>>
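
When only the first occurrence is needed rather than the whole list, the search() method is another option. A short sketch contrasting it with match() on the same text:

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() fails because the text does not begin with a date
at_start = datepat.match(text)

# search() scans forward and stops at the first occurrence
first = datepat.search(text)

print(at_start)                     # None
print(first.group(), first.span())  # 11/27/2012 (9, 19)
```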

When defining regular expressions, it is common to introduce capture groups by enclosing parts of the pattern in parentheses. For example:

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>>

Capture groups often simplify later processing of the matched text, because the contents of each group can be extracted individually. For example:

>>> m = datepat.match('11/27/2012')
>>> m
<_sre.SRE_Match object at 0x1005d2750>
>>> # Extract the contents of each group
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups()
('11', '27', '2012')
>>> month, day, year = m.groups()
>>>
>>> # Find all matches (notice splitting into tuples)
>>> text
'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>> for month, day, year in datepat.findall(text):
...     print('{}-{}-{}'.format(year, month, day))
...
2012-11-27
2013-3-13
>>>
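
Groups can also be given names with the (?P<name>...) syntax, which avoids relying on their position. The sketch below is one possible variation on the pattern above, not the recipe's own code; it also converts the fields to integers so the date can be zero-padded.

```python
import re

# (?P<name>...) attaches a name to each capture group
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

m = datepat.match('3/13/2013')

# Named groups can be read individually or all at once as a dict
year = m.group('year')
fields = m.groupdict()

# Convert to int so the output can be zero-padded
iso = '{:04d}-{:02d}-{:02d}'.format(
    int(fields['year']), int(fields['month']), int(fields['day']))
print(iso)  # 2013-03-13
```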

The findall() method searches the text and returns all matches as a list. If you want to find matches iteratively instead, use the finditer() method. For example:

>>> for m in datepat.finditer(text):
...     print(m.groups())
...
('11', '27', '2012')
('3', '13', '2013')
>>>
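
Unlike findall(), finditer() yields full match objects, so the position of each match is available as well as its text. A brief sketch using the start() and end() methods:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each item is a match object, so offsets are available too
positions = [(m.group(0), m.start(), m.end()) for m in datepat.finditer(text)]

for matched, start, end in positions:
    print(matched, start, end)
```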

Discussion

This recipe covers the basics of using the re module to match and search text. The essential steps are to compile a pattern with re.compile() and then call methods such as match(), findall(), or finditer().

When specifying patterns, it is relatively common to use raw strings such as r'(\d+)/(\d+)/(\d+)'. Such strings leave the backslash uninterpreted, which is useful in regular expressions. Otherwise, you need to use double backslashes, as in '(\\d+)/(\\d+)/(\\d+)'.
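
To make the equivalence concrete, here is a small sketch comparing a raw pattern string with its double-backslash form (the variable names are illustrative):

```python
import re

raw = r'(\d+)/(\d+)/(\d+)'
escaped = '(\\d+)/(\\d+)/(\\d+)'

# Both spellings produce exactly the same string
same = raw == escaped

# In a raw string the backslash is kept as-is: r'\n' is two characters,
# whereas '\n' is a single newline character
lengths = (len(r'\n'), len('\n'))

groups = re.match(escaped, '11/27/2012').groups()
print(same, lengths, groups)
```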

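Be aware that match() only checks the beginning of a string, so it may match things you do not expect. For example:

>>> m = datepat.match('11/27/2012abcdef')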
>>> m
<_sre.SRE_Match object at 0x1005d27e8>
>>> m.group()
'11/27/2012'
>>>

If you want an exact match, make sure the pattern ends with the end-of-string marker ($), as in the following:

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)$')
>>> datepat.match('11/27/2012abcdef')
>>> datepat.match('11/27/2012')
<_sre.SRE_Match object at 0x1005d2750>
>>>
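
One detail worth knowing: $ also matches just before a single trailing newline. If that matters, the fullmatch() method (available since Python 3.4) is an alternative that requires the whole string to match. A minimal sketch of the difference:

```python
import re

with_dollar = re.compile(r'(\d+)/(\d+)/(\d+)$')
plain = re.compile(r'(\d+)/(\d+)/(\d+)')

# '$' matches at the end of the string and also just before a trailing newline
dollar_newline = with_dollar.match('11/27/2012\n') is not None

# fullmatch() succeeds only if the entire string matches the pattern
full_newline = plain.fullmatch('11/27/2012\n') is not None
full_exact = plain.fullmatch('11/27/2012') is not None
full_extra = plain.fullmatch('11/27/2012abcdef') is not None

print(dollar_newline, full_newline, full_exact, full_extra)
```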

Finally, if you are only doing a simple text matching or searching operation, you can often skip the compilation step and use the module-level functions in the re module directly. For example:

>>> re.findall(r'(\d+)/(\d+)/(\d+)', text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>>

Be aware, though, that if you are going to do a lot of matching or searching, it usually pays to compile the pattern first and reuse it. The module-level functions keep a cache of recently compiled patterns, so the performance hit is not large, but a precompiled pattern saves the cache lookup and some extra processing.
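
One way to check the difference on your own machine is a rough timing with timeit. The sketch below is only illustrative: the function names and the iteration count are arbitrary choices, and the timings will vary between systems and Python versions.

```python
import re
import timeit

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern = r'(\d+)/(\d+)/(\d+)'
datepat = re.compile(pattern)

def module_level():
    # Looks the pattern up in re's internal cache on every call
    return re.findall(pattern, text)

def precompiled():
    # Calls the method on the already compiled pattern object
    return datepat.findall(text)

n = 10000  # arbitrary iteration count
t_module = timeit.timeit(module_level, number=n)
t_compiled = timeit.timeit(precompiled, number=n)
print('module-level: {:.4f}s  precompiled: {:.4f}s'.format(t_module, t_compiled))
```

Both functions return the same list of tuples; only the per-call overhead differs.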

Copyright notice
This article was written by [Giant ship]. Please include a link to the original when reposting. Thanks.