Beautiful Soup is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.
1. Installation
pip install beautifulsoup4
2. Usage
- Import the module
from bs4 import BeautifulSoup
- Create a BeautifulSoup object
In [1]: from bs4 import BeautifulSoup

In [2]: text = '''
   ...: <div>
   ...: <ul>
   ...: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
   ...: <li class="item-1"><a href="link2.html">second item</a></li>
   ...: <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
   ...: <li class="item-1"><a href="link4.html">fourth item</a></li>
   ...: <li class="item-0"><a href="link5.html">fifth item</a></li>
   ...: </ul>
   ...: </div>
   ...: '''

In [3]: bs = BeautifulSoup(text)  # create a BeautifulSoup object; a string can be passed in directly

In [4]: bs1 = BeautifulSoup(open('./test.html'))  # a file object can also be passed in

In [5]: bs
Out[5]:
<html><body><div>
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>

When creating a BeautifulSoup object, you can pass in either a string or a file object. Beautiful Soup turns a complex HTML document into a tree structure and repairs the document automatically: the html and body nodes in the output above were added during parsing. Every node in the tree is a Python object.
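The html and body wrappers in the output above are consistent with the lxml parser, and which parser Beautiful Soup picks when none is named depends on what is installed. The sketch below names the parser explicitly. It assumes the standard-library-backed "html.parser", a shortened document, and io.StringIO as a stand-in for a real file; the variable names are illustrative only.

```python
import io
from bs4 import BeautifulSoup

html = '<div><ul><li class="item-0" id="first"><a href="link1.html">first item</a></li></ul></div>'

# Naming the parser avoids depending on whichever parser happens to be installed.
soup = BeautifulSoup(html, "html.parser")

# Any object with a .read() method works; StringIO stands in for an open file here.
with io.StringIO(html) as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")

# html.parser does not add <html>/<body> wrappers around a fragment.
has_html_wrapper = soup.find("html") is not None
print(has_html_wrapper)
print(soup_from_file.li["id"])
```

With a real file, opening it in a with block in the same way closes the handle once parsing is done.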
- Get a Tag object
In [6]: bs.ul  # get the ul tag
Out[6]:
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>

In [7]: type(bs.ul)
Out[7]: bs4.element.Tag

In [8]: bs.li  # get the li tag; note that only the first matching tag is returned
Out[8]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [12]: bs.ul.li.a  # chained tag lookup
Out[12]: <a href="link1.html">first item</a>

Appending '.tag_name' to a BeautifulSoup object returns the first tag with that name, and these lookups can be chained.
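One thing to watch with chained lookups is a tag that does not exist. The following is a minimal sketch, assuming the "html.parser" parser and a shortened document, of how a missing tag behaves and how to guard a chain against it.

```python
from bs4 import BeautifulSoup

html = '<ul><li id="first"><a href="link1.html">first item</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# A tag that is not in the document comes back as None rather than raising.
missing = soup.table

# Chaining through a missing tag would raise AttributeError, so guard the lookup.
first_cell = missing.td if missing is not None else None

link_text = soup.ul.li.a.get_text()
print(missing, first_cell, link_text)
```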
- Common attributes of Tag objects
- name: the tag name
In [13]: bs.name  # most of the time a BeautifulSoup object can be treated as a Tag; it is a special kind of Tag
Out[13]: '[document]'

In [14]: bs.li.name
Out[14]: 'li'

A BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object.
- attrs: all attributes of the tag, as a dictionary
In [15]: bs.attrs
Out[15]: {}

In [16]: bs.li.attrs  # all attributes, as a dictionary
Out[16]: {'class': ['item-0'], 'id': 'first'}

In [17]: bs.li.attrs['id']  # get a specific attribute, method 1
Out[17]: 'first'

In [18]: bs.li['id']  # method 2: '.attrs' can be omitted
Out[18]: 'first'

In [19]: bs.li.get('id')  # method 3: use the get() method
Out[19]: 'first'
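The three access styles above differ when the attribute is absent. A small sketch, assuming the "html.parser" parser and a one-tag document made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li class="item-0" id="first">x</li>', "html.parser")
li = soup.li

# Subscript access raises KeyError for an attribute the tag does not have.
try:
    li["href"]
    subscript_failed = False
except KeyError:
    subscript_failed = True

# get() returns None, or a caller-supplied default, instead of raising.
href = li.get("href")
href_or_default = li.get("href", "#")

# class can hold several values, so it comes back as a list.
classes = li.get("class")
```

For optional attributes, get() is usually the more convenient of the two.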
- string: the content of the tag
In [20]: bs.li.string  # the li tag contains only one a tag, so .string returns the content of the innermost a tag
Out[20]: 'first item'

In [21]: bs.li.a.string  # the content of the a tag
Out[21]: 'first item'

Note: if the tag's content is a comment, .string returns it without the comment markers. For example, for "<!-- This is a comment -->" it returns " This is a comment ".
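Because a comment comes back looking like ordinary text, it can help to check its type. The sketch below assumes the "html.parser" parser and a small made-up document. It also shows that .string is None for a tag with more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<ul><li><a href="link1.html">first item</a></li><li><!-- This is a comment --></li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li, second_li = soup.find_all("li")

# One nested string: .string reaches through the chain of single children.
text = first_li.string

# A comment comes back without its markers, so check the type to tell it from text.
maybe_comment = second_li.string
is_comment = isinstance(maybe_comment, Comment)

# A tag with more than one child has no single string, so .string is None.
ul_string = soup.ul.string
```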
- contents: the direct child nodes as a list, including the newline strings '\n'
In [22]: bs.ul.contents
Out[22]:
['\n',
 <li class="item-0" id="first"><a href="link1.html">first item</a></li>,
 '\n',
 <li class="item-1"><a href="link2.html">second item</a></li>,
 '\n',
 <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
 '\n',
 <li class="item-1"><a href="link4.html">fourth item</a></li>,
 '\n',
 <li class="item-0"><a href="link5.html">fifth item</a></li>,
 '\n']
- children: an iterator over the direct child nodes, also including the newline strings '\n'
In [28]: bs.ul.children  # returns an iterator object
Out[28]: <list_iterator at 0x7f2d9e90ea30>

In [29]: for child in bs.ul.children:
    ...:     print(child)
    ...:
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
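Since both contents and children include the whitespace strings between tags, a common follow-up is to keep only the Tag children. A minimal sketch, assuming the "html.parser" parser and a two-item version of the list:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<ul>
<li><a href="link1.html">first item</a></li>
<li><a href="link2.html">second item</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# contents holds every direct child, whitespace strings included.
all_children = soup.ul.contents

# Keep only Tag children to skip the newline strings between the li tags.
tag_children = [child for child in soup.ul.children if isinstance(child, Tag)]
```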
- descendants: a generator that recursively yields all descendant nodes when iterated
In [30]: bs.ul.descendants  # returns a generator object
Out[30]: <generator object Tag.descendants at 0x7f2d9e79fc80>

In [31]: for d in bs.ul.descendants:
    ...:     print(d)
    ...:
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<a href="link1.html">first item</a>
first item
<li class="item-1"><a href="link2.html">second item</a></li>
<a href="link2.html">second item</a>
second item
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
<span class="bold">third item</span>
third item
<li class="item-1"><a href="link4.html">fourth item</a></li>
<a href="link4.html">fourth item</a>
fourth item
<li class="item-0"><a href="link5.html">fifth item</a></li>
<a href="link5.html">fifth item</a>
fifth item
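To make the difference between children and descendants concrete, the sketch below collects tag names from both. It assumes the "html.parser" parser and a single nested list item taken from the example document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<ul><li><a href="link3.html"><span class="bold">third item</span></a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# children stops at the first level; descendants walks the whole subtree in document order.
child_names = [c.name for c in soup.ul.children if isinstance(c, Tag)]
descendant_names = [d.name for d in soup.ul.descendants if isinstance(d, Tag)]

# Text nodes show up among the descendants as well.
descendant_strings = [str(d) for d in soup.ul.descendants if not isinstance(d, Tag)]
```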
- Common methods of Tag objects
- find(self, name=None, attrs={}, recursive=True, text=None, **kwargs): returns only the first matching object
- name parameter: filters by tag name; accepts a string, a regular expression, or a list
In [32]: bs.find('li')  # find the first li tag
Out[32]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [33]: bs.find(['li','a'])  # find the first tag that is either li or a
Out[33]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [34]: import re

In [35]: bs.find(re.compile(r'^l'))  # find the first tag whose name starts with l; the li tag matches
Out[35]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [36]: bs.find(re.compile(r'l$'))  # find the first tag whose name ends with l; the html tag matches
Out[36]:
<html><body><div>
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
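The three forms of the name filter can be compared side by side. This is a sketch only, assuming the "html.parser" parser (so no html tag is added) and a shortened document. It also shows what find() returns when nothing matches.

```python
import re
from bs4 import BeautifulSoup

html = '<div><ul><li><a href="link1.html">first item</a></li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find("a")
by_list = soup.find(["span", "a"])        # first tag named span or a
by_regex = soup.find(re.compile(r"^l"))   # first tag whose name starts with "l"
not_found = soup.find("table")            # no match gives None, not an exception
```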
- attrs parameter: filters by attributes; takes a dict
In [37]: bs.find(attrs={'class':'item-1'})  # find the first tag whose class attribute is item-1
Out[37]: <li class="item-1"><a href="link2.html">second item</a></li>
- recursive parameter: if True, all descendant nodes are searched recursively for a match; otherwise only the direct child nodes are searched
In [38]: bs.find('li',recursive=True)  # recursive search; the li tag is found
Out[38]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [39]: bs.find('li',recursive=False)  # the only direct child node is html, so no li tag is found

In [40]: bs.ul.find('li',recursive=False)  # li is a direct child of ul, so it matches
Out[40]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
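The effect of recursive=False depends on where the search starts. The sketch below assumes the "html.parser" parser, under which the top-level child of the soup is the div rather than an added html tag, and uses a shortened document.

```python
from bs4 import BeautifulSoup

html = '<div><ul><li id="first">first item</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

# With html.parser the only direct child of the soup is the div.
top_level_li = soup.find("li", recursive=False)
top_level_div = soup.find("div", recursive=False)

# li is a grandchild of div, but a direct child of ul.
from_div = soup.div.find("li", recursive=False)
from_ul = soup.ul.find("li", recursive=False)
```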
- text parameter: searches the text content of the document; like the name parameter, it accepts a string, a regular expression, or a list
In [41]: bs.find(text='first item')  # find a string; the full content must be passed in, otherwise nothing matches
Out[41]: 'first item'

In [42]: bs.find(text=re.compile(r'item'))  # find the first content that contains item
Out[42]: 'first item'

In [43]: bs.find(text=re.compile(r'ir'))  # find the first content that contains ir
Out[43]: 'first item'

In [44]: bs.find(text=['second item','third item'])  # find the first content equal to second item or third item
Out[44]: 'second item'
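Newer Beautiful Soup releases document this argument under the name string, with text kept as the older spelling. The sketch below assumes bs4 4.4 or later and the "html.parser" parser. It also shows that the match is a string node whose .parent is the enclosing tag.

```python
import re
from bs4 import BeautifulSoup

html = '<ul><li><a href="link1.html">first item</a></li><li><span class="bold">third item</span></li></ul>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find(string="first item")            # the whole string must match
partial = soup.find(string=re.compile(r"third"))  # a regex may match part of the string
no_partial = soup.find(string="first")            # a plain string never matches partially

# The match is a string node; .parent is the tag that contains it.
enclosing_tag = partial.parent
```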
- Other keyword arguments: the keyword is the attribute name. Note that the class attribute cannot be passed this way, because class is a reserved word in Python
In [45]: bs.find(id='first')  # the id attribute as a keyword argument
Out[45]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [43]: bs.find(href='link4.html')  # the href attribute as a keyword argument
Out[43]: <a href="link4.html">fourth item</a>

In [44]: bs.find(class='item-inactive')  # class clashes with the Python keyword, so this raises an error
  File "<ipython-input-42-a9ab4a3f6cee>", line 1
    bs.find(class='item-inactive')
            ^
SyntaxError: invalid syntax
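Beautiful Soup offers two ways around the class clash: the class_ keyword (with a trailing underscore) and the attrs dictionary. A short sketch, assuming the "html.parser" parser and a made-up document whose data-rank attribute exists only for illustration:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item-0" id="first">a</li><li class="item-inactive" data-rank="3">b</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# class_ (trailing underscore) sidesteps the reserved word.
by_class_kw = soup.find(class_="item-inactive")

# The attrs dict works for any attribute name, including ones like data-rank
# that are not valid Python identifiers.
by_attrs = soup.find(attrs={"class": "item-inactive"})
by_data = soup.find(attrs={"data-rank": "3"})
```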
- find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs): returns all matching objects as a list; the parameters it shares with find() are used the same way
In [45]: bs.find_all('li')  # find all li tags
Out[45]:
[<li class="item-0" id="first"><a href="link1.html">first item</a></li>,
 <li class="item-1"><a href="link2.html">second item</a></li>,
 <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-1"><a href="link4.html">fourth item</a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]

In [46]: bs.find_all('li',attrs={"class":"item-1"})  # find all li tags whose class attribute is item-1
Out[46]:
[<li class="item-1"><a href="link2.html">second item</a></li>,
 <li class="item-1"><a href="link4.html">fourth item</a></li>]
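A typical use of find_all() is to pull one attribute out of every match. The sketch below assumes the "html.parser" parser and a three-item version of the list. It also uses the limit argument of find_all() and shows the empty result for an unmatched search.

```python
from bs4 import BeautifulSoup

html = """<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Collect one attribute from every match.
hrefs = [a["href"] for a in soup.find_all("a")]

# limit stops the search after that many matches.
first_two = soup.find_all("li", limit=2)

# An unmatched search gives an empty list, so loops need no None check.
tables = soup.find_all("table")
```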
- get() method: gets a specific attribute of a tag
In [47]: bs.li.get('class')  # class can hold multiple values, so it is returned as a list
Out[47]: ['item-0']

In [48]: bs.find(attrs={"class":"item-0"}).get('id')  # the id attribute value is returned as a string
Out[48]: 'first'

In [49]: bs.find_all('a')[1].get('href')
Out[49]: 'link2.html'
- get_text() method: gets the text content of a tag; for a tag that contains a single string, it returns the same text as the string attribute
In [50]: bs.li.get_text()  # get the innermost content of the first li tag
Out[50]: 'first item'

In [51]: bs.find(attrs={"class":"bold"}).get_text()  # get the content of the tag whose class attribute is bold (the span tag)
Out[51]: 'third item'

In [52]: bs.find_all('a')[3].get_text()  # get the content of the 4th a tag
Out[52]: 'fourth item'
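get_text() and .string agree for the single-string tags above, but they diverge once a tag has several children. A minimal sketch, assuming the "html.parser" parser and a two-item list:

```python
from bs4 import BeautifulSoup

html = """<ul>
<li><a href="link1.html">first item</a></li>
<li><a href="link2.html">second item</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# ul has several children, so .string is None ...
ul_string = soup.ul.string

# ... while get_text() joins every string in the subtree, newlines included.
raw_text = soup.ul.get_text()

# strip=True trims each piece and drops empty ones; the first argument is the separator.
clean_text = soup.ul.get_text(" ", strip=True)
```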
- select() method: looks up tags with CSS selectors; somewhat similar to find_all(), and it returns a list
- Find by tag name
In [53]: bs.select('li')  # find all li tags
Out[53]:
[<li class="item-0" id="first"><a href="link1.html">first item</a></li>,
 <li class="item-1"><a href="link2.html">second item</a></li>,
 <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-1"><a href="link4.html">fourth item</a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]
- Find by class name: prefix the class name with '.'
In [54]: bs.select('.bold')  # find tags with class='bold'
Out[54]: [<span class="bold">third item</span>]
- Find by id: prefix the id with '#'
In [55]: bs.select('#first')  # find the tag whose id is first
Out[55]: [<li class="item-0" id="first"><a href="link1.html">first item</a></li>]
- Combined search
In [56]: bs.select('.item-0 a')  # find a tags under tags with class="item-0"
Out[56]: [<a href="link1.html">first item</a>, <a href="link5.html">fifth item</a>]

In [57]: bs.select('#first a')  # find a tags under the tag with id="first"
Out[57]: [<a href="link1.html">first item</a>]

In [58]: bs.select('ul span')  # find span tags under ul
Out[58]: [<span class="bold">third item</span>]

In [59]: bs.select('ul>span')  # '>' means direct child; span is not a direct child of ul, so nothing matches
Out[59]: []

In [60]: bs.select('a>span')  # span is a direct child of a, so it matches
Out[60]: [<span class="bold">third item</span>]

To match only direct child tags, separate the selectors with '>'.
- Find by attribute
In [61]: bs.select('li[class="item-inactive"]')  # find li tags whose class attribute is 'item-inactive'
Out[61]: [<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>]

In [62]: bs.select('a[href="link2.html"]')  # find a tags whose href attribute is 'link2.html'
Out[62]: [<a href="link2.html">second item</a>]
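The selector forms above can be mixed freely, and select_one() is available when only the first match is wanted. The sketch below assumes a bs4 version that ships with CSS selector support and select_one(), the "html.parser" parser, and a three-item version of the list.

```python
from bs4 import BeautifulSoup

html = """<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_link = soup.select_one("#first a")

# Selectors combine: li tags with class item-1 that are direct children of ul.
direct_items = soup.select("ul > li.item-1")

# span sits inside a, not directly inside ul, so the child combinator finds nothing.
no_match = soup.select("ul > span")

hrefs = [a["href"] for a in soup.select("li a")]
```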