
draft

2021-01-24 00:39:37 Elite-Wang

Beautiful Soup is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.

I. Installation

pip install beautifulsoup4

II. Usage

  1. Import the module

    from bs4 import BeautifulSoup
  2. Create a BeautifulSoup object

    In [1]: from bs4 import BeautifulSoup
    
    In [2]: text = '''
       ...: <div>
       ...:     <ul>
       ...:         <li class="item-0" id="first"><a href="link1.html">first item</a></li>
       ...:         <li class="item-1"><a href="link2.html">second item</a></li>
       ...:         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
       ...:         <li class="item-1"><a href="link4.html">fourth item</a></li>
       ...:         <li class="item-0"><a href="link5.html">fifth item</a></li>
       ...:     </ul>
       ...: </div>
       ...: '''
    
    In [3]: bs = BeautifulSoup(text, 'lxml')  # create a BeautifulSoup object; a string can be passed in directly (naming the parser avoids a warning)
    In [4]: bs1 = BeautifulSoup(open('./test.html'), 'lxml')  # a file object can be passed in as well
    In [5]: bs
    Out[5]: 
    <html><body><div>
    <ul>
    <li class="item-0" id="first"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </body></html>

    When creating a BeautifulSoup object, you can pass in either a string or a file object. Beautiful Soup turns a complex HTML document into a tree structure and repairs the document automatically along the way, as with the html and body nodes added in the example above. Every node in the tree is a Python object.
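The two input forms described above can be sketched as follows. This is a minimal sketch, not the author's exact session: the markup in `html` is a shortened stand-in, `io.StringIO` stands in for a real file opened with `open()`, and the parser is assumed to be the standard-library `'html.parser'` so that nothing beyond `beautifulsoup4` needs to be installed. Note that `'html.parser'` does not add the html and body wrappers seen in the output above.

```python
import io

from bs4 import BeautifulSoup

# Shortened stand-in for the article's sample markup.
html = '<ul><li id="first">first item</li><li>second item</li></ul>'

# Form 1: pass the markup as a string.
soup_from_str = BeautifulSoup(html, 'html.parser')

# Form 2: pass any file-like object (a real file from open() works the same way).
soup_from_file = BeautifulSoup(io.StringIO(html), 'html.parser')

print(str(soup_from_str) == str(soup_from_file))
print(soup_from_str.li['id'])
```

For a real file, opening it in a `with open(path) as f:` block closes the handle once parsing is done.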

  3. Get Tag objects

    In [6]: bs.ul # get the ul tag
    Out[6]: 
    <ul>
    <li class="item-0" id="first"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    
    In [7]: type(bs.ul)
    Out[7]: bs4.element.Tag
    
    In [8]: bs.li # get the li tag; note that only the first matching tag is returned
    Out[8]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
    
    In [12]: bs.ul.li.a # chain tag lookups
    Out[12]: <a href="link1.html">first item</a>

    Append '.tag_name' to a BeautifulSoup object to get the tag you are looking for. These lookups can be chained.
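One thing to watch when chaining: a lookup for a tag that does not exist returns None, so the next link in the chain would fail. A minimal sketch, assuming a shortened version of the sample markup and the `'html.parser'` parser (the names `link`, `missing` and `first_row` are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><ul><li><a href="link1.html">first item</a></li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

# Chained lookup: each step returns the first matching descendant tag.
link = soup.ul.li.a

# There is no <table> in the document, so this lookup returns None.
missing = soup.table

# Guard before chaining further; None.tr would raise AttributeError.
first_row = missing.tr if missing is not None else None

print(link['href'], missing, first_row)
```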

  4. Common attributes of Tag objects

    1. name ---- the tag name

      In [13]: bs.name # in most cases the BeautifulSoup object can be treated as a Tag object; it is a special Tag
      Out[13]: '[document]'
      
      In [14]: bs.li.name
      Out[14]: 'li'

      A BeautifulSoup object represents the document as a whole. In most cases you can treat it as a Tag object.

    2. attrs ---- all of the tag's attributes, as a dictionary

      In [15]: bs.attrs
      Out[15]: {}
      
      In [16]: bs.li.attrs # show all attributes as a dictionary
      Out[16]: {'class': ['item-0'], 'id': 'first'}
      
      In [17]: bs.li.attrs['id'] # get a specific attribute, method 1
      Out[17]: 'first'
      
      In [18]: bs.li['id'] # get a specific attribute, method 2: '.attrs' can be omitted
      Out[18]: 'first'
      
      In [19]: bs.li.get('id') # get a specific attribute, method 3: use the get method
      Out[19]: 'first'
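The three access styles differ in how they treat an attribute that is not there. The sketch below uses a made-up one-line `li` tag (with a second class value, `active`, added for illustration) and the `'html.parser'` parser to show the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li class="item-0 active" id="first">x</li>', 'html.parser')
li = soup.li

# class is a multi-valued attribute, so it comes back as a list.
classes = li['class']

# get() returns None (or a supplied default) when the attribute is missing.
missing = li.get('title')
fallback = li.get('title', 'n/a')

# Subscript access raises KeyError for a missing attribute.
try:
    li['title']
    raised = False
except KeyError:
    raised = True

# has_attr() checks for presence without fetching the value.
has_id = li.has_attr('id')

print(classes, missing, fallback, raised, has_id)
```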

       

    3. string ---- the text content of the tag

      In [20]: bs.li.string # the li tag contains only an a tag, so .string returns the content of the innermost a tag
      Out[20]: 'first item'
      
      In [21]: bs.li.a.string # returns the content of the a tag
      Out[21]: 'first item'

      Note: if the tag's content is a comment, the comment markers are stripped. For “<!-- This is a comment -->”, .string returns " This is a comment ".
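Because the comment markers are stripped, a comment can be mistaken for ordinary text. A small sketch of how to tell them apart, using made-up markup and the `'html.parser'` parser; it also shows that `.string` is None when a tag has more than one child:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = '<p><!-- This is a comment --></p><div>a<b>b</b></div>'
soup = BeautifulSoup(markup, 'html.parser')

# The value looks like plain text, but its type is Comment.
value = soup.p.string
is_comment = isinstance(value, Comment)

# div has two children (a string and a b tag), so .string is None.
mixed = soup.div.string

print(repr(value), is_comment, mixed)
```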

    4. contents ---- the direct child nodes as a list, including newline strings '\n'

      In [22]: bs.ul.contents
      Out[22]: 
      ['\n',
       <li class="item-0" id="first"><a href="link1.html">first item</a></li>,
       '\n',
       <li class="item-1"><a href="link2.html">second item</a></li>,
       '\n',
       <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
       '\n',
       <li class="item-1"><a href="link4.html">fourth item</a></li>,
       '\n',
       <li class="item-0"><a href="link5.html">fifth item</a></li>,
       '\n']
    5. children ---- an iterator over the direct child nodes, again including newline strings '\n'

      In [28]: bs.ul.children # returns a list iterator object
      Out[28]: <list_iterator at 0x7f2d9e90ea30>
      
      In [29]: for child in bs.ul.children:
          ...:     print(child)
          ...: 
      
      
      <li class="item-0" id="first"><a href="link1.html">first item</a></li>
      
      
      <li class="item-1"><a href="link2.html">second item</a></li>
      
      
      <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
      
      
      <li class="item-1"><a href="link4.html">fourth item</a></li>
      
      
      <li class="item-0"><a href="link5.html">fifth item</a></li>

       

    6. descendants ---- a generator object; iterating over it recursively yields every descendant node

      In [30]: bs.ul.descendants # returns a generator object; iterating over it recursively yields every descendant node
      Out[30]: <generator object Tag.descendants at 0x7f2d9e79fc80>
      
      In [31]: for d in bs.ul.descendants:
          ...:     print(d)
          ...: 
      
      
      <li class="item-0" id="first"><a href="link1.html">first item</a></li>
      <a href="link1.html">first item</a>
      first item
      
      
      <li class="item-1"><a href="link2.html">second item</a></li>
      <a href="link2.html">second item</a>
      second item
      
      
      <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
      <a href="link3.html"><span class="bold">third item</span></a>
      <span class="bold">third item</span>
      third item
      
      
      <li class="item-1"><a href="link4.html">fourth item</a></li>
      <a href="link4.html">fourth item</a>
      fourth item
      
      
      <li class="item-0"><a href="link5.html">fifth item</a></li>
      <a href="link5.html">fifth item</a>
      fifth item
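The descendants generator mixes tags and strings, as the output above shows. The sketch below, on a shortened two-item list parsed with `'html.parser'`, separates the two: an isinstance check picks out the tags, and the `stripped_strings` generator collects only the non-blank text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<ul>\n'
    '<li><a href="link1.html">first item</a></li>\n'
    '<li><a href="link3.html"><span class="bold">third item</span></a></li>\n'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# Names of every descendant tag, in document order.
tag_names = [node.name for node in soup.ul.descendants if isinstance(node, Tag)]

# Text nodes only, with whitespace-only strings skipped.
texts = list(soup.ul.stripped_strings)

print(tag_names)
print(texts)
```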
  5. Common methods of Tag objects

    1. find(self, name=None, attrs={}, recursive=True, text=None, **kwargs) ---- returns only the first matching object

      1. name parameter ---- filters by tag name; it accepts three forms: a string, a regular expression, or a list

        In [32]: bs.find('li') # find the first matching li tag
        Out[32]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
        
        In [33]: bs.find(['li','a']) # find the first tag that is either li or a
        Out[33]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
        
        In [34]: import re
        
        In [35]: bs.find(re.compile(r'^l')) # find the first tag whose name starts with l; the li tag matches
        Out[35]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
        
        In [36]: bs.find(re.compile(r'l$')) # find the first tag whose name ends with l; the html tag matches
        Out[36]: 
        <html><body><div>
        <ul>
        <li class="item-0" id="first"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
        </div>
        </body></html>
      2. attrs parameter ---- filters by attribute; takes a dict

        In [37]: bs.find(attrs={'class':'item-1'}) # find the first tag whose class attribute is item-1
        Out[37]: <li class="item-1"><a href="link2.html">second item</a></li>
      3. recursive parameter ---- if True (the default), the search descends recursively through all descendant nodes; if False, only direct child nodes are searched

        In [38]: bs.find('li',recursive=True) # recursive search; the li tag is found
        Out[38]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
        
        In [39]: bs.find('li',recursive=False) # the only direct child of bs is html, so no li tag is found and None is returned
        
        In [40]: bs.ul.find('li',recursive=False) # li tags are direct children of ul, so a match is found
        Out[40]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
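The effect of recursive is easiest to see with nested lists. The sketch below uses made-up markup with a list inside a list, parsed with `'html.parser'`, and compares the two settings on the same starting tag:

```python
from bs4 import BeautifulSoup

html = '<div><ul><li>one</li><li>two<ul><li>nested</li></ul></li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
outer = soup.div.ul

# Default (recursive=True): every li below the outer ul, nested one included.
all_items = outer.find_all('li')

# recursive=False: only the li tags that are direct children of the outer ul.
direct_items = outer.find_all('li', recursive=False)

# The document's only direct child is div, so this finds nothing.
top_level = soup.find('li', recursive=False)

print(len(all_items), len(direct_items), top_level)
```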
      4. text parameter ---- searches the text content of the document; like name, it accepts three forms: a string, a regular expression, or a list (Beautiful Soup 4.4.0 and later also accept this argument under the name string)

        In [41]: bs.find(text='first item') # find a string; the full text must be given, otherwise nothing matches
        Out[41]: 'first item'
        
        In [42]: bs.find(text=re.compile(r'item')) # find the first text containing item
        Out[42]: 'first item'
        
        In [43]: bs.find(text=re.compile(r'ir')) # find the first text containing ir
        Out[43]: 'first item'
        
        In [44]: bs.find(text=['second item','third item']) # find the first text that is either 'second item' or 'third item'
        Out[44]: 'second item'
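The same searches can be written with the string argument, which is the name Beautiful Soup 4.4.0 and later use for text. The sketch below assumes such a version is installed and uses a shortened two-item list parsed with `'html.parser'`. It also combines a tag name with string, which returns the tag rather than the text node.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# An exact string must match the whole text node.
exact = soup.find(string='first item')
partial = soup.find(string='first')

# A regular expression can match part of the text.
all_items = soup.find_all(string=re.compile(r'item$'))

# Tag name plus string: returns the a tag whose text matches.
link = soup.find('a', string='second item')

print(repr(exact), partial, all_items, link['href'])
```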
      5. Other keyword arguments ---- the keyword is the attribute name; note that the class attribute cannot be passed this way, because class is a Python keyword (Beautiful Soup accepts class_ instead)

        In [45]: bs.find(id='first') # the id attribute as a keyword argument
        Out[45]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
        
        In [43]: bs.find(href='link4.html') # the href attribute as a keyword argument
        Out[43]: <a href="link4.html">fourth item</a>
        
        In [44]: bs.find(class='item-inactive') # class is a Python keyword, so using it as a keyword argument is a syntax error
          File "<ipython-input-42-a9ab4a3f6cee>", line 1
            bs.find(class='item-inactive')
                    ^
        SyntaxError: invalid syntax
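Beautiful Soup works around the keyword clash with a class_ argument (note the trailing underscore). A minimal sketch on a shortened version of the sample list, parsed with `'html.parser'`, showing that class_ and the attrs dict give the same result:

```python
from bs4 import BeautifulSoup

html = (
    '<ul>'
    '<li class="item-0" id="first">a</li>'
    '<li class="item-1">b</li>'
    '<li class="item-inactive">c</li>'
    '<li class="item-1">d</li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# class_ avoids the clash with the Python keyword.
inactive = soup.find(class_='item-inactive')
ones = soup.find_all('li', class_='item-1')

# Equivalent lookup through the attrs dictionary.
via_attrs = soup.find(attrs={'class': 'item-inactive'})

print(inactive.get_text(), len(ones), inactive == via_attrs)
```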
    2. find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) ---- returns every matching object as a list; the parameters shared with find() are used in the same way

      In [45]: bs.find_all('li') # find all li tags
      Out[45]: 
      [<li class="item-0" id="first"><a href="link1.html">first item</a></li>,
       <li class="item-1"><a href="link2.html">second item</a></li>,
       <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
       <li class="item-1"><a href="link4.html">fourth item</a></li>,
       <li class="item-0"><a href="link5.html">fifth item</a></li>]
      
      In [46]: bs.find_all('li',attrs={"class":"item-1"}) # find all li tags whose class attribute is item-1
      Out[46]: 
      [<li class="item-1"><a href="link2.html">second item</a></li>,
       <li class="item-1"><a href="link4.html">fourth item</a></li>]
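Since find_all returns a list, it combines naturally with list comprehensions. The sketch below, on a shortened three-item list parsed with `'html.parser'`, collects every href, caps the number of results with the limit argument, and shows that a search with no matches returns an empty list rather than None:

```python
from bs4 import BeautifulSoup

html = (
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '<li><a href="link3.html">third item</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# Collect one attribute from every match.
hrefs = [a['href'] for a in soup.find_all('a')]

# limit stops the search after the given number of matches.
first_two = soup.find_all('li', limit=2)

# No match gives an empty list, so looping over the result is always safe.
none_found = soup.find_all('table')

print(hrefs, len(first_two), none_found)
```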
    3. get() method ---- gets a specific attribute of a tag

      In [47]: bs.li.get('class') # class can hold multiple values, so it is returned as a list
      Out[47]: ['item-0']
      
      In [48]: bs.find(attrs={"class":"item-0"}).get('id') # returns the id attribute value as a string
      Out[48]: 'first'
      
      In [49]: bs.find_all('a')[1].get('href')
      Out[49]: 'link2.html'
    4. get_text() method ---- gets the text inside the tag; when the tag wraps a single string, the result matches the string attribute (get_text() joins the text of all descendants, whereas string is None when the tag has more than one child)

      In [50]: bs.li.get_text() # get the text inside the first li tag
      Out[50]: 'first item'
      
      In [51]: bs.find(attrs={"class":"bold"}).get_text() # get the text inside the tag whose class attribute is bold (the span tag)
      Out[51]: 'third item'
      
      In [52]: bs.find_all('a')[3].get_text() # get the text inside the 4th a tag
      Out[52]: 'fourth item' 
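The two only agree when a tag wraps a single string. A short sketch of where they diverge, using made-up markup with two paragraphs inside a div and the `'html.parser'` parser. It also shows the separator and strip arguments of get_text():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p> a </p><p> b </p></div>', 'html.parser')
div = soup.div

# div has two children, so .string cannot pick one and returns None.
as_string = div.string

# get_text() joins the text of every descendant, whitespace included.
joined = div.get_text()

# separator and strip tidy up the pieces before joining them.
clean = div.get_text(separator='|', strip=True)

print(as_string, repr(joined), clean)
```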
    5. select() method ---- CSS selectors; similar to find_all, it returns a list

      1. Find by tag name

        In [53]: bs.select('li') # find all li tags
        Out[53]: 
        [<li class="item-0" id="first"><a href="link1.html">first item</a></li>,
         <li class="item-1"><a href="link2.html">second item</a></li>,
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
         <li class="item-1"><a href="link4.html">fourth item</a></li>,
         <li class="item-0"><a href="link5.html">fifth item</a></li>]
      2. Find by class name: prefix the class name with '.'

        In [54]: bs.select('.bold') # find tags with class='bold'
        Out[54]: [<span class="bold">third item</span>]
      3. Find by id: prefix the id with '#'

        In [55]: bs.select('#first') # find the tag whose id is first
        Out[55]: [<li class="item-0" id="first"><a href="link1.html">first item</a></li>]
      4. Combined selectors

        In [56]: bs.select('.item-0 a') # find a tags under tags with class="item-0"
        Out[56]: [<a href="link1.html">first item</a>, <a href="link5.html">fifth item</a>]
        
        In [57]: bs.select('#first a') # find a tags under the tag with id="first"
        Out[57]: [<a href="link1.html">first item</a>]
        
        In [58]: bs.select('ul span') # find span tags under ul
        Out[58]: [<span class="bold">third item</span>]
        
        In [59]: bs.select('ul>span') # ">" selects direct children only; span is not a direct child of ul, so nothing matches
        Out[59]: []
        
        In [60]: bs.select('a>span') # span is a direct child of the a tag, so it matches
        Out[60]: [<span class="bold">third item</span>]

        To match direct children only, separate the selectors with >.

      5. Find by attribute

        In [61]: bs.select('li[class="item-inactive"]') # find li tags whose class attribute is 'item-inactive'
        Out[61]: [<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>]
        
        In [62]: bs.select('a[href="link2.html"]') # find a tags whose href attribute is 'link2.html'
        Out[62]: [<a href="link2.html">second item</a>]
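When only the first match is needed, select_one() plays the same role for CSS selectors that find() plays for find_all(). The sketch below uses a shortened three-item list parsed with `'html.parser'`, and assumes a Beautiful Soup version whose select() supports standard CSS attribute operators such as `$=` (recent releases do):

```python
from bs4 import BeautifulSoup

html = (
    '<ul>'
    '<li class="item-0" id="first"><a href="link1.html">first item</a></li>'
    '<li class="item-inactive"><a href="link3.html">'
    '<span class="bold">third item</span></a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first match, or None when nothing matches.
first_match = soup.select_one('li.item-inactive a')
no_match = soup.select_one('table')

# Direct-child selector combined with attribute extraction.
hrefs = [a['href'] for a in soup.select('li > a')]

# Attribute operator: href values ending in "4.html".
ends_with = soup.select('a[href$="4.html"]')

print(first_match['href'], no_match, hrefs, ends_with[0].get_text())
```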

         

Copyright notice
This article was written by [Elite-Wang]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2021/01/20210124003846973J.html
