
(20201201 - problem solved) requests crawler: BeautifulSoup cannot extract tbody

2020-12-07 16:01:57 Quant_ Learner

  • Problem description

    In a crawler task, the content needed sits inside:

    <table class="table_search_">
        <tbody>
            <tr>...</tr>
            <tr>...</tr>
            <tr>...</tr>
    

    The table with class="table_search_" can be located, but it contains none of the wanted content. In other words, the content of tbody cannot be extracted.

  • Problem analysis

    [Crawler] xpath cannot locate the tbody tag (solved)

    The tbody tag is optional in HTML source. Chrome's Elements tab always shows a tbody (if the original page does not have one, Chrome adds it automatically). selenium returns the content of Chrome's Elements tab, so tbody is always present there.

    requests is different: if the source HTML does not contain tbody, the returned content does not contain it either.
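
    The difference can be checked offline with a minimal sketch. It assumes BeautifulSoup with the built-in "html.parser" backend, which keeps the markup as sent and does not insert tbody. The RAW_HTML string and the extract_rows helper are made up for illustration and do not come from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical raw HTML as a server might send it: no <tbody> in the source.
RAW_HTML = """
<table class="table_search_">
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
</table>
"""


def extract_rows(html):
    """Return the text of each table row, whether or not <tbody> is present."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table_search_")
    if table is None:
        return []
    # find_all searches all descendants, so the result does not depend on <tbody>.
    return [tr.get_text(strip=True) for tr in table.find_all("tr")]


soup = BeautifulSoup(RAW_HTML, "html.parser")
has_tbody = soup.find("tbody") is not None  # no tbody is added by this parser
rows = extract_rows(RAW_HTML)
print(has_tbody, rows)
```

    Selecting rows from the table itself, rather than through a table > tbody > tr path copied from Chrome, avoids depending on a tag the raw source may not contain.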

  • Solution

    Understanding crawlers in depth: web page analysis || inspect element

    The part of the article above about the role of Network packet capture explains the cause: the content under table is not in the HTML returned directly for the current page, so a direct GET of the current page cannot yield that content.

    Capture the traffic in the Network tab, work out which link returns the required content, and then send a separate GET request to that link.

    In this case two request headers are needed: Referer and User-Agent. For their meaning, see 《Understanding the meaning of Referer||User-Agent||Cookie… in HTTP request headers》
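
    The steps above can be sketched as follows, assuming the requests library. The helper names and the User-Agent string are placeholders, not values from the original site. The actual data link has to be read from the Network tab.

```python
import requests


def build_headers(page_url):
    """Build the two headers the post says are needed: Referer and User-Agent."""
    return {
        # The page that would normally trigger the data request in a browser.
        "Referer": page_url,
        # Placeholder browser-like value; copy the real one from the Network tab.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    }


def fetch_table_html(data_url, page_url, timeout=10):
    """GET the link that actually returns the table content.

    data_url is the link found in the Network tab (hypothetical here);
    page_url is the page being crawled and is sent as the Referer.
    """
    response = requests.get(data_url, headers=build_headers(page_url), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text
```

    A call would look like fetch_table_html(data_url, page_url) with both URLs taken from the browser. The returned text can then be passed to BeautifulSoup as usual.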

  • References

  1. scrapy's xpath cannot match the tbody tag
  2. The tbody problem when parsing web pages with xpath

Copyright notice
This article was written by [Quant_ Learner]. Please include a link to the original when reposting. Thanks.
https://chowdera.com/2020/12/20201207154739428b.html