
30. Integrating Selenium-driven Chrome into Scrapy

2020-11-10 18:04:50 A lion in the sky


1、The spider file

dispatcher.connect() registers a signal handler: the first argument is the function to call when the signal fires, and the second argument is the signal to listen for. signals.spider_closed is the signal emitted when the spider finishes.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request,FormRequest
from selenium import webdriver                  # selenium, used to drive the browser
from scrapy.xlib.pydispatch import dispatcher   # signal dispatcher (removed in newer Scrapy versions)
from scrapy import signals                      # Scrapy signals

class PachSpider(scrapy.Spider):                            # the spider class; must inherit scrapy.Spider
    name = 'pach'                                           # spider name
    allowed_domains = ['www.taobao.com']                    # domains the spider is allowed to crawl

    def __init__(self):                                                                                 # initialization
        self.browser = webdriver.Chrome(executable_path='H:/py/16/adc/adc/Firefox/chromedriver.exe')    # create the Chrome browser object
        super(PachSpider, self).__init__()                                                              # call the base-class __init__
        dispatcher.connect(self.spider_closed, signals.spider_closed)       # connect spider_closed() to the spider_closed signal (see the explanation above)

        # requests issued from here pass through the RequestsChrometmiddware downloader middleware

    def spider_closed(self, spider):                                        # signal handler
        print('Spider closed, stopping the crawl')
        self.browser.quit()                                                 # close the browser

    def start_requests(self):    # start_requests() replaces start_urls and returns the initial request(s)
        return [Request(
            url='https://www.taobao.com/',
            callback=self.parse
        )]


    def parse(self, response):
        title = response.css('title::text').extract()   # the response body here is the browser-rendered page returned by the middleware
        print(title)
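
Note: the scrapy.xlib.pydispatch module used above was removed in newer Scrapy releases. A minimal sketch of the same signal hookup with the officially supported from_crawler classmethod and crawler.signals.connect(), reusing the spider and handler names from the example above:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals
from selenium import webdriver


class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['www.taobao.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PachSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider.browser = webdriver.Chrome()     # assumes chromedriver is on PATH
        # register the handler through the crawler's signal manager
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.browser.quit()                     # close the browser when the spider finishes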

2、middlewares.py middleware file

import time

from scrapy.http import HtmlResponse


class RequestsChrometmiddware(object):              # browser-access downloader middleware

    def process_request(self, request, spider):     # override the process_request method
        if spider.name == 'pach':                   # only intercept requests from the pach spider
            spider.browser.get(request.url)         # load the url in Chrome
            time.sleep(3)                           # crude wait for the page to render (see the explicit-wait sketch below)
            print('visited: {0}'.format(request.url))  # log the visited url
            # build the response from the browser's rendered page source,
            # so the spider parses it instead of a normal download
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding='utf-8', request=request)
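
The fixed time.sleep(3) above wastes three seconds on fast pages and may be too short on slow ones. A sketch of the same middleware using Selenium's explicit waits instead (the 10-second timeout and the body locator are illustrative assumptions, not from the original article):

from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class RequestsChrometmiddware(object):

    def process_request(self, request, spider):
        if spider.name == 'pach':
            spider.browser.get(request.url)
            # block until the <body> element is present, up to 10 seconds (assumed timeout)
            WebDriverWait(spider.browser, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, 'body'))
            )
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding='utf-8', request=request)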

3、settings.py: register the middleware in the configuration file

DOWNLOADER_MIDDLEWARES = {              # enable the downloader middlewares
   'adc.middlewares.RequestsUserAgentmiddware': 543,
   'adc.middlewares.RequestsChrometmiddware': 542,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in UserAgentMiddleware
}
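
With the middleware registered, the crawl runs as usual with scrapy crawl pach, but each request opens a visible Chrome window. To run without a window, Chrome can also be started headless; a minimal sketch for the spider's __init__, assuming a Selenium version that accepts the options keyword:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')       # run Chrome without a visible window
chrome_options.add_argument('--disable-gpu')    # often recommended together with headless mode
browser = webdriver.Chrome(options=chrome_options)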

Copyright notice
This article was written by [A lion in the sky]; please include a link to the original when reposting. Thanks.