Python crawler in practice: scraping images from Home of Pictures

2020-11-06 01:17:51 itread01

Preface

The text and images in this article come from the Internet and are intended only for learning and discussion, not for any commercial use. Copyright belongs to the original authors. If there is any problem, please contact us promptly so that we can handle it.

How do you implement a crawler in Python?

  • Simulate a browser
  • Send a request and receive the website's response
  • Extract the information we want from the raw data (data filtering)
  • Store the filtered data
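
The four steps above can be sketched with nothing but the standard library. This is only a minimal illustration, not the code used later in this article: the names `fetch`, `ImgCollector`, `extract_images` and `save` are hypothetical, the User-Agent value is a placeholder, and `html.parser` stands in for BeautifulSoup so that the sketch has no third-party dependencies.

```python
import os
import urllib.request
from html.parser import HTMLParser

# Placeholder header value (an assumption); any realistic browser string would do.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(url, timeout=10):
    """Steps 1-2: request the page while presenting browser-like headers."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


class ImgCollector(HTMLParser):
    """Step 3: collect (src, alt) pairs from every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            if attr.get("src"):  # ignore <img> tags without a source
                self.images.append((attr["src"], attr.get("alt") or ""))


def extract_images(html_text):
    """Run the collector over an HTML string and return its (src, alt) pairs."""
    parser = ImgCollector()
    parser.feed(html_text)
    return parser.images


def save(data, folder, filename):
    """Step 4: store downloaded bytes on disk, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as fh:
        fh.write(data)
    return path
```

Keeping each step in its own function makes it possible to check the extraction logic against a small HTML string before sending any real request.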

What tools are needed to build a crawler?

  • Python 3.6
  • PyCharm Professional

Target site

Home of pictures

https://www.tupianzj.com/

 

Crawler code

Import the tools

Python's built-in standard library (SSL support)

import ssl

 

Standard-library module, used to create the storage folder automatically

import os

 

Standard-library module used to download files

import urllib.request

 

Networking library (third-party package)

import requests

 

HTML parser and selector (third-party package)

from bs4 import BeautifulSoup

 

Skip certificate verification by default for HTTPS requests made through urllib

ssl._create_default_https_context = ssl._create_unverified_context
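
The assignment above replaces a private hook in the `ssl` module, so it turns verification off for every HTTPS request that urllib makes in the process. A narrower alternative, sketched here as an assumption and not as this article's method, is to build one unverified context and pass it only to the calls that need it. The function names `make_unverified_context` and `open_unverified` are hypothetical.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Return an SSL context that skips certificate and hostname checks."""
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def open_unverified(url, timeout=10):
    """Open a single URL without verification; other requests stay verified."""
    ctx = make_unverified_context()
    return urllib.request.urlopen(url, timeout=timeout, context=ctx)
```

Note that `urllib.request.urlretrieve` takes no `context` argument, so a scoped context like this only helps when the download goes through `urlopen`.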

 

Simulate a browser

headers = {
    'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
}

 

Create the storage folder automatically

# Create the folder only if it does not exist yet
os.makedirs('./ Illustration material /', exist_ok=True)

 

Send the request

url = 'https://www.tupianzj.com/meinv/mm/meizitu/'
html = requests.get(url, headers=headers).text
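
When `.text` is used, requests has to guess the page encoding, and a wrong guess garbles Chinese `alt` text. Whether the target site needs special handling is an assumption here; the sketch below only shows one hedged way to decode raw page bytes, preferring a declared charset over guessing. The helper name `decode_html` is hypothetical.

```python
import re


def decode_html(raw, declared=None):
    """Decode page bytes, preferring a declared charset over guessing."""
    candidates = []
    if declared:
        candidates.append(declared)
    # Look for charset=... (for example in a <meta> tag) within the first 2 KB.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong encoding: try the next one
    # Last resort: decode as UTF-8 and substitute undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

With requests, the raw bytes are available as `.content`, so a call could look like `html = decode_html(requests.get(url, headers=headers, timeout=10).content)`.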

 

Extract data from the page's raw HTML

soup = BeautifulSoup(html, 'lxml')
images_data = soup.find('ul', class_='d1 ico3').find_all_next('li')
for image in images_data:
    image_url = image.find_all('img')
    for _ in image_url:
        print(_['src'], _['alt'])
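
Indexing a tag with `_['src']` raises `KeyError` when the attribute is missing, and a relative `src` cannot be downloaded as it stands. The following sketch shows one way to clean the extracted pairs before downloading; it is an illustration under assumptions, and `normalize_images` and the fallback name `image_N` are hypothetical.

```python
from urllib.parse import urljoin


def normalize_images(page_url, records):
    """Turn raw (src, alt) pairs into absolute URLs with usable names."""
    cleaned = []
    for index, (src, alt) in enumerate(records, start=1):
        if not src:
            continue  # an <img> without src cannot be downloaded
        absolute = urljoin(page_url, src)  # resolves relative and //host paths
        name = (alt or "").strip() or "image_{}".format(index)
        cleaned.append((absolute, name))
    return cleaned
```

With BeautifulSoup, the input pairs can be built without risking `KeyError` via `[(img.get('src'), img.get('alt')) for img in image_url]`, since `get` returns `None` for a missing attribute.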

 

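Download the images

# Run this inside the inner `for _ in image_url:` loop above so that every
# image is saved; at top level only the last image would be downloaded.
try:
    urllib.request.urlretrieve(_['src'], './ Illustration material /' + _['alt'] + '.jpg')
except Exception as e:
    # Skip images that fail to download, but report the reason
    print('Download failed:', _['src'], e)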
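
Building the file name directly from `alt` fails when the text contains characters such as `/` or `?`, and it forces a `.jpg` extension on every file. A hedged sketch of a safer name builder follows; `safe_filename` is a hypothetical helper, and the set of replaced characters is an assumption chosen to cover common Windows and POSIX restrictions.

```python
import os
import re
from urllib.parse import urlparse


def safe_filename(alt, src, default_ext=".jpg"):
    """Build a filesystem-safe file name from alt text and the image URL."""
    # Replace runs of whitespace and characters that are invalid in file names.
    stem = re.sub(r'[\\/:*?"<>|\s]+', "_", alt or "").strip("_") or "image"
    # Reuse the extension from the URL path when there is one.
    ext = os.path.splitext(urlparse(src).path)[1].lower() or default_ext
    return stem + ext
```

The result can be joined to the storage folder with `os.path.join(folder, safe_filename(alt, src))` and passed to `urllib.request.urlretrieve` as the target path. Two images with the same `alt` text would still overwrite each other, so a counter or hash suffix may be needed in practice.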

 


Copyright notice
This article was created by [itread01]. Please include a link to the original when reposting. Thank you.