
Scraping Dianping (Dazhong Dianping) restaurant data (November 2020)

2020-11-10 17:23:23 · Awesome.

1. Introduction to the target data

The crawl target is the 750 restaurant records listed under the "food" category of Dianping's Beijing site, automatically sorted by "popularity". An example is shown below:

1.1 Attribute descriptions

The attribute values that need to be crawled are listed in the following table (a sample record follows the table):

Attribute            Name          Data type
Shop name            title         str
Star rating          star          float
Number of reviews    review_num    int
Per-capita cost      cost          int
Features             feature       str
Address              address       str
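
For concreteness, each scraped record ends up as one Python dict whose keys match the table above (the values below are invented purely for illustration):

# A hypothetical example of one scraped record (all values invented)
record = {
    'title': '某烤鸭店',        # shop name (invented example)
    'star': 4.5,                # star rating
    'review_num': 12034,        # number of reviews
    'cost': 150,                # per-capita cost in RMB
    'feature': '烤鸭',          # features
    'address': '前门大街30号',  # address (invented example)
}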

1.2 Pagination rules

Browsing the listing pages shows that each page contains at most 15 records, over 50 pages in total. The crawler therefore needs to fetch the pages in a loop.

Furthermore, note that the URL of the first page is http://www.dianping.com/beijing/ch10/o2. It is natural to guess that ch10 and o2 encode the data category and sort order. Moving to the next page confirms this: the second page's URL is http://www.dianping.com/beijing/ch10/o2p2, so the URLs of pages 3 through 50 should run from */ch10/o2p3 to */ch10/o2p50, as sketched below.
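
Based on this pattern, the page loop can be sketched as follows. This is a minimal sketch: parse_page is a hypothetical placeholder for the per-page parsing built in section 2.2, and the full request headers are given in section 2.1.

import time
import requests

base_url = 'http://www.dianping.com/beijing/ch10/o2'
page_headers = {'User-Agent': 'Mozilla/5.0'}  # see section 2.1 for the full headers

def page_urls(n_pages=50):
    # Page 1 has no suffix; pages 2..50 append p2..p50
    yield base_url
    for p in range(2, n_pages + 1):
        yield base_url + 'p' + str(p)

for url in page_urls():
    page_html = requests.get(url, headers=page_headers).text
    # parse_page(page_html)  # hypothetical: extract up to 15 records per page
    time.sleep(2)            # pause between requests to avoid triggering anti-bot checks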

1.3 Data extraction paths

Reading the page source in the browser's developer tools shows that the shop records on a page are stored inside the div tag whose ID is shop-all-list, as in the figure below; each li tag holds one shop record.

Opening an li tag shows that the required attribute data is stored in the div tag with class='txt'.

Taking the shop name title as an example, the name is stored in an h4 child tag.

In the same way, drilling down level by level reveals the storage paths of the star, review_num, cost, feature, and address attributes; see the inspection sketch below.
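
One convenient way to discover these paths is to dump a single record's subtree and read it directly; a small sketch using the html object built in section 2.1:

from lxml import etree

# Print the HTML subtree of the first shop record for inspection
li = html.xpath('//*[@id="shop-all-list"]/ul/li[1]')[0]
print(etree.tostring(li, pretty_print=True, encoding='unicode'))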

1.4 Dianping's anti-crawling measures

While crawling we found that Dianping encrypts some of the data it serves to crawlers, as shown below:

On the rendered page the data displays normally: the per-capita cost appears as the RMB symbol '¥' followed by digits.

In the source code shown in developer tools, however, some of the digits are garbled and cannot be read.

This is caused by Dianping's anti-crawling measures: reading the sub-tag text in the conventional way likewise yields garbled characters. The solution is presented later, in section 2.3.

2. Crawling process

2.1 Fetching the HTML source with requests

import requests
from lxml import etree

page_url = 'http://www.dianping.com/beijing/ch10/o2'
# Add headers: User-Agent imitates a normal browser; the cookie is needed when login authorization is required.
page_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36", "cookie": "your cookie"}

# Send the HTTP request
res_http = requests.get(page_url, headers=page_headers)
html = etree.HTML(res_http.text)
# res_http.text is the page's HTML source as a str. It cannot be queried
# directly, so etree.HTML from the lxml module parses it into an element
# tree (xml-style structured data), which is assigned to the html object.
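
Before parsing, it is worth confirming that the request returned the real listing rather than a block page. A minimal check, under the assumption that a blocked or logged-out request does not contain the shop-all-list container:

# Sanity check: a blocked request usually returns a non-200 status or a
# verification page that lacks the shop list container
assert res_http.status_code == 200, res_http.status_code
if 'shop-all-list' not in res_http.text:
    raise RuntimeError('Listing not found: the cookie may be missing or the request was blocked')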

2.2 Parsing the HTML for each shop's data

# With the parsed html in hand, loop over the records and extract each shop's information.
# A page may hold fewer than 15 records, so first count the records on the current page.
# Inspection shows the string <div class="tit"> appears once per record,
# so it can serve as a detection tag.

import re  # regular expression module, used to count the records

# Keys of each record's dict, matching the getters below
keys = ['title', 'star', 'review_num', 'cost', 'feature', 'address', 'recommend']
data = []  # the data is stored as a list of dicts; each dict represents one shop
# Count the shop records on the current page
record_tags = re.findall(re.compile(r'<div class="tit">'), res_http.text)
nRecord = len(record_tags)
# The woff_dict_* mappings are built in section 2.3
for n in range(nRecord):
    ID = n + 1    # record number within the page (XPath indices start at 1)
    values = []   # list of values for one record, reset on every iteration
    values.append(getTitle(ID, html))                         # shop name
    values.append(getStar(ID, html))                          # star rating
    values.append(getReviewNum(ID, html, woff_dict_shopNum))  # number of reviews
    values.append(getCost(ID, html, woff_dict_shopNum))       # per-capita cost
    values.append(getFeature(ID, html, woff_dict_tagName))    # shop features
    values.append(getAddress(ID, html, woff_dict_address))    # address
    values.append(getRecommend(ID, html))                     # recommended dishes
    # Build a dict for this record and append it to the data list
    data.append(dict(zip(keys, values)))

getTitle and the other getters are user-defined functions that perform the actual parsing. Taking getTitle as an example:

def getTitle(ID, html):
    # The shop name sits in the h4 tag under the ID-th li record
    return html.xpath('//*[@id="shop-all-list"]/ul/li[' + str(ID) + ']/div[2]/div[1]/a/h4')[0].text
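
A quick usage check, assuming the page has been fetched and parsed as in section 2.1:

# Print the name of the first shop record on the page
print(getTitle(1, html))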

At this point we can parse all of a single shop's data on a page. Combining 2.1 and 2.2, it is straightforward to build the complete structure that crawls the data automatically. Printing the data object gives the result shown in the figure:

In the figure above, the strings beginning with "\u" are Unicode code points produced by the anti-crawling remapping. Before the anti-crawling step is handled, printing them yields meaningless boxes, for example:

The next section introduces the approach to undoing the anti-crawling scheme.

2.3 Handling the anti-crawling scheme

Some research reveals that Dianping uses a web-font anti-crawling strategy: the site creates a custom font that moves a set of common characters to new Unicode code points. Because the server pairs the custom font with the remapped code points, the browser can render the remapped characters normally. A crawler that reads the raw HTML source and extracts sub-tag values, however, has no local copy of the font file and cannot resolve the remapped code points, so they display as boxes.

The fix, therefore, is to fetch the corresponding font file, build a local mapping between the common characters and their custom Unicode code points, and substitute through that mapping during the parsing step of 2.2.

2.3.1 Obtaining the font files

Because the fonts used on the page are determined by a CSS file, finding that CSS file should lead to the corresponding font files.

Taking the address attribute as an example, locate the corresponding code segment in the browser's developer tools, as shown in the figure:

The Styles panel below shows the CSS file that every tag with class=address depends on.

Opening that file shows that it defines the font files for four kinds of tags: "reviewTag", "address", "shopNum", and "tagName", as pictured:

So parsing this CSS file yields the corresponding font files. The file's name is irregular and most likely randomly generated, so we locate it through the reference statement in the HTML source, as follows:

Clearly, "svgtextcss" uniquely identifies the tag holding that URL, so a regular expression can extract it from the HTML source.

# Download the woff files. res_http is the return value of the requests.get
# call in section 2.1; its .text attribute is the page's HTML source string.
def getWoff(page_html):
    woff_files = []
    # Extract the CSS file's URL
    css_str = re.findall(re.compile(r'//.*/svgtextcss/.*\.css'), page_html)[0]
    css_url = 'https:' + css_str
    # Fetch the CSS file over HTTP and parse out the woff file URLs
    res_css = requests.get(css_url).text
    woff_urls = re.findall(re.compile(r'//s3plus.meituan.net/v1/mss_\w{32}/font/\w{8}\.woff'), res_css)
    tags = ['tagName', 'reviewTag', 'shopNum', 'address']
    for nNum, url in enumerate(woff_urls):
        res_woff = requests.get('http:' + url)
        with open('./resources/woff/' + tags[nNum] + '.woff', 'wb') as f:
            f.write(res_woff.content)
        woff_files.append('./resources/woff/' + tags[nNum] + '.woff')
    return dict(zip(tags, woff_files))

woff_file = getWoff(res_http.text)
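
One practical note: open() fails if the target directory does not exist, so create it before calling getWoff (a small addition not in the original code):

import os

os.makedirs('./resources/woff', exist_ok=True)  # ensure the output directory exists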

After obtaining the woff files, they can be opened with the FontCreator software, as shown in the figure:

As you can see, there are 603 characters in total, arranged and encoded in a fixed order. So once we have the mapping between the characters and their Unicode code points, we can replace the anti-crawling font with ordinary characters in a standard encoding.

2.3.2 Parsing the character mapping

Having obtained the woff font files, we need to parse out the mapping between the characters and their Unicode code points. The fontTools module does the job.

from fontTools.ttLib import TTFont

woff = TTFont(woff_file_URL)  # read one woff file, e.g. woff_file['shopNum'] from section 2.3.1
# Glyph IDs 2..602 in the woff file hold 601 characters in a fixed order:
# the digits 1234567890 followed by common Chinese characters. The full
# 601-character string is abbreviated here; transcribe it in glyph order
# from the font viewer.
woff_str_601 = '1234567890店中美家馆小车大市公酒行国品发电金心业商司...'  # abbreviated; 601 characters in full
# ['cmap'] holds the character-to-Unicode mapping tables
woff_unicode = woff['cmap'].tables[0].ttFont.getGlyphOrder()  # the 603 glyph names ('uniXXXX' codes)
woff_character = ['.notdef', 'x'] + list(woff_str_601)  # prepend the two special glyphs numbered 0 and 1
woff_dict = dict(zip(woff_unicode, woff_character))

This finally yields woff_dict, the mapping dictionary for one specific woff file. In the parsing step of 2.2 it can then be used to replace the anti-crawling font with ordinary characters, as sketched below.
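
The substitution itself is not shown in the original, so here is a minimal sketch under two assumptions: the font's glyph names follow the common 'uniXXXX' convention, and the remapped characters appear in the element text as single code points. The getCost below is a hypothetical version of the getter referenced in section 2.2, with an illustrative XPath rather than the page's verified path:

def decode_woff(text, woff_dict):
    # Replace each remapped character with its real glyph via woff_dict.
    # Normalize glyph-name case so the lookup works whether the font names
    # its glyphs 'uniF8F1' or 'unif8f1'.
    table = {k.lower(): v for k, v in woff_dict.items()}
    out = []
    for ch in text:
        name = 'uni%04x' % ord(ch)       # e.g. '\uf8f1' -> 'unif8f1'
        out.append(table.get(name, ch))  # leave unmapped characters unchanged
    return ''.join(out)

# Hypothetical getter using the mapping (illustrative XPath, not verified)
def getCost(ID, html, woff_dict_shopNum):
    nodes = html.xpath('//*[@id="shop-all-list"]/ul/li[' + str(ID) + ']//b')
    raw = nodes[0].xpath('string(.)') if nodes else ''
    decoded = decode_woff(raw, woff_dict_shopNum)
    digits = ''.join(c for c in decoded if c.isdigit())  # drop the '¥' symbol etc.
    return int(digits) if digits else None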

2.3.3 Additional notes

Because the CSS file that carries the font information is randomly generated, the order of its contents is not fixed, yet the parsing in step 2.3.1 hard-codes the order of the four tag classes. In real use, a more general data structure is needed to extract the font files correctly; one possible approach is sketched below.
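
A minimal sketch of that idea, assuming each @font-face block in the CSS declares a font-family whose name contains the tag name (something like "PingFangSC-Regular-shopNum") alongside its .woff URL; both details are assumptions about the CSS layout rather than verified facts:

import re

def get_woff_map(css_text):
    # Map each tag class to its woff URL by reading the CSS itself,
    # instead of relying on a fixed order of the four font files.
    tags = ['tagName', 'reviewTag', 'shopNum', 'address']
    woff_map = {}
    pattern = r'@font-face\s*\{\s*font-family:\s*"([^"]+)";[^}]*?url\("([^"]+\.woff)"\)'
    for family, woff_url in re.findall(pattern, css_text):
        for tag in tags:
            if tag.lower() in family.lower():
                woff_map[tag] = 'https:' + woff_url
    return woff_map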

Copyright notice
This article was created by [Awesome.]; please include a link to the original when reposting. Thanks.