Python + Scrapy: Crawling Fang Tianxia
2020-11-08 08:04:32 【osc_x4ot1joy】
Scrapy explained
XPath node selection
The common marks are as follows.
Mark | Description |
---|---|
extract() | Converts the extracted content to Unicode strings; the return type is a list |
/ | Select from the root node |
// | Select matching nodes anywhere in the document from the current node, regardless of their position |
. | The current node |
@ | An attribute |
* | Any element node |
@* | Any attribute node |
node() | Any node of any type |
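To make the table concrete, here is a minimal sketch using Scrapy's Selector on a toy document (the HTML is invented for illustration):

from scrapy.selector import Selector

html = '<html><body><div id="box"><p class="x">hello</p></div><p class="x">world</p></body></html>'
sel = Selector(text=html)

# /  : walk down from the root node
print(sel.xpath('/html/body/div/@id').extract())        # ['box']
# // : match anywhere in the document, regardless of position
print(sel.xpath('//p[@class="x"]/text()').extract())    # ['hello', 'world']
# .  : query relative to the current node
box = sel.xpath('//div[@id="box"]')[0]
print(box.xpath('.//p/text()').extract())               # ['hello']
# @  : select an attribute
print(sel.xpath('//p/@class').extract())                # ['x', 'x']
# *  : any element node under the div
print(sel.xpath('//div/*').extract())                   # ['<p class="x">hello</p>']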
Crawling Fang Tianxia - Prelude
Analysis
1. Website URL: https://sh.newhouse.fang.com/house/s/
2. Determine the data to crawl: 1) page URL: page; 2) property name: name; 3) price: price; 4) address: address; 5) phone number: tel
3. Analyze the web page.
After opening the URL we can see the data we need, and we can also see that the listings are paginated.
Inspecting the page elements shows that all the data we need sits inside one pair of ul tags.
Expanding a pair of li tags, the name we need is under an a tag, and the text is padded with stray whitespace such as line breaks that needs special handling.
The price we need is under the tag that holds the figure (55000 in the example). Be careful: houses that have already sold out show no price, so step around that pit.
By analogy we can locate the corresponding address and tel.
The pagination tags show that the a element for the current page has class="active". When the home page is first opened, that a element's text is 1, meaning the first page. The "last" link drives the stop condition, as sketched below.
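To make the pagination rule concrete, here is a small sketch; the markup is invented, the class names follow the analysis above, and the 尾页/首页 link texts are my assumption about the page:

from scrapy.selector import Selector

html = '''
<div class="page">
  <a class="active">1</a>
  <a href="/house/s/b92/">2</a>
  <a class="last" href="/house/s/b910/">尾页</a>
</div>
'''
sel = Selector(text=html)
currentpage = sel.xpath('//a[@class="active"]/text()').extract_first()   # '1'
lastpage = sel.xpath('//a[@class="last"]/text()').extract_first()        # '尾页' until the final page
if lastpage != '首页':   # on the final page the "last" link flips to 首页 (home page)
    next_url = 'https://sh.newhouse.fang.com/house/s/b9%d/' % (int(currentpage) + 1)
    print(next_url)   # https://sh.newhouse.fang.com/house/s/b92/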
Crawling Fang Tianxia - The concrete implementation
First, create a new Scrapy project.
1) Switch to the project folder and enter scrapy startproject hotel in the Terminal console. hotel is the demo's project name; you can name it however you need. The generated layout is sketched below.
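For orientation, the command produces the standard Scrapy scaffold, roughly:

hotel/
    scrapy.cfg
    hotel/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py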
2) Configure the parameters in the items.py file as needed. The analysis calls for five parameters: page, name, price, address, tel. The configuration code is as follows:
import scrapy

class HotelItem(scrapy.Item):
    # These field names must match the keys the spider fills in
    page = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    address = scrapy.Field()
    tel = scrapy.Field()
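A quick illustration of how such an item behaves: a scrapy.Item is filled and read like a dict (the values here are made up):

item = HotelItem()
item['name'] = ['Demo House']
item['price'] = ['55000']
print(dict(item))   # {'name': ['Demo House'], 'price': ['55000']}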
3) Create the spider itself. Switch to the spiders folder and enter scrapy genspider house sh.newhouse.fang.com in the Terminal console.
house is the spider's name (customizable as well); sh.newhouse.fang.com restricts the domain being crawled.
This creates the house.py file under the spiders folder.
The code implementation and explanation are as follows:
import scrapy
from ..items import HotelItem


class HouseSpider(scrapy.Spider):
    name = 'house'
    # Restrict crawling to this domain
    allowed_domains = ['sh.newhouse.fang.com']
    # The landing page of the crawl
    start_urls = ['https://sh.newhouse.fang.com/house/s/']

    def start_requests(self):
        for url in self.start_urls:
            # Pass the callback itself, without parentheses; Scrapy invokes it with the response
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The page number currently displayed (text of <a class="active">)
        currentpage = response.xpath('//a[@class="active"]/text()').extract_first()
        # Text of the "last" link; it switches to 首页 (home page) once we reach the final page
        lastpage = response.xpath('//a[@class="last"]/text()').extract_first()
        # Step down to the innermost container of each listing.
        # // matches anywhere in the document; the leading . below keeps each query relative to the current node
        for each in response.xpath('//div[@class="nl_con clearfix"]/ul/li/div[@class="clearfix"]/div[@class="nlc_details"]'):
            item = HotelItem()
            # Name; the stray whitespace around the text is stripped below
            name = each.xpath('.//div[@class="house_value clearfix"]/div[@class="nlcd_name"]/a/text()').extract()
            # Price; sold-out houses show no price, so this list may be empty
            price = each.xpath('.//div[@class="nhouse_price"]/span/text()').extract()
            # Address
            address = each.xpath('.//div[@class="relative_message clearfix"]/div[@class="address"]/a/@title').extract()
            # Telephone
            tel = each.xpath('.//div[@class="relative_message clearfix"]/div[@class="tel"]/p/text()').extract()
            # The keys here must match the fields declared in items.py
            item['name'] = [n.replace(' ', '').replace('\n', '').replace('\t', '').replace('\r', '') for n in name]
            item['price'] = price
            item['address'] = address
            item['tel'] = tel
            item['page'] = ['https://sh.newhouse.fang.com/house/s/b9' + str(int(currentpage) + 1) + '/?ctm=1.sh.xf_search.page.2']
            yield item
        # On the last page the "last" link switches to 首页 (home page); stop there
        if lastpage != '首页':
            # Not the last page yet: keep crawling the next page until all pages are done
            yield scrapy.Request(
                url='https://sh.newhouse.fang.com/house/s/b9' + str(int(currentpage) + 1) + '/?ctm=1.sh.xf_search.page.2',
                callback=self.parse)
4) Run the crawler: from inside the project, enter scrapy crawl house in the Terminal console.
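As a side note, if you only need the raw scraped items, Scrapy's built-in feed export can save them with no extra code, e.g. scrapy crawl house -o house.json.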
The results are shown in the following figure
The overall project structure is shown on the right
The tts folder is where I store the scraped data as txt files on my side; it is not required for this project.
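For completeness, here is a minimal pipeline sketch that writes each yielded item into a txt file under such a folder. The tts path and file name are my assumptions; to activate it, register the class in settings.py, e.g. ITEM_PIPELINES = {'hotel.pipelines.HotelPipeline': 300}.

# pipelines.py
import json
import os

class HotelPipeline:
    def open_spider(self, spider):
        # Assumed output location; adjust as needed
        os.makedirs('tts', exist_ok=True)
        self.f = open(os.path.join('tts', 'house.txt'), 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON line per scraped item
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()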
If you find any errors, please contact me on WeChat: sunyong8860
On the road of Python crawling.
Copyright notice
This article was created by [osc_x4ot1joy]; please include a link to the original when reposting. Thanks.