Installation

```bash
pip install Scrapy
```
Create the project

```bash
scrapy startproject weiboHotSearch
```
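For reference, `startproject` generates the standard Scrapy layout (the exact file list may vary slightly across Scrapy versions):

```
weiboHotSearch/
├── scrapy.cfg
└── weiboHotSearch/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```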
Create the spider

```bash
cd weiboHotSearch
scrapy genspider weibo s.weibo.com
```
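`genspider` creates weibo.py under weiboHotSearch/spiders with a skeleton roughly like this (the template output differs slightly between Scrapy versions):

```python
import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['http://s.weibo.com/']

    def parse(self, response):
        pass
```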
Write the Item

Modify items.py in weiboHotSearch and add the item fields:
```python
import scrapy


class WeibohotsearchItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    keyword = scrapy.Field()
    url = scrapy.Field()
    count = scrapy.Field()
```
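Scrapy items behave like dicts, which makes a quick sanity check easy. A minimal sketch, run from the project root; the sample values are made up:

```python
from weiboHotSearch.items import WeibohotsearchItem

# Items accept dict-style construction and item['field'] access.
item = WeibohotsearchItem(keyword='example', count='123456')
item['url'] = 'https://s.weibo.com/weibo?q=example'  # hypothetical URL
print(dict(item))  # {'keyword': 'example', 'count': '123456', 'url': '...'}
```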
Write the spider
- Modify `start_urls`; note that it must be a list.
- Parse the data with XPath. For XPath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp
- While working out the parsing, you can run `scrapy shell "https://s.weibo.com/top/summary"` to debug your XPath expressions interactively (see the shell sketch after this list).
- Import the Item and return the scraped data as Item objects.
- Run `scrapy crawl weibo` to execute the spider and check its console output.
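For instance, the selectors used in the spider below can be verified in the shell before committing to them. A sketch of such a session, assuming the page structure as of the original post:

```python
# Start with: scrapy shell "https://s.weibo.com/top/summary"
# Each ranking row's second cell holds the link and the search count.
rows = response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]')
rows[0].xpath('a/text()').extract_first()     # keyword text
rows[0].xpath('a/@href').extract_first()      # relative link, e.g. '/weibo?q=...'
rows[0].xpath('span/text()').extract_first()  # search count
```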
The full code of weibo.py:
```python
import scrapy
from weiboHotSearch.items import WeibohotsearchItem


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary']

    def parse(self, response):
        for i in response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]'):
            keyword = i.xpath('a/text()').extract_first()
            url = 'https://s.weibo.com' + i.xpath('a/@href').extract_first()
            count = i.xpath('span/text()').extract_first()
            print(keyword)
            print(count)
            # print(url)
            item = WeibohotsearchItem()
            item['keyword'] = keyword
            item['url'] = url
            item['count'] = count
            yield item
```
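One caveat not covered in the original post: `extract_first()` returns `None` when a selector matches nothing, and the string concatenations above and in the CSV pipeline below would then raise a TypeError. A defensive variant of the `parse` method (a sketch under that assumption, drop-in for the one above) skips incomplete rows:

```python
def parse(self, response):
    for i in response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]'):
        keyword = i.xpath('a/text()').extract_first()
        href = i.xpath('a/@href').extract_first()
        count = i.xpath('span/text()').extract_first()
        # extract_first() returns None on no match; skip such rows so
        # later string concatenation cannot fail on None.
        if not (keyword and href and count):
            continue
        item = WeibohotsearchItem()
        item['keyword'] = keyword
        item['url'] = 'https://s.weibo.com' + href
        item['count'] = count
        yield item
```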
Save the data
- Save with Feed export

Use the following commands to save the data to items.json and inspect it:

```bash
scrapy crawl weibo -o items.json
cat items.json
```
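As a quick sanity check, the export can be loaded back in Python; Scrapy's JSON feed is a single array of item objects. A minimal sketch, assuming items.json sits in the current directory:

```python
import json

# Scrapy's JSON feed exporter writes one array containing all item dicts.
with open('items.json', encoding='utf-8') as f:
    items = json.load(f)

# Each entry carries the fields defined in WeibohotsearchItem.
for it in items[:3]:
    print(it['keyword'], it['count'], it['url'])
```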
- Save with an Item Pipeline

Write the pipeline:
Modify pipelines.py and add the saving logic:

```python
class WeibohotsearchPipeline(object):
    def __init__(self):
        self.f = open('items.csv', 'w')

    def process_item(self, item, spider):
        res = item['keyword'] + ',' + item['count'] + ',' + item['url'] + "\n"
        self.f.write(res)
        return item
```
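Note that the pipeline above never closes the file and produces a malformed row if a keyword contains a comma. A more robust variant (a sketch, not the original author's code; the class name WeibohotsearchCsvPipeline is made up) uses the csv module plus Scrapy's open_spider/close_spider hooks:

```python
import csv


class WeibohotsearchCsvPipeline:
    # open_spider/close_spider are standard Scrapy pipeline hooks.
    def open_spider(self, spider):
        self.f = open('items.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.f)  # quotes fields containing commas

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow([item['keyword'], item['count'], item['url']])
        return item
```

Enabling it works the same way as in the next step; just point ITEM_PIPELINES at this class instead.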
Enable the item pipeline

Add the following to settings.py to enable the pipeline (300 is the pipeline's order value in the 0-1000 range; lower values run first):

```python
ITEM_PIPELINES = {
    'weiboHotSearch.pipelines.WeibohotsearchPipeline': 300,
}
```
Run

```bash
scrapy crawl weibo
cat items.csv
```
Project repository
https://gitee.com/yu-se/scrapy-test
Reference documentation
https://scrapy-chs.readthedocs.io/zh_CN/latest/
Source: https://www.cnblogs.com/lzyuid/p/12403151.html