Scraping the Weibo Hot Search List with Scrapy

Submitted by ╄→尐↘猪︶ㄣ on 2020-03-03 17:16:32

Installation

pip install Scrapy
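To confirm the install worked, Scrapy ships a version command:

scrapy version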

Create the project

scrapy startproject weiboHotSearch
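For reference, startproject generates a skeleton roughly like the following (the exact file list varies slightly across Scrapy versions):

weiboHotSearch/
    scrapy.cfg            # deploy configuration
    weiboHotSearch/       # the project's Python module
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py      # item pipelines (edited below)
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py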

Create the spider

cd weiboHotSearch
scrapy genspider weibo s.weibo.com

Define the Item

Edit items.py in weiboHotSearch and add fields for the keyword, its search URL, and its heat count:

import scrapy


class WeibohotsearchItem(scrapy.Item):
    keyword = scrapy.Field()  # hot-search keyword text
    url = scrapy.Field()      # link to the keyword's search results page
    count = scrapy.Field()    # heat count shown next to the keyword

Write the spider

  1. Set start_urls to the hot-search page; note that it must be a list.

  2. Use XPath to extract the data.

    For XPath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp

    While working out the expressions, you can run scrapy shell "https://s.weibo.com/top/summary" to debug them interactively (see the shell sketch after the full spider code below).

  3. Import the Item class and yield the scraped data as Item objects.

  4. Run the spider with scrapy crawl weibo.

    (The original post shows a screenshot of the console output here.)

    The complete code of weibo.py:

import scrapy

from weiboHotSearch.items import WeibohotsearchItem


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary']

    def parse(self, response):
        # Each hot-search entry is the second cell of a table row.
        for i in response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]'):
            keyword = i.xpath('a/text()').extract_first()
            url = 'https://s.weibo.com' + i.xpath('a/@href').extract_first()
            # extract_first() returns None when the node is missing
            count = i.xpath('span/text()').extract_first()
            print(keyword)
            print(count)
            # print(url)
            item = WeibohotsearchItem()
            item['keyword'] = keyword
            item['url'] = url
            item['count'] = count
            yield item
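As mentioned in step 2, the XPath expressions above are easiest to work out in a live shell. A minimal sketch of such a session (the exact results depend on the live page):

scrapy shell "https://s.weibo.com/top/summary"
# inside the shell, response is already bound to the fetched page:
>>> rows = response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]')
>>> rows[0].xpath('a/text()').extract_first()     # first keyword
>>> rows[0].xpath('span/text()').extract_first()  # its heat count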

Saving the data

  • Save with Feed exports

The following command saves the scraped items to items.json:

scrapy crawl weibo -o items.json
cat items.json
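One caveat: the keywords are Chinese, and Scrapy's JSON exporter escapes non-ASCII characters as \uXXXX by default. Setting FEED_EXPORT_ENCODING (a standard Scrapy setting) in settings.py keeps the output human-readable:

FEED_EXPORT_ENCODING = 'utf-8'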
  • Save with an Item Pipeline

    1. Write the pipeline

      Edit pipelines.py and add the saving logic:

      class WeibohotsearchPipeline(object):
          def __init__(self):
              # utf-8 so the Chinese keywords are written correctly
              self.f = open('items.csv', 'w', encoding='utf-8')

          def process_item(self, item, spider):
              # str() guards against missing fields (extract_first() can
              # return None); commas inside a keyword are not quoted here
              res = str(item['keyword']) + ',' + str(item['count']) + ',' + str(item['url']) + '\n'
              self.f.write(res)
              return item

          def close_spider(self, spider):
              self.f.close()
    2. Enable the item pipeline

      Add the following to settings.py to enable the pipeline:

      ITEM_PIPELINES = {
         'weiboHotSearch.pipelines.WeibohotsearchPipeline': 300,
      }
    3. Run

      scrapy crawl weibo 
      cat items.csv

(The original post shows a screenshot of the resulting data here.)
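If the crawl yields no items, the request may be getting blocked rather than the XPath being wrong. Two settings.py tweaks that often help (whether s.weibo.com currently requires them is an assumption, not something the original post covers):

ROBOTSTXT_OBEY = False  # new projects obey robots.txt by default
# send a browser-like User-Agent instead of Scrapy's default:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'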

Project repository

https://gitee.com/yu-se/scrapy-test

References

https://scrapy-chs.readthedocs.io/zh_CN/latest/
