How to recursively crawl subpages with Scrapy

Submitted by 徘徊边缘 on 2019-12-22 18:36:27

Question


So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

  1. Category 1 name
    • Subcategory 1 name
      • data from this subcategory's page
    • Subcategory n name
      • data from this page
  2. Category n name
    • Subcategory 1 name
      • data from subcategory n's page

etc.
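As actual JSON, that might look something like this (the field names here are just placeholders I made up):

[
    {
        "category": "Category 1 name",
        "subcategories": [
            {"name": "Subcategory 1 name", "data": "data from this subcategory's page"},
            {"name": "Subcategory n name", "data": "data from this page"}
        ]
    }
]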

Eventually I want to be able to use this data with Elasticsearch.

I barely have any experience with Scrapy, and this is what I have so far (it just scrapes the category names from the first page; I have no idea what to do from there). From my research I believe I need to use a CrawlSpider, but I am unsure of what that entails (a rough sketch of my understanding follows my code below). I have also been advised to use BeautifulSoup. Any help would be greatly appreciated.

import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories',]

    def parse(self, response):
        for i in response.css('div.CategoryTreeSection'):
            yield {
                'categories': i.css('a::text').extract_first()
            }
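From what I can tell from the docs, a CrawlSpider declares Rules with link extractors instead of chaining callbacks by hand; my rough understanding is something like the sketch below (the restrict_css values are guesses, and I don't see how I would carry the category name along between pages):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class randomCrawlSpider(CrawlSpider):
    name = "helpme_crawl"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories']

    rules = (
        # Follow the category links inside the category tree without parsing them
        Rule(LinkExtractor(restrict_css='div.CategoryTreeSection'), follow=True),
        # Parse the subcategory pages reached from there
        # ('div.SubcategoryList' is a guess; adjust it to the real site)
        Rule(LinkExtractor(restrict_css='div.SubcategoryList'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Grab all paragraph text from the subcategory page
        yield {'url': response.url, 'text': response.css('p::text').extract()}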

Answer 1:


I'm not familiar with Elasticsearch, but I'd build the scraper like this:

import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories']

    def parse(self, response):
        for i in response.css('div.CategoryTreeSection'):
            # This is where you select the subcategory URL
            subcategory = i.css('Put your selector here').extract_first()
            # urljoin() turns a relative href into an absolute URL
            req = scrapy.Request(response.urljoin(subcategory), callback=self.parse_subcategory)
            req.meta['category'] = i.css('a::text').extract_first()
            yield req

    def parse_subcategory(self, response):
        yield {
            'category': response.meta.get('category'),
            'subcategory': response.css('Put your selector here').extract_first(),  # Select the name of the subcategory
            'subcategorydata': response.css('Put your selector here').extract(),  # Select the data of the subcategory
        }

You collect the subcategory URL and send a request for it. The response of this request is handled by parse_subcategory, and while sending the request we add the category name to its meta data.

In parse_subcategory you read the category name back out of the meta data and collect the subcategory data from the page.
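If you run this with Scrapy's built-in feed export, e.g. scrapy crawl helpme -o output.json, each yielded item becomes one flat entry in the file, so the output will look something like this (values made up) rather than the nested structure from the question:

[
    {"category": "Category 1 name", "subcategory": "Subcategory 1 name", "subcategorydata": "data from this subcategory's page"},
    {"category": "Category 1 name", "subcategory": "Subcategory n name", "subcategorydata": "data from this page"}
]

You can regroup the items by category in a small post-processing step before indexing them into Elasticsearch.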



Source: https://stackoverflow.com/questions/44293662/how-to-recursively-crawl-subpages-with-scrapy
