How can I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?

久未见 submitted on 2020-01-03 17:22:46

Question


I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:

from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader

import json
import datetime
import re

class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product

I can call it from the command line like so, and it does a wonderful job:

scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'

However, I'm trying to add a CrawlSpider-based spider for crawling the entire site that inherits from it and re-uses the parse method logic. My first attempt looked like this:

class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )

Running this spider with

scrapy crawl atlantic_firearms_crawler

crawls the site but never parses any items. I think it's because CrawlSpider apparently has its own definition of parse, so somehow I'm screwing things up.

When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully, crawling the whole site and parsing items successfully. But then if I try to run my original atlantic_firearms spider again, it errors out with NotImplementedError, apparently because Spider-based spiders expect the callback for start_urls to be a method named parse.
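For reference, that NotImplementedError comes from Scrapy's Spider base class, whose default parse is essentially an abstract stub, while CrawlSpider reserves the name parse for its own rule dispatching. A minimal plain-Python sketch of that shape (stand-in classes, not Scrapy's actual source):

```python
# Sketch of why renaming parse breaks a Spider-based spider:
# the base class's default parse raises NotImplementedError.

class Spider:
    def parse(self, response):
        # Scrapy's default: subclasses must supply their own parse.
        raise NotImplementedError(
            "{}.parse callback is not defined".format(self.__class__.__name__))

class CrawlSpider(Spider):
    def parse(self, response):
        # CrawlSpider overrides parse for its rule-dispatching logic,
        # which is why user callbacks must use another name.
        return "dispatched through rules"

class RenamedSpider(Spider):
    # Renaming the callback to parse_item leaves the base stub in place,
    # so responses to start_urls hit NotImplementedError.
    def parse_item(self, response):
        return "item"

try:
    RenamedSpider().parse(None)
except NotImplementedError:
    print("NotImplementedError raised")
```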

What's the best way for me to re-use my logic between these spiders so that I can both feed a JSON array of start_urls as well as do full-site crawls?


Answer 1:


You can avoid multiple inheritance here.

Combine both spiders into a single one: if start_urls is passed from the command line, it behaves like a regular spider, parsing only those URLs; otherwise, it crawls the whole site like a CrawlSpider:

from scrapy.contrib.spiders import CrawlSpider, Rule

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor

import json


class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]

        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
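One ordering detail worth noting: CrawlSpider compiles self.rules inside its own __init__, so the instance attributes must be assigned before super().__init__() is called, as the code above does. A plain-Python sketch of that constraint (stand-in classes, not Scrapy itself):

```python
# A base class that "compiles" configuration in __init__ only sees
# attributes assigned before super().__init__() runs -- the same
# constraint CrawlSpider imposes on self.rules.

class Base:
    rules = ()  # class-level default

    def __init__(self):
        # Analogue of CrawlSpider compiling its rules: snapshot them now.
        self.compiled = list(self.rules)

class Correct(Base):
    def __init__(self):
        self.rules = ["rule-a"]   # set BEFORE super().__init__()
        super().__init__()

class TooLate(Base):
    def __init__(self):
        super().__init__()
        self.rules = ["rule-a"]   # set AFTER: compilation already ran

print(Correct().compiled)  # ['rule-a']
print(TooLate().compiled)  # []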

Alternatively, extract the logic inside the parse() method into a library function and call it from both spiders, keeping them as separate, unrelated spiders.
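A minimal, self-contained sketch of that alternative, with stand-in stubs replacing the Scrapy and project imports so only the shape is shown (in the real project, parse_product would run the ItemLoader from the code above):

```python
# The parsing logic lives in one plain function; both spiders delegate
# to it, so neither needs to inherit from the other.

def parse_product(response):
    # Stand-in for the ItemLoader logic: extract a field from a
    # dict-shaped fake response.
    return {"url": response["url"]}

class AtlanticFirearmsSpider:
    # Spider-based: the callback must be named parse.
    def parse(self, response):
        return parse_product(response)

class AtlanticFirearmsCrawlSpider:
    # CrawlSpider-based: the callback can be any name except parse.
    def parse_item(self, response):
        return parse_product(response)

fake = {"url": "http://www.atlanticfirearms.com/some-detail.html"}
print(AtlanticFirearmsSpider().parse(fake))
print(AtlanticFirearmsCrawlSpider().parse_item(fake))
```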



Source: https://stackoverflow.com/questions/28080496/how-can-i-reuse-the-parse-method-of-my-scrapy-spider-based-spider-in-an-inheriti
