Question
I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:
from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
import json
import datetime
import re

class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
I can call it from the command line like so, and it does a wonderful job:
scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'
However, I'm trying to add a CrawlSpider-based spider for crawling the entire site that inherits from it and re-uses the parse method logic. My first attempt looked like this:
class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )
Running this spider with scrapy crawl atlantic_firearms_crawler crawls the site but never parses any items. I think it's because CrawlSpider apparently has its own definition of parse, so somehow I'm screwing things up.
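For reference, CrawlSpider's own parse (paraphrased from memory from the 0.24-era source, not an exact quote; _parse_response is the internal method that applies the rules and follows links) looks roughly like this, which would explain why overriding it interferes with the crawling:

def parse(self, response):
    # CrawlSpider reserves this name to drive its rule machinery
    return self._parse_response(response, self.parse_start_url,
                                cb_kwargs={}, follow=True)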
When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully, crawling the whole site and parsing items successfully. But then if I try to run my original atlantic_firearms spider again, it errors out with NotImplementedError, apparently because Spider-based spiders expect their default callback to be named parse.
What's the best way for me to re-use my logic between these spiders so that I can both feed it a JSON array of start_urls and do full-site crawls?
Answer 1:
You can avoid multiple inheritance here by combining both spiders into a single one. If start_urls is passed from the command line, it behaves like a regular spider, parsing only those URLs; otherwise it behaves like a CrawlSpider and crawls the whole site:
from scrapy.contrib.spiders import CrawlSpider, Rule
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor
import json

class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            # URLs passed on the command line: parse just those pages,
            # with no crawling rules
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            # no URLs passed: crawl the whole site via the rules
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
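Both use cases then run under the same spider name, for example (reusing one of the URLs from the question):

scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html"]'
scrapy crawl atlantic_firearms

The first invocation parses only the given URLs; the second crawls the whole site through the rules.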
Or, alternatively, extract the logic inside the parse() method into a library function and call it from both spiders, which would then be unrelated, separate spiders.
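A minimal sketch of that second option, assuming an illustrative shared module foo/parsers.py (the module and function names are hypothetical, not from the original answer):

from scrapy.contrib.loader import ItemLoader
from foo.items import AtlanticFirearmsItem

def load_product(response):
    # shared ItemLoader logic; both spiders call this from their callbacks
    l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
    return l.load_item()

Each spider's callback (parse in the Spider, parse_item in the CrawlSpider) then reduces to return load_product(response).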
Source: https://stackoverflow.com/questions/28080496/how-can-i-reuse-the-parse-method-of-my-scrapy-spider-based-spider-in-an-inheriti