Question
I am running a CrawlSpider and I want to implement some logic to stop following some of the links mid-run, by passing a function to process_request.
This function uses the spider's class variables to keep track of the current state, and depending on it (and on the referrer URL), links either get dropped or continue to be processed:
from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        <some code>

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request
I think that if I were to run several spiders on the same machine, they would all use the same class variables, which is not my intention.
Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?
I could probably work around it with a dictionary of values keyed by process ID, but that would be ugly...
Answer 1:
I think spider arguments would be the solution in your case.
When invoking Scrapy with scrapy crawl some_spider, you could add arguments like scrapy crawl some_spider -a foo=bar, and the spider would receive the values via its constructor, e.g.:
import scrapy


class SomeSpider(scrapy.Spider):
    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo
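The same holds if you launch spiders from a script instead of the command line: keyword arguments passed to CrawlerProcess.crawl() are forwarded to the spider's constructor, so each spider instance gets its own value. A minimal sketch, assuming the settings dict and the foo value are just placeholders:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
# Keyword arguments to crawl() are passed on to the spider's constructor,
# so every spider started here carries its own foo.
process.crawl(SomeSpider, foo='bar')
process.start()  # blocks until all crawls are finished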
What's more, as scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to explicitly override the __init__ method; you can just access the .foo attribute. :)
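Applied to the CrawlSpider above, that would mean dropping the foo class variable, starting the crawl with something like scrapy crawl bitsy -a foo=6, and reading self.foo inside filter_requests. A minimal sketch under those assumptions (note that arguments passed with -a arrive as strings, and that returning None from a rule's process_request is the documented way to drop a link):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        pass  # your existing parsing logic

    def filter_requests(self, request):
        # self.foo was set on this instance from the -a foo=... argument,
        # so concurrent spiders no longer share state through a class variable.
        # Command-line arguments arrive as strings, hence the '6' comparison.
        if self.foo == '6':
            return None  # drop the link
        return request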
Source: https://stackoverflow.com/questions/39186207/how-to-add-instance-variable-to-scrapy-crawlspider