Question
I have this rule for a Scrapy CrawlSpider:
rules = [
    Rule(LinkExtractor(
            allow=r'/topic/\d+/organize$',
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'
        ),
        process_request='request_tagPage', callback="parse_tagPage", follow=True)
]
request_tagPage() refers to a function that adds cookies to requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make requests and, once responses are returned, call parse_tagPage() to parse them. However, I realized that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually added parse_tagPage() as a callback in request_tagPage(), like this:
def request_tagPage(self, request):
    # attach the cookie to the request, otherwise I can't log in
    return Request(request.url,
                   meta={"cookiejar": 1},
                   headers=self.headers,
                   callback=self.parse_tagPage)  # manually add a callback function
It worked, but now the spider doesn't use the rules to expand its crawling. It closes after crawling the links from start_urls. However, before I manually set parse_tagPage() as the callback in request_tagPage(), the rules worked. So I am thinking this may be a bug? Is there a way to enable request_tagPage(), which I need to attach cookies to the request; parse_tagPage(), which is used to parse a page; and rules, which direct the spider to crawl?
Answer 1:
Requests generated by CrawlSpider rules use internal callbacks and use meta to do their "magic".
I suggest that you don't recreate Requests from scratch in your rules' process_request hooks (or you'll probably end up reimplementing what CrawlSpider already does for you).
Instead, if you just want to add cookies and special headers, you can use the .replace() method on the request that is passed to request_tagPage, so that the "magic" of CrawlSpider is preserved.
Something like this should be enough:
def request_tagPage(self, request):
    tagged = request.replace(headers=self.headers)
    tagged.meta.update(cookiejar=1)
    return tagged
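For illustration, here is a small standalone sketch (the URL, callback and meta below are placeholders, not from the original post) showing that Request.replace() keeps every attribute you don't override, which is why the internal callback and meta survive:

from scrapy import Request

def some_callback(response):
    pass

# a placeholder request standing in for the one CrawlSpider builds internally
original = Request("http://example.com/topic/123/organize",
                   callback=some_callback,
                   meta={"cookiejar": 1})
tagged = original.replace(headers={"User-Agent": "my-bot"})

assert tagged.callback is some_callback   # the callback is copied over by .replace()
assert tagged.meta == original.meta       # meta is copied over as well
# only the headers were changed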
Answer 2:
I found the problem. CrawlSpider uses its default parse() to apply the rules. So when my custom parse_tagPage() is called, there is no parse() following up to keep applying the rules. The solution is simply to call the default parse() from my custom parse_tagPage(). It now looks like this:
def parse_tagPage(self, response):
    # parse the response, get the information I want...
    # save the information into a local file...
    return self.parse(response)  # simply call the default parse() to re-enable the rules
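Putting the question's rule and the two methods together, the working spider would look roughly like this; the class name, allowed_domains, start_urls and headers below are placeholders, not from the original post, and the process_request signature is as written for the Scrapy version used in the question:

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TopicSpider(CrawlSpider):
    name = "topic"
    allowed_domains = ["example.com"]                      # placeholder
    start_urls = ["http://example.com/topic/1/organize"]   # placeholder
    headers = {"User-Agent": "Mozilla/5.0"}                # placeholder headers

    rules = [
        Rule(LinkExtractor(
                allow=r'/topic/\d+/organize$',
                restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'
            ),
            process_request='request_tagPage', callback="parse_tagPage", follow=True)
    ]

    def request_tagPage(self, request):
        # rebuild the request with the cookiejar and headers, as in the question
        return Request(request.url,
                       meta={"cookiejar": 1},
                       headers=self.headers,
                       callback=self.parse_tagPage)

    def parse_tagPage(self, response):
        # parse the response, save the information...
        # then hand the response to the default parse() so the rules
        # keep being applied to newly crawled pages
        return self.parse(response)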
Source: https://stackoverflow.com/questions/38280133/scrapy-rules-not-working-when-process-request-and-callback-parameter-are-set