Scrapy rules not working when process_request and callback parameter are set

Submitted by 人走茶凉 on 2020-01-03 15:54:52

Question


I have this rule for scrapy CrawlSpider

rules = [
    Rule(LinkExtractor(
             allow=r'/topic/\d+/organize$',
             restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
         process_request='request_tagPage',
         callback='parse_tagPage',
         follow=True),
]

request_tagPage() refers to a function that adds a cookie to each request, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make requests and, once responses are returned, call parse_tagPage() to parse them. However, I realized that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually set parse_tagPage() as the callback inside request_tagPage(), like this:

from scrapy import Request

def request_tagPage(self, request):
    return Request(request.url,
                   meta={"cookiejar": 1},   # attach cookie to the request, otherwise I can't log in
                   headers=self.headers,
                   callback=self.parse_tagPage)  # manually add a callback function

It worked, but now the spider doesn't use the rules to expand its crawling: it closes after crawling the links from start_urls. However, before I manually set parse_tagPage() as the callback inside request_tagPage(), the rules worked. So I am thinking this may be a bug? Is there a way to combine request_tagPage(), which I need in order to attach a cookie to the request; parse_tagPage(), which parses a page; and the rules, which direct the spider to crawl?


Answer 1


Requests generated by CrawlSpider rules use internal callbacks and use meta to do their "magic".
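
For context, the following is a simplified sketch of that rule-following logic (illustrative only, not the actual Scrapy source, whose details vary between versions): each extracted link becomes a Request whose callback is an internal CrawlSpider method and whose meta records which rule produced it, and only then is your process_request hook applied.

from scrapy import Request

class SketchOfCrawlSpider:
    # Illustrative only; the real logic lives in scrapy.spiders.CrawlSpider
    # and its details differ between Scrapy versions.
    def _requests_to_follow(self, response):
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [link for link in rule.link_extractor.extract_links(response)
                     if link.url not in seen]
            for link in links:
                seen.add(link.url)
                # The request carries an internal callback and the rule index in
                # meta; building a brand-new Request inside process_request
                # discards both, which is why the rules stop being applied.
                request = Request(link.url, callback=self._response_downloaded)
                request.meta.update(rule=rule_index, link_text=link.text)
                yield rule.process_request(request)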

I suggest that you don't recreate Requests from scratch in your rules' process_request hooks (or you'll probably end up reimplementing what CrawlSpider already does for you).

Instead, if you just want to add cookies and special headers, you can use the .replace() method on the request that is passed to request_tagPage, so that the "magic" of CrawlSpider is preserved.

Something like this should be enough:

def request_tagPage(self, request):
    tagged = request.replace(headers=self.headers)
    tagged.meta.update(cookiejar=1)
    return tagged
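
For completeness, here is a minimal sketch of how the rule and this process_request hook fit together in one spider. The spider name, allowed_domains, start URL, headers, and the item yielded in parse_tagPage are placeholder assumptions, not the asker's actual values; also note that newer Scrapy versions may pass the originating response as a second argument to process_request.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TopicSpider(CrawlSpider):
    # Sketch only: name, domains, start URL and headers are placeholders.
    name = 'topic_organize'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/topic/19550517/organize']
    headers = {'User-Agent': 'Mozilla/5.0'}

    rules = [
        Rule(LinkExtractor(
                 allow=r'/topic/\d+/organize$',
                 restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
             process_request='request_tagPage',
             callback='parse_tagPage',
             follow=True),
    ]

    def request_tagPage(self, request):
        # Modify the request CrawlSpider built instead of creating a new one,
        # so the internal callback and rule meta survive.
        tagged = request.replace(headers=self.headers)
        tagged.meta.update(cookiejar=1)
        return tagged

    def parse_tagPage(self, response):
        # Placeholder extraction; yield whatever the target pages actually hold.
        yield {'url': response.url, 'title': response.css('title::text').get()}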



Answer 2


I found the problem. CrawlSpider uses its default parse() to apply the rules. So when my custom parse_tagPage() is called, no parse() follows up to keep applying the rules. The solution is simply to call the default parse() from my custom parse_tagPage(). It now looks like this:

def parse_tagPage(self, response):
    # parse the response, get the information I want...
    # save the information into a local file...
    return self.parse(response) # simply calls the default parse() function to enable the rules
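
One caveat: if parse_tagPage() needs to yield scraped items itself, a plain return self.parse(response) no longer fits, because the return value of a generator function is discarded. A small variation (a sketch; the yielded dict is just a placeholder) is to yield the items first and then delegate to the default parse():

def parse_tagPage(self, response):
    # Yield whatever was scraped from this page first (placeholder item)...
    yield {'url': response.url, 'title': response.css('title::text').get()}
    # ...then hand the response back to CrawlSpider's default parse()
    # so the rules keep being applied to the links on this page.
    yield from self.parse(response)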


Source: https://stackoverflow.com/questions/38280133/scrapy-rules-not-working-when-process-request-and-callback-parameter-are-set
