Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

忘了有多久 2021-02-09 10:46

I have the following code, which is only partially working:

class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    star         


        
3 Answers
  • 2021-02-09 11:23

    I've had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider. It would visit only the start URL and then close. The only way I managed to get it to work was to not use the scrapy-splash plugin and instead use the 'process_links' method to prepend the Splash HTTP API URL to all of the links Scrapy collects. Then I made other adjustments to compensate for the new issues that arise from this method. Here's what I did:

    You'll need these two tools to put together the Splash URL and then take it apart if you intend to store it somewhere.

    from urllib.parse import urlencode, parse_qs
    

    With the Splash URL prepended to every link, Scrapy will filter them all out as 'off site domain requests', so we make 'localhost' the allowed domain.

    allowed_domains = ['localhost']
    start_urls = ['https://www.example.com/']
    

    However, this poses a problem because then we may end up endlessly crawling the web when we only want to crawl one site. Let's fix this with the LinkExtractor in the spider's Rule. By only extracting links from our desired domain, we get around the offsite request problem. The two lines below are arguments to Rule(...).

    LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
    process_links='process_links',
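
A note on the pattern above: LinkExtractor applies `allow` patterns with a regular-expression search, so this pattern also accepts URLs that merely contain the domain somewhere, for example in a query string. The sketch below illustrates this and shows one possible stricter host check. `is_on_site` is a hypothetical helper, not part of Scrapy; LinkExtractor's own `allow_domains` argument may be a simpler way to get the same effect.

```python
import re
from urllib.parse import urlparse

# The pattern from the answer, built the same way.
ALLOW_PATTERN = r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')


def is_on_site(url, domain="example.com"):
    """Hypothetical helper: True if the URL's host is `domain` or a subdomain."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/thread/1",
        "https://other.org/?ref=example.com",
    ):
        # Print the regex verdict next to the host-based verdict.
        print(url, bool(re.search(ALLOW_PATTERN, url)), is_on_site(url))
```

The second URL passes the regex but fails the host check, which is the case that could let the crawl wander off site.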
    

    Here's the process_links method. The dictionary passed to urlencode is where you'll put all of your Splash arguments.

    def process_links(self, links):
        for link in links:
            if "http://localhost:8050/render.html?&" not in link.url:
                link.url = "http://localhost:8050/render.html?&" + urlencode({'url':link.url,
                                                                              'wait':2.0})
        return links
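
To see what the method above produces without running a crawl, here is a minimal standalone sketch. The `Link` dataclass is a hypothetical stand-in for Scrapy's link objects (only the `url` attribute is modelled), `wrap_links` is an illustrative name, and the endpoint is assumed to be a local Splash instance on port 8050 as in the answer. It uses `startswith` rather than the `not in` test, which is a small deviation from the original.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Assumed Splash endpoint, taken from the answer; adjust host/port as needed.
SPLASH_PREFIX = "http://localhost:8050/render.html?&"


@dataclass
class Link:
    """Hypothetical stand-in for a Scrapy link (only the url attribute)."""
    url: str


def wrap_links(links, wait=2.0):
    """Prefix each link with the Splash render.html endpoint, at most once."""
    for link in links:
        if not link.url.startswith(SPLASH_PREFIX):
            # urlencode percent-escapes the target URL so it survives as one argument.
            link.url = SPLASH_PREFIX + urlencode({"url": link.url, "wait": wait})
    return links


if __name__ == "__main__":
    links = wrap_links([Link("https://www.example.com/page?id=1")])
    print(links[0].url)
```

Calling `wrap_links` a second time on the same list leaves the URLs unchanged, so links that pass through the rule more than once are not double-wrapped.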
    

    Finally, to take the original URL back out of the Splash URL, use the parse_qs function.

    parse_qs(response.url)['url'][0] 
    

    One final note about this approach: you'll notice an '&' right at the beginning of the query in the Splash URL (...render.html?&). Because parse_qs is applied to the whole URL here, the '&' separates the endpoint from the first argument, so extracting the actual URL works consistently no matter what order the arguments are in when you call urlencode.
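
The effect of the leading '&' can be checked with the standard library alone. The sketch below compares the two forms; `original_url` is a hypothetical helper that parses only the query string via `urlparse`, which should work with or without the extra '&'.

```python
from urllib.parse import parse_qs, urlencode, urlparse

query = urlencode({"url": "https://www.example.com/page", "wait": 2.0})
with_amp = "http://localhost:8050/render.html?&" + query
without_amp = "http://localhost:8050/render.html?" + query

# With the leading '&', the endpoint part contains no '=' and parse_qs drops it.
print(parse_qs(with_amp)["url"][0])

# Without it, the endpoint is glued onto the first key name, so 'url' is missing.
print(list(parse_qs(without_amp)))


def original_url(splash_url):
    """Hypothetical helper: parse only the query string of the Splash URL."""
    return parse_qs(urlparse(splash_url).query)["url"][0]


print(original_url(with_amp), original_url(without_amp))
```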

  • 2021-02-09 11:37

    Use the code below (just copy and paste):

    restrict_xpaths=('//a[contains(text(), "Next Page")]')
    

    Instead of

    restrict_xpaths=("//a[contains(text(), 'Next Page')]")
    
  • 2021-02-09 11:41

    Seems to be related to https://github.com/scrapy-plugins/scrapy-splash/issues/92

    Personally, I use dont_process_response=True so that the response is an HtmlResponse (which the code in _requests_to_follow requires).

    I also redefine the _build_request method in my spider, like so:

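    from scrapy_splash import SplashRequest  # at module level

    def _build_request(self, rule, link):
        # Build a SplashRequest instead of a plain Request for each followed link.
        r = SplashRequest(url=link.url, callback=self._response_downloaded, args={'wait': 0.5}, dont_process_response=True)
        r.meta.update(rule=rule, link_text=link.text)
        return r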
    

    In the GitHub issue, some users just redefine the _requests_to_follow method in their class.
