Question
I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_urls responses to feed my crawl spider. Unfortunately my crawl spider stops after rendering the first response. Any idea what is going wrong?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class VideoSpider(CrawlSpider):
    start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_items', process_request='use_splash'),
    )

    def use_splash(self, request):
        # Ask Splash to render the pages followed by the rule
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': 0.5,
            }
        }
        return request

    def start_requests(self):
        # Render the start URLs with Splash as well
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse_items(self, response):
        data = response.body
        print(data)
Answer 1:
Use SplashRequest instead of scrapy.Request. Check out my answer to CrawlSpider with Splash.
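A minimal sketch of what that looks like for the start requests, assuming the scrapy-splash middlewares are enabled in settings.py (the spider name is illustrative, and only the start URLs are rendered here; the followed links are what Answer 2 deals with below):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class VideoSpider(CrawlSpider):
    name = 'video'  # illustrative name, not shown in the question
    start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_items'),
    )

    def start_requests(self):
        # SplashRequest sends the URL through the Splash HTTP API, so the
        # start pages arrive already rendered before CrawlSpider's parse()
        # extracts the links defined by the rules.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html',
                                args={'wait': 0.5})

    def parse_items(self, response):
        print(response.body)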
Answer 2:
def use_splash(self, request):
    request.meta['splash'] = {
        'endpoint': 'render.html',
        'args': {
            'wait': 0.5,
        }
    }
    return request
You should amend it to
def use_splash(self, request):
    return SplashRequest(xxxxxx)
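As a rough guess at what could go in place of the xxxxxx (the argument names below are standard SplashRequest parameters from scrapy-splash; whether this plays well with CrawlSpider's link following is exactly what is in question here):

from scrapy_splash import SplashRequest  # at the top of the spider module

def use_splash(self, request):
    # Re-wrap the plain Request built by the rule into a SplashRequest,
    # keeping its URL, callback and meta, and asking Splash to render it.
    return SplashRequest(request.url, callback=request.callback,
                         endpoint='render.html', args={'wait': 0.5},
                         meta=request.meta)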
Or you can override this CrawlSpider function in your spider:
def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r
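A minimal sketch of such an override, assuming scrapy-splash is installed; note that _build_request is an internal CrawlSpider method, so its signature may differ between Scrapy versions:

from scrapy_splash import SplashRequest

class VideoSpider(CrawlSpider):
    # ... start_urls, rules and callbacks as above ...

    def _build_request(self, rule, link):
        # Same shape as CrawlSpider._build_request, but returns a
        # SplashRequest so every followed link is rendered by Splash.
        r = SplashRequest(url=link.url, callback=self._response_downloaded,
                          endpoint='render.html', args={'wait': 0.5})
        r.meta.update(rule=rule, link_text=link.text)
        return r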
I can't guarantee it will work. I'm watching this, too.
Source: https://stackoverflow.com/questions/37978365/crawlspider-with-splash-getting-stuck-after-first-url