I have the following code that is partially working,
class ThreadSpider(CrawlSpider):
name = 'thread'
allowed_domains = ['bbs.example.com']
star
I've had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider. It would visit only the start URL and then close. The only way I managed to get it to work was to not use the scrapy-splash plugin and instead use the 'process_links' method to prepend the Splash HTTP API URL to all of the links Scrapy collects. Then I made other adjustments to compensate for the new issues that arise from this method. Here's what I did:
You'll need these two tools to put together the Splash URL and then take it apart again if you intend to store the original URL somewhere.
from urllib.parse import urlencode, parse_qs
With the Splash URL prepended to every link, Scrapy will filter them all out as offsite requests, so we make 'localhost' the allowed domain.
allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']
However, this poses a problem because then we may end up endlessly crawling the web when we only want to crawl one site. Let's fix this with the LinkExtractor rules. By only scraping links from our desired domain, we get around the offsite request problem.
LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
process_links='process_links',
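One thing worth checking is how loosely that pattern matches. Here is a minimal sketch using plain `re`, assuming the extractor applies the pattern as a search over the full URL. The `strict` pattern and the sample URLs are hypothetical and only there for comparison; they are not part of my spider.

```python
import re

# The pattern from the rule above, built the same way.
loose = re.compile(r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com'))

# Hypothetical tighter variant: anchor on scheme and host, escape the dot.
strict = re.compile(
    r'^https?://([^/]+\.)?{}(/|$)'.format(re.escape('example.com'))
)

# Made-up sample links for illustration.
urls = [
    'https://www.example.com/page',
    'https://example.com',
    'https://notexample.com/',
    'https://other.org/?ref=example.com',
]

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]

print(loose_hits)   # all four URLs, including the two foreign hosts
print(strict_hits)  # only the two example.com URLs
```

If the loose behaviour turns out to be a problem, LinkExtractor's `allow_domains` argument may be a simpler way to restrict links to one site than tuning the regex.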
Here's the process_links method. The dictionary passed to urlencode is where you'll put all of your Splash arguments.
def process_links(self, links):
    for link in links:
        # Skip links that already point at the Splash endpoint.
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url': link.url,
                                                                          'wait': 2.0})
    return links
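To see what the rewritten links look like without running a crawl, the same wrapping can be sketched as a standalone function. The names `SPLASH_RENDER` and `to_splash_url` are made up for this illustration, and it checks the prefix with `startswith` rather than the `not in` test used in the method.

```python
from urllib.parse import urlencode

# Same endpoint string as in process_links, including the leading '&'.
SPLASH_RENDER = "http://localhost:8050/render.html?&"


def to_splash_url(url, wait=2.0):
    """Wrap url in a Splash render.html request unless it is already wrapped."""
    if url.startswith(SPLASH_RENDER):
        return url
    # urlencode percent-encodes the target, so its own '?', '&' and '='
    # cannot be confused with the Splash arguments.
    return SPLASH_RENDER + urlencode({'url': url, 'wait': wait})


wrapped = to_splash_url('https://www.example.com/a?b=1&c=2')
print(wrapped)
# http://localhost:8050/render.html?&url=https%3A%2F%2Fwww.example.com%2Fa%3Fb%3D1%26c%3D2&wait=2.0
```

Calling it a second time on an already wrapped URL returns it unchanged, which is the same guard the `if` in process_links provides.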
Finally, to get the original URL back out of the Splash URL, use the parse_qs function.
parse_qs(response.url)['url'][0]
One final note about this approach: you'll notice that I have an '&' right at the beginning of the query in the Splash URL (...render.html?&). Because parse_qs is applied here to the whole URL rather than just its query string, everything up to the first '&' becomes part of the first key. The leading '&' ensures that 'url' is always parsed as its own key, no matter what order the arguments are in when you call urlencode.
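The effect of the leading '&' can be checked in isolation. This is a small sketch with a made-up target URL; the `urlsplit` variant at the end is an alternative I'm adding for comparison, not what the spider above does.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

target = 'https://www.example.com/thread?id=7'  # hypothetical page
query = urlencode({'url': target, 'wait': 2.0})

with_amp = 'http://localhost:8050/render.html?&' + query
without_amp = 'http://localhost:8050/render.html?' + query

# With the leading '&', the endpoint part contains no '=' and is dropped,
# so 'url' is a clean key.
recovered = parse_qs(with_amp)['url'][0]

# Without it, the endpoint is glued onto the first key.
keys_without_amp = sorted(parse_qs(without_amp))

# Alternative: isolate the query string first, then the '&' is not needed.
recovered_via_split = parse_qs(urlsplit(without_amp).query)['url'][0]

print(recovered)
print(keys_without_amp)
print(recovered_via_split)
```

Either route gives back the original URL; the '&' trick just avoids the extra `urlsplit` step.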