OS: Ubuntu 16.04
Stack: Scrapy 1.0.3 + Selenium

I'm pretty new to Scrapy and this might sound very basic, but in my spider only `__init__` is getting executed. Any code/function after that is not called and the spider just halts.
```python
class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self, *args, **kwargs):
        self.browser = webdriver.Firefox()
        self.browser.get("http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum")
        print "----------------Going to sleep------------------"
        time.sleep(5)
        # self.parse()

    def __exit__(self):
        print "------------Exiting----------"
        self.browser.quit()

    def parse(self, response):
        print "----------------Inside Parse------------------"
        print "------------Exiting----------"
        self.browser.quit()
```
The spider creates the browser object, prints "Going to sleep", and just halts. It never gets inside the parse function.
Following are the contents of the run logs:
```
----------------inside init----------------
----------------Going to sleep------------------
```
There are a few problems you need to address or be aware of:

1. You're not calling `super()` in the `__init__` method, so none of the inherited class's initialization happens. Scrapy won't do anything (like calling its `parse()` method), because all of that is set up in `scrapy.Spider`.

2. After fixing the above, your `parse()` method will be called by Scrapy, but it won't be operating on your Selenium-fetched webpage. Scrapy has no knowledge of that page whatsoever and will re-fetch the URL (based on `start_urls`). It's very likely that these two sources will differ (often drastically).

3. Using Selenium the way you are, you're going to bypass almost all of Scrapy's functionality. All of Selenium's `get()` calls will be executed outside of the Scrapy framework. Middleware won't be applied (cookies, throttling, filtering, etc.), nor will any of the expected/created objects (like `request` and `response`) be populated with the data you expect.
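For illustration, here is a minimal sketch of point 1 only (not your full spider, and it deliberately leaves the Selenium fetching out of `parse()`): the key change is the `super()` call, so that `scrapy.Spider`'s own initialization still runs.

```python
import scrapy
from selenium import webdriver


class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = ["http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"]

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialization (name, start_urls, etc.)
        # before doing anything Selenium-related.
        super(CancerForumSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()

    def parse(self, response):
        # Now called by Scrapy -- but note that `response` was fetched by
        # Scrapy's downloader, not by the Firefox instance above (point 2).
        self.log("parse() reached: %s" % response.url)
```

This alone gets `parse()` called, but it does nothing about points 2 and 3.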
Before you fix all of that, you should consider a couple of better options/alternatives:

- Create a downloader middleware that handles all Selenium-related functionality. Have it intercept `request` objects right before they hit the downloader, populate new `response` objects, and return them for processing by the spider. This isn't optimal, as you're effectively creating your own downloader and short-circuiting Scrapy's: you'll have to re-implement the handling of any settings the downloader usually takes into account and make them work with Selenium. (A rough sketch of such a middleware follows this list.)
- Ditch Selenium and use the Splash HTTP API with the scrapy-splash middleware for handling JavaScript. (A minimal settings sketch also follows this list.)
- Ditch Scrapy altogether and just use Selenium and BeautifulSoup.
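To make the first option concrete, here is a rough sketch of such a downloader middleware (the class name and structure are mine, not part of any library): it fetches every request with Selenium and returns an `HtmlResponse`, which makes Scrapy skip its own downloader for that request.

```python
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware(object):
    """Sketch of a downloader middleware that renders pages with Selenium."""

    def __init__(self):
        # One shared Firefox instance; a real implementation would also hook
        # the spider_closed signal and call self.driver.quit() there.
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # Fetch the page with Selenium instead of Scrapy's downloader.
        self.driver.get(request.url)
        body = self.driver.page_source
        # Returning a Response from process_request() short-circuits the
        # downloader and hands this response straight to the spider.
        return HtmlResponse(
            url=self.driver.current_url,
            body=body,
            encoding="utf-8",
            request=request,
        )
```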
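For the second option, the scrapy-splash README documents settings along these lines (this assumes a Splash instance running locally on port 8050); treat it as a sketch to adapt rather than a drop-in config.

```python
# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

The spider then yields `scrapy_splash.SplashRequest` objects instead of plain `Request` objects, so pages are rendered by Splash before they reach `parse()`.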
Scrapy is useful when you have to crawl a large number of pages. Selenium is normally useful when you need the DOM source after the JavaScript has run. If that's your situation, there are two main ways to combine Selenium and Scrapy. One is to write a download handler, like the one you can find here.

The code goes as follows:
```python
# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't respond to a window switch until the page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response headers in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
```
Suppose your scraper project is called "scraper". If you put the code above in a file called handlers.py at the root of the "scraper" folder, you can add this to your settings.py:
```python
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}
```
Another way is to write a download middleware, as described here. The download middleware has the downside of preventing some key features from working out of the box, such as cache and retries.
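If you go the download-middleware route, you enable it through DOWNLOADER_MIDDLEWARES in settings.py. The module path and priority below are only an assumption, based on saving a Selenium middleware (like the sketch earlier in this thread) as scraper/middlewares.py:

```python
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical path and priority -- adjust to wherever your middleware lives.
    'scraper.middlewares.SeleniumMiddleware': 543,
}
```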
In any case, starting the Selenium webdriver in the `__init__` of the Scrapy spider is not the usual way to go.