Scrapy not calling any other function after “__init__”

Submitted by 那年仲夏 on 2019-12-12 09:01:41

Question


OS: Ubuntu 16.04. Stack: Scrapy 1.0.3 + Selenium. I'm pretty new to Scrapy and this might sound very basic, but in my spider only "__init__" is getting executed. Any code/function after that is not getting called and the spider just halts.

import time

import scrapy
from selenium import webdriver


class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self,*args,**kwargs):
        self.browser=webdriver.Firefox()
        self.browser.get("http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum")
        print "----------------Going to sleep------------------"
        time.sleep(5)
        # self.parse()

    def __exit__(self):
        print "------------Exiting----------"
        self.browser.quit()

    def parse(self,response):
        print "----------------Inside Parse------------------"
        print "------------Exiting----------"
        self.browser.quit()

The spider gets the browser object, prints "Going to sleep", and just halts. It never enters the parse function.

Following are the contents of the run logs:

----------------inside init----------------
----------------Going to sleep------------------


Answer 1:


There are a few problems you need to address or be aware of:

  1. You're not calling super() in your __init__ method, so none of the inherited class's initialization happens. Scrapy won't do anything (like calling its parse() method), because all of that is set up in scrapy.Spider. (A minimal sketch of this fix follows the list.)

  2. After fixing the above, your parse() method will be called by Scrapy, but it won't be operating on the page you fetched with Selenium. Scrapy has no knowledge of that page and will re-fetch the URL (based on start_urls). The two sources are very likely to differ (often drastically).

  3. You're bypassing almost all of Scrapy's functionality by using Selenium the way you are. All of Selenium's get() calls are executed outside of the Scrapy framework. Middleware won't be applied (cookies, throttling, filtering, etc.), nor will any of the expected/created objects (like request and response) be populated with the data you expect.
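
A minimal sketch of the fix from point 1 only (it does not address points 2 and 3), assuming the same spider as in the question:

import scrapy
from selenium import webdriver


class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialization first,
        # otherwise parse() is never wired up by the framework
        super(CancerForumSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()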

Before you fix all of that, you should consider a couple of better options/alternatives:

  • Create a downloader middleware that handles all of the Selenium-related functionality. Have it intercept request objects right before they hit the downloader, populate new response objects, and return them for processing by the spider (a rough sketch follows this list).
    This isn't optimal, as you're effectively creating your own downloader and short-circuiting Scrapy's. You'll have to re-implement the handling of any settings the downloader usually takes into account and make them work with Selenium.
  • Ditch Selenium and use the Splash HTTP API together with the scrapy-splash middleware for handling JavaScript.
  • Ditch Scrapy altogether and just use Selenium and BeautifulSoup.
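
For the first option, a rough sketch of what such a middleware might look like; the class name SeleniumMiddleware and the choice of Firefox are assumptions for illustration, not anything Scrapy provides:

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware(object):
    """Hypothetical downloader middleware that fetches pages with Selenium."""

    def __init__(self):
        self.browser = webdriver.Firefox()

    def process_request(self, request, spider):
        # Fetch the page with Selenium instead of Scrapy's downloader
        self.browser.get(request.url)
        body = self.browser.page_source.encode('utf-8')
        # Returning a Response here short-circuits the built-in downloader
        # (closing the browser on spider_closed is omitted for brevity)
        return HtmlResponse(url=self.browser.current_url, body=body,
                            encoding='utf-8', request=request)

You would then enable it through DOWNLOADER_MIDDLEWARES in settings.py. Note that every request now goes through Selenium, so the downloader features mentioned above are lost unless you re-implement them.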



Answer 2:


Scrapy is useful when you have to crawl a large number of pages. Selenium is normally useful for scraping when you need the DOM source after the JavaScript has run. If that's your situation, there are two main ways to combine Selenium and Scrapy. One is to write a download handler, like the one you can find here.

The code goes as:

# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't respond to a window switch until the page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()

Suppose your scraper is called "scraper". If you put the code above in a file called handlers.py at the root of the "scraper" folder, you can add this to your settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}

Another way is to write a downloader middleware, as described here. A downloader middleware has the downside of preventing some key features from working out of the box, such as caching and retries.
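
For completeness, enabling such a middleware would look roughly like this in settings.py; the module path scraper.middlewares.SeleniumMiddleware is an assumption matching the sketch in the first answer:

DOWNLOADER_MIDDLEWARES = {
    'scraper.middlewares.SeleniumMiddleware': 543,
}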

In any case, starting the Selenium webdriver in the __init__ of the Scrapy spider is not the usual way to go.



Source: https://stackoverflow.com/questions/40274984/scrapy-not-calling-any-other-function-after-init
