Can scrapy be used to scrape dynamic content from websites that are using AJAX?

星月不相逢 2020-11-21 17:48

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website.

8 Answers
  • 2020-11-21 18:18

    Many times when crawling we run into problems where content rendered on the page is generated with JavaScript, so Scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).

    However, if you use Scrapy along with the web testing framework Selenium, you can crawl anything that is displayed in a normal web browser.

    Some things to note:

    • You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. This is also just a template crawler; you could get much crazier and more advanced, but I just wanted to show the basic idea. As the code stands now, you will be making two requests for any given URL: one made by Scrapy and the other made by Selenium. There are ways around this so that Selenium makes the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.

    • This is quite powerful, because now you have the entire rendered DOM available to crawl and you can still use all the nice crawling features in Scrapy. This will make crawling slower, of course, but depending on how much you need the rendered DOM it might be worth the wait.

      import time

      from scrapy.contrib.spiders import CrawlSpider, Rule
      from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
      from scrapy.selector import HtmlXPathSelector
      from scrapy.http import Request
      from scrapy.item import Item
      
      from selenium import selenium
      
      class SeleniumSpider(CrawlSpider):
          name = "SeleniumSpider"
          start_urls = ["http://www.domain.com"]
      
          rules = (
              Rule(SgmlLinkExtractor(allow=(r'\.html', )), callback='parse_page', follow=True),
          )
      
          def __init__(self):
              CrawlSpider.__init__(self)
              self.verificationErrors = []
              self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
              self.selenium.start()
      
          def __del__(self):
              self.selenium.stop()
              print self.verificationErrors
      
          def parse_page(self, response):
              item = Item()  # stand-in for your project's item class
      
              hxs = HtmlXPathSelector(response)
              #Do some XPath selection with Scrapy
              hxs.select('//div').extract()
      
              sel = self.selenium
              sel.open(response.url)
      
              #Wait for javascript to load in Selenium
              time.sleep(2.5)
      
              #Do some crawling of javascript created content with Selenium
              sel.get_text("//div")
              yield item
      
      # Snippet imported from snippets.scrapy.org (which no longer works)
      # author: wynbennett
      # date  : Jun 21, 2011
      

    Reference: http://snipplr.com/view/66998/
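
    Since Selenium RC and the old scrapy.contrib modules are long gone, here is a rough modern sketch of the same idea, assuming Selenium 4+ (WebDriver instead of RC), a recent Scrapy, and Chrome; the XPath and the yielded fields are only placeholders:

      import time

      import scrapy
      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options


      class SeleniumWebDriverSpider(scrapy.Spider):
          name = "selenium_webdriver_spider"
          start_urls = ["http://www.domain.com"]

          def __init__(self, *args, **kwargs):
              super().__init__(*args, **kwargs)
              options = Options()
              options.add_argument("--headless=new")  # run Chrome without a window
              self.driver = webdriver.Chrome(options=options)

          def parse(self, response):
              # Re-open the page in the browser so its JavaScript can render.
              self.driver.get(response.url)
              time.sleep(2.5)  # crude wait; WebDriverWait is the robust option
              # Hand the rendered DOM back to Scrapy's selectors.
              rendered = scrapy.Selector(text=self.driver.page_source)
              for div in rendered.xpath('//div'):
                  yield {"text": div.xpath('normalize-space()').get()}

          def closed(self, reason):
              self.driver.quit()

    This keeps the caveat from the original answer: each URL is still fetched twice, once by Scrapy and once by the browser.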

  • 2020-11-21 18:18

    Another solution is to implement a download handler or downloader middleware (see the Scrapy docs for more information on downloader middleware). The following is an example class using Selenium with the headless PhantomJS webdriver:

    1) Define the class in your middlewares.py script.

    from selenium import webdriver
    from scrapy.http import HtmlResponse
    
    class JsDownload(object):
    
        @check_spider_middleware  # optional wrapper, defined under "Optional Addon" below
        def process_request(self, request, spider):
            driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
            driver.quit()  # don't leave a PhantomJS process behind for every request
            return HtmlResponse(request.url, encoding='utf-8', body=body)
    

    2) Add the JsDownload class to the DOWNLOADER_MIDDLEWARES setting in settings.py (the dotted path must match where the class lives in your project):

    DOWNLOADER_MIDDLEWARES = {'MyProj.middlewares.JsDownload': 500}
    

    3) Integrate the HtmlResponse in your_spider.py. Decoding the response body gives you the rendered output.

    class Spider(CrawlSpider):
        # define unique name of spider
        name = "spider"
    
        start_urls = ["https://www.url.de"]
    
        def parse(self, response):
            # initialize items
            item = CrawlerItem()
    
            # store the rendered page delivered by the middleware as an item field
            item["js_enabled"] = response.body.decode("utf-8")
            yield item
    

    Optional Addon:
    I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

    import functools
    
    from scrapy import log
    
    
    def check_spider_middleware(method):
        @functools.wraps(method)
        def wrapper(self, request, spider):
            msg = '%%s %s middleware step' % (self.__class__.__name__,)
            if self.__class__ in spider.middleware:
                spider.log(msg % 'executing', level=log.DEBUG)
                return method(self, request, spider)
            else:
                spider.log(msg % 'skipping', level=log.DEBUG)
                return None
    
        return wrapper
    

    For the wrapper to work, all spiders must have at minimum:

    middleware = set([])
    

    To include a middleware:

    middleware = set([JsDownload])  # import JsDownload from your middlewares module
    

    Advantage:
    The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands the response off to the spider; the spider then makes a brand-new request in its parse_page function -- that's two requests for the same content.
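
    PhantomJS itself is no longer maintained, but the same downloader-middleware pattern works with headless Chrome. A minimal sketch under that assumption (Selenium 4+, a recent Scrapy, wrapper omitted):

    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    
    class HeadlessChromeDownload(object):
        """Downloader middleware that renders pages with headless Chrome."""
    
        def __init__(self):
            options = Options()
            options.add_argument("--headless=new")
            # Reuse one browser for the whole crawl instead of one per request.
            self.driver = webdriver.Chrome(options=options)
    
        def process_request(self, request, spider):
            self.driver.get(request.url)
            body = self.driver.page_source.encode("utf-8")
            # Returning a response here short-circuits Scrapy's own download.
            return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)

    Enable it in DOWNLOADER_MIDDLEWARES exactly like the PhantomJS version above.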

  • 2020-11-21 18:21

    Here is a simple example of Scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

    All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...):

    [screenshot: the guestbook messages as rendered on the page]

    When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX. But I can use Firebug in Mozilla Firefox (or an equivalent tool in another browser) to analyze the HTTP request that generates the messages on the web page:

    [screenshot: Firebug inspecting the page]

    It doesn't reload the whole page, only the parts of the page that contain messages. To trigger it, I click an arbitrary page number at the bottom:

    [screenshot: the pagination controls at the bottom of the guestbook]

    And I observe the HTTP request that is responsible for the message body:

    [screenshot: the XHR request captured in Firebug]

    Afterwards, I analyze the headers of the request (I should note that I will extract this URL from the page source, from its var section; see the code below):

    [screenshot: the request headers]

    And the form data of the request (the HTTP method is "POST"):

    [screenshot: the POST form data]

    And the content of the response, which is JSON:

    [screenshot: the JSON response body]

    It contains all the information I'm looking for.

    From here, I must implement all this knowledge in Scrapy. Let's define the spider for this purpose:

    import re
    
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest
    
    
    class spider(BaseSpider):
        name = 'RubiGuesst'
        start_urls = ['http://www.rubin-kazan.ru/guestbook.html']
    
        def parse(self, response):
            # The AJAX URL is stored in a JavaScript variable in the page source.
            url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
            page = 0  # index of the message page to request
            yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                              formdata={'page': str(page + 1), 'uid': ''})
    
        def RubiGuessItem(self, response):
            json_file = response.body
    

    In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.
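
    To actually pull fields out of that JSON, the RubiGuessItem callback can decode the body (this would replace the stub inside the spider class). A minimal sketch; the key names below are placeholders, not the real rubin-kazan.ru schema:

    import json
    
    def RubiGuessItem(self, response):
        data = json.loads(response.body)
        # Iterate over whatever list the JSON actually contains; "messages",
        # "author", "date" and "text" are illustrative names only.
        for message in data.get("messages", []):
            yield {
                "author": message.get("author"),
                "date": message.get("date"),
                "text": message.get("text"),
            }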

  • 2020-11-21 18:25

    I was using a custom downloader middleware, but wasn't very happy with it, as I didn't manage to make the cache work with it.

    A better approach was to implement a custom download handler.

    There is a working example here. It looks like this:

    # encoding: utf-8
    from __future__ import unicode_literals
    
    from scrapy import signals
    from scrapy.signalmanager import SignalManager
    from scrapy.responsetypes import responsetypes
    from scrapy.xlib.pydispatch import dispatcher
    from selenium import webdriver
    from six.moves import queue
    from twisted.internet import defer, threads
    from twisted.python.failure import Failure
    
    
    class PhantomJSDownloadHandler(object):
    
        def __init__(self, settings):
            self.options = settings.get('PHANTOMJS_OPTIONS', {})
    
            max_run = settings.get('PHANTOMJS_MAXRUN', 10)
            self.sem = defer.DeferredSemaphore(max_run)
            self.queue = queue.LifoQueue(max_run)
    
            SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)
    
        def download_request(self, request, spider):
            """use semaphore to guard a phantomjs pool"""
            return self.sem.run(self._wait_request, request, spider)
    
        def _wait_request(self, request, spider):
            try:
                driver = self.queue.get_nowait()
            except queue.Empty:
                driver = webdriver.PhantomJS(**self.options)
    
            driver.get(request.url)
            # ghostdriver won't response when switch window until page is loaded
            dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
            dfd.addCallback(self._response, driver, spider)
            return dfd
    
        def _response(self, _, driver, spider):
            body = driver.execute_script("return document.documentElement.innerHTML")
            if body.startswith("<head></head>"):  # cannot access response header in Selenium
                body = driver.execute_script("return document.documentElement.textContent")
            url = driver.current_url
            respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
            resp = respcls(url=url, body=body, encoding="utf-8")
    
            response_failed = getattr(spider, "response_failed", None)
            if response_failed and callable(response_failed) and response_failed(resp, driver):
                driver.close()
                return defer.fail(Failure())
            else:
                self.queue.put(driver)
                return defer.succeed(resp)
    
        def _close(self):
            while not self.queue.empty():
                driver = self.queue.get_nowait()
                driver.close()
    

    Suppose your scraper is called "scraper". If you put the code above inside a file called handlers.py at the root of the "scraper" folder, then you can add this to your settings.py:

    DOWNLOAD_HANDLERS = {
        'http': 'scraper.handlers.PhantomJSDownloadHandler',
        'https': 'scraper.handlers.PhantomJSDownloadHandler',
    }
    

    And voilà: the JS-parsed DOM, with Scrapy caching, retries, etc.
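
    The handler also looks for an optional response_failed hook on the spider (see _response above); returning True from it discards that browser instance instead of putting it back in the pool. A minimal sketch with a purely illustrative check:

    import scrapy
    
    
    class MySpider(scrapy.Spider):
        name = "my_spider"
        start_urls = ["http://www.domain.com"]
    
        def response_failed(self, response, driver):
            # Treat an empty render as a failure so the handler closes
            # that browser instead of reusing it.
            return not response.body.strip()
    
        def parse(self, response):
            yield {"url": response.url}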

  • 2020-11-21 18:25

    I handle AJAX requests by using Selenium and the Firefox web driver. It is not that fast if you need to run the crawler as a daemon, but it is much better than any manual solution. I wrote a short tutorial here for reference.
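
    A minimal sketch of that Selenium + Firefox setup, assuming Selenium 4+ with geckodriver on PATH; the URL and CSS selector are placeholders:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument("-headless")  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get("http://www.example.com")
        # Wait until the AJAX-loaded element (placeholder selector) appears.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".ajax-content"))
        )
        html = driver.page_source  # the fully rendered DOM
    finally:
        driver.quit()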

  • 2020-11-21 18:27

    WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab allows you to see all the information about every request and response:

    [screenshot: Chrome Developer Tools, Network tab filtered to XHR requests]

    At the bottom of the picture you can see that I've filtered the requests down to XHR; these are requests made by JavaScript code.

    Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.

    After analyzing the requests and responses, you can simulate these requests from your web crawler and extract valuable data. In many cases it will be easier to get your data this way than by parsing the HTML, because the data does not contain presentation logic and is formatted to be accessed by JavaScript code.
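
    For instance, once you have found the XHR endpoint in the Network tab, you can call it directly from Scrapy and parse the JSON, skipping the HTML entirely. A minimal sketch; the endpoint and field names are made up:

    import json
    
    import scrapy
    
    
    class XhrReplaySpider(scrapy.Spider):
        name = "xhr_replay"
        # Hypothetical endpoint copied from the browser's Network tab.
        start_urls = ["http://www.example.com/api/items?page=1"]
    
        def parse(self, response):
            data = json.loads(response.text)
            for entry in data.get("items", []):
                yield {"name": entry.get("name"), "price": entry.get("price")}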

    Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.
