Can scrapy be used to scrape dynamic content from websites that are using AJAX?

Front end · Unresolved · 8 answers · 796 views
星月不相逢 2020-11-21 17:48

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting we

8 answers
  •  醉话见心
    2020-11-21 18:21

    Here is a simple example of Scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

    All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...).

    When I analyze the source code of the page, I can't see all these messages, because the web page loads them with AJAX. But I can use Firebug for Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP request that generates the messages on the web page.

    It doesn't reload the whole page, only the parts of the page that contain messages. To trigger the request, I click an arbitrary page number at the bottom of the page.

    Then I observe the HTTP request that is responsible for the message body.

    Once it finishes, I analyze the headers of the request (note that I will extract this URL from the page source, from the var section; see the code below).
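    The URL extraction mentioned above can be tried on its own before writing the spider. This is a minimal sketch: the variable name url_list_gb_messages comes from the spider code below, while SAMPLE_SOURCE, the path in it, and the helper extract_ajax_path are hypothetical stand-ins, since the real page markup is not shown here.

```python
import re

# Hypothetical excerpt of the page source; the real markup and path may differ.
SAMPLE_SOURCE = '''
<script type="text/javascript">
    var url_list_gb_messages="/hypothetical/endpoint";
</script>
'''


def extract_ajax_path(page_source):
    """Return the path assigned to url_list_gb_messages, or None if absent."""
    # [^"]* stops at the first closing quote, so trailing text on the same
    # line is not captured by accident.
    match = re.search(r'url_list_gb_messages="([^"]*)"', page_source)
    return match.group(1) if match else None


print(extract_ajax_path(SAMPLE_SOURCE))
```

    Returning None when the variable is missing makes it easier to notice a changed page layout than an AttributeError raised from .group(1).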

    I also inspect the form data of the request (the HTTP method is POST).
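    To make the shape of that POST request concrete, here is a small sketch that builds it with the standard library without sending it. The form fields page and uid are the ones used in the spider code below; the endpoint path and the helper build_messages_request are hypothetical, because the real path is read from the page source.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_messages_request(base_url, ajax_path, page):
    """Build (but do not send) the form-encoded POST request for one page."""
    form = {'page': str(page), 'uid': ''}
    body = urlencode(form).encode('ascii')  # form body, e.g. b'page=2&uid='
    return Request(urljoin(base_url, ajax_path), data=body, method='POST')


# '/hypothetical/endpoint' is a placeholder, not the site's real path.
request = build_messages_request('http://www.rubin-kazan.ru', '/hypothetical/endpoint', 2)
print(request.get_method(), request.full_url, request.data)
```

    Scrapy's FormRequest performs the same form encoding when given the formdata argument, so this is only useful for checking the request by hand.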

    The content of the response is JSON, and it contains all the information I'm looking for.

    Now I must implement all this knowledge in Scrapy. Let's define the spider for this purpose:

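    import re

    import scrapy
    from scrapy.http import FormRequest


    class spider(scrapy.Spider):  # BaseSpider is the old name of scrapy.Spider
        name = 'RubiGuesst'
        start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

        def parse(self, response):
            # The AJAX endpoint is stored in a JavaScript variable in the page source.
            url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.text).group(1)
            page = 0  # assumption: `page` was not defined in the original; 0 requests page 1
            yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                              formdata={'page': str(page + 1), 'uid': ''})

        def RubiGuessItem(self, response):
            # The body of this response is the JSON document with the messages.
            json_file = response.body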

    In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.
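    A possible next step is to decode that JSON body into one record per message. The sketch below is only illustrative: the structure of the real response is not reproduced in this answer, so SAMPLE_BODY, the "messages" key, the field names author, date and text, and the helper iter_messages are all assumptions.

```python
import json

# Hypothetical payload; the real response may use other keys and nesting.
SAMPLE_BODY = b'{"messages": [{"author": "Ivan", "date": "2012-01-01", "text": "Hello"}]}'


def iter_messages(body):
    """Yield one dict per message found in a JSON response body."""
    data = json.loads(body)  # json.loads accepts bytes as well as str
    # Accept either a top-level list or an object with a "messages" list.
    records = data if isinstance(data, list) else data.get('messages', [])
    for record in records:
        yield {
            'author': record.get('author'),
            'date': record.get('date'),
            'text': record.get('text'),
        }


for message in iter_messages(SAMPLE_BODY):
    print(message)
```

    Inside RubiGuessItem this could be used as `for item in iter_messages(response.body): yield item`, once the keys are adjusted to match the real response.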
