Can Scrapy be used to scrape dynamic content from websites that use AJAX?

星月不相逢 2020-11-21 17:48

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting we

8 Answers
  • 2020-11-21 18:31

    Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.

    There are two approaches to scraping these kinds of websites.

    First,

    You can use Splash to render the JavaScript and then parse the rendered HTML. The docs and the project can be found under Scrapy Splash (git).

    Second,

    As others have said, by monitoring the network calls you can find the API call that fetches the data. Replicating that call in your Scrapy spider may get you the desired data.
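
    As a rough sketch of the second approach: once the browser's Network tab shows which endpoint the page calls, the spider can rebuild that request itself. The endpoint, query parameters, and headers below are hypothetical placeholders, not taken from any real site; substitute whatever your browser actually shows.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, standing in for the one seen in the Network tab.
API_URL = "https://example.com/api/odds"


def build_api_request(page, per_page=50):
    """Return (url, headers) imitating the XHR call the page itself makes."""
    # Parameter names are assumptions; copy the real ones from the browser.
    query = urlencode({"page": page, "per_page": per_page})
    headers = {
        "Accept": "application/json",
        # Many AJAX endpoints expect this header; whether yours does is an assumption.
        "X-Requested-With": "XMLHttpRequest",
    }
    return f"{API_URL}?{query}", headers


url, headers = build_api_request(2)
```

    In a spider, the returned pair could be passed to scrapy.Request(url, headers=headers, callback=...), with the JSON body parsed in the callback.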

  • 2020-11-21 18:43

    "How can Scrapy be used to scrape this dynamic data so that I can use it?"

    I wonder why no one has posted the solution using Scrapy only.

    Check out the Scrapy team's blog post "Scraping Infinite Scrolling Pages". The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.

    The idea is to use your browser's Developer Tools to observe the AJAX requests, then build the Scrapy requests based on that information.

    import json
    import scrapy


    class SpidyQuotesSpider(scrapy.Spider):
        name = 'spidyquotes'
        # JSON endpoint that the page calls while scrolling
        quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
        start_urls = [quotes_base_url % 1]
        download_delay = 1.5

        def parse(self, response):
            # The response body is JSON, not HTML, so no selectors are needed
            data = json.loads(response.body)
            for item in data.get('quotes', []):
                yield {
                    'text': item.get('text'),
                    'author': item.get('author', {}).get('name'),
                    'tags': item.get('tags'),
                }
            # Request the next page until the API reports there is none
            if data['has_next']:
                next_page = data['page'] + 1
                yield scrapy.Request(self.quotes_base_url % next_page)
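
    To check the extraction and pagination logic without touching the network, the same steps can be pulled into a plain function and run against a hand-written payload. This is only a sketch: the sample data is made up, its shape is inferred from the fields the spider reads, and extract is a hypothetical helper name.

```python
import json

BASE_URL = "http://spidyquotes.herokuapp.com/api/quotes?page=%s"


def extract(body):
    """Return (items, next_url) for one JSON page; next_url is None on the last page."""
    data = json.loads(body)
    items = [
        {
            "text": quote.get("text"),
            "author": quote.get("author", {}).get("name"),
            "tags": quote.get("tags"),
        }
        for quote in data.get("quotes", [])
    ]
    # Build the next page URL only if the API says another page exists.
    next_url = BASE_URL % (data["page"] + 1) if data.get("has_next") else None
    return items, next_url


# Made-up payload in the shape the spider above expects.
sample = json.dumps({
    "page": 1,
    "has_next": True,
    "quotes": [{"text": "Hello", "author": {"name": "Someone"}, "tags": ["demo"]}],
})
items, next_url = extract(sample)
```

    Here items holds one dictionary with the text, author name, and tags, and next_url points at page 2; a payload with has_next set to false yields None instead.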
    