scrape hidden pages if search yields more results than displayed

自闭症患者 2021-01-26 04:03

Some of the search queries entered under https://www.comparis.ch/carfinder/default would yield more than 1'000 results (shown dynamically on the search page). The results howev

1 answer
  • 2021-01-26 04:35

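    It seems that the website loads its data dynamically, with JavaScript, while the client is browsing, so the raw HTML that Scrapy downloads does not contain all the results. There are probably a number of ways to handle this. One option is to use Scrapy Splash, which renders the page before handing it to your spider.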

    Assuming you use scrapy, you can do the following:

    1. Start a Splash server using docker - make a note of the
    2. In settings.py add SPLASH_URL = <splash-server-ip-address>
    3. In settings.py, add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
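
    For reference, a slightly fuller settings.py could look like the sketch below. The SPLASH_URL value is an assumption (a Splash container running locally on port 8050); replace it with the address you noted in step 1. The spider middleware and dupefilter entries are the ones the scrapy-splash README recommends alongside the downloader middlewares.

```python
# settings.py (sketch, not a complete project configuration)

# Assumed address of the Splash server, e.g. a local container started with:
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep response compression handling after it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments for queued requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```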
    
    4. In your spider.py, add the import: from scrapy_splash import SplashRequest
    5. Set start_urls in your spider.py to iterate over the result pages

    For example:

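    base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
    # One URL per result page: ...?page=0 through ...?page=99.
    # Build this list at module level: a list comprehension inside a
    # class body cannot see other class attributes such as base_url.
    start_urls = [f'{base_url}?page={page}' for page in range(0, 100)]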
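
    Rather than hard-coding 100 pages, the page count could be derived from the total number of results that the search page reports. The sketch below is only an illustration: build_page_urls, total_results and results_per_page are hypothetical names, the page size has to be checked against the site, and it assumes the page parameter is zero-based as in the range above.

```python
from math import ceil
from urllib.parse import urlencode

BASE_URL = "https://www.comparis.ch/carfinder/marktplatz/occasion"


def build_page_urls(total_results, results_per_page, base_url=BASE_URL):
    """Return one URL per result page, enough to cover total_results.

    Assumes zero-based paging through a 'page' query parameter.
    """
    if results_per_page <= 0:
        raise ValueError("results_per_page must be positive")
    # Round up so that a partially filled last page is still requested.
    page_count = ceil(total_results / results_per_page)
    return [
        f"{base_url}?{urlencode({'page': page})}"
        for page in range(page_count)
    ]
```

    The returned list can be assigned to start_urls, or iterated over directly in start_requests.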
    
    6. Route the requests through the Splash server by overriding start_requests(self)

    For example:

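    def start_requests(self):
        for url in self.start_urls:
            # Fetch each page through Splash so that JavaScript-loaded results are rendered
            yield SplashRequest(url, self.parse,
                endpoint='render.html',  # return the rendered HTML
                args={'wait': 0.5},      # seconds to wait after the page loads
            )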
    
    7. Parse the response as you do now.

    Let me know how that works out for you.
