Scrape hidden pages if a search yields more results than are displayed

Asked by 自闭症患者 on 2021-01-26 04:03

Some of the search queries entered under https://www.comparis.ch/carfinder/default would yield more than 1'000 results (shown dynamically on the search page). The results, however, are not all displayed, so I would like to scrape the hidden pages as well.

1 Answer
  •  说谎
     Answered by 说谎 on 2021-01-26 04:35

    It seems that the website loads its data dynamically while the client browses, so the missing results are not present in the initial HTML. There are a number of ways to handle this; one option is to use Scrapy Splash to render the pages.

    Assuming you use Scrapy, you can do the following:

    1. Start a Splash server using Docker and make a note of the port it listens on.
    2. In settings.py, add SPLASH_URL pointing at that server, e.g. SPLASH_URL = 'http://localhost:8050' (see the settings sketch after this list).
    3. In settings.py, add the Splash downloader middlewares:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
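
    On top of the downloader middlewares above, a minimal sketch of the remaining settings.py additions might look like this (assuming Splash runs locally via docker run -p 8050:8050 scrapinghub/splash; the spider middleware and dupe filter follow the scrapy-splash README):

    # settings.py -- additions beyond DOWNLOADER_MIDDLEWARES
    SPLASH_URL = 'http://localhost:8050'  # adjust host/port to wherever your Splash server runs
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'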
    
    4. In your spider.py, import SplashRequest: from scrapy_splash import SplashRequest
    5. Set start_urls in your spider.py to iterate over the pages

    E.g. like this

    base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
    # builds .../occasion?page=0 through .../occasion?page=99
    start_urls = [
        base_url + '?page=' + str(page) for page in range(0, 100)
    ]
    
    6. Route every URL through the Splash server by modifying def start_requests(self):

    E.g. like this

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )
    
    7. Parse the response as you do now.
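
    Putting the steps together, a minimal end-to-end sketch might look like the following. The spider name and the selector in parse are placeholders for illustration, not the actual structure of the comparis.ch result pages:

    import scrapy
    from scrapy_splash import SplashRequest

    # module-level so the class-level list comprehension below can see it
    base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'

    class CarfinderSpider(scrapy.Spider):
        # hypothetical spider name, used only for this sketch
        name = 'carfinder'
        start_urls = [base_url + '?page=' + str(page) for page in range(0, 100)]

        def start_requests(self):
            # route every page through Splash so the JavaScript-rendered
            # results are present in the response body
            for url in self.start_urls:
                yield SplashRequest(url, self.parse,
                                    endpoint='render.html',
                                    args={'wait': 0.5})

        def parse(self, response):
            # placeholder selector -- swap in whatever you already extract
            for listing in response.css('div.listing'):
                yield {'title': listing.css('::text').get()}

    You can then run the spider as usual (e.g. scrapy crawl carfinder) and check whether the rendered pages contain the rows that were previously missing.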

    Let me know how that works out for you.
