Scrapy With Splash Only Scrapes 1 Page

时光总嘲笑我的痴心妄想 submitted on 2021-02-07 10:21:13

Question


I am trying to scrape multiple URLs, but for some reason only results for one site show up. In every case it is the last URL in start_urls whose results appear.

I believe I have the problem narrowed down to my parse function.

Any ideas on what I'm doing wrong?

Thanks!

import scrapy
from scrapy_splash import SplashRequest


class HeatSpider(scrapy.Spider):
    name = "heat"

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 8},
            )

    def parse(self, response):
        for metric in response.css('.matrix-data'):
            yield {
                'City': response.css('title::text').extract_first(),
                'Metric Data Title': metric.css('.title::text').extract_first(),
                'Metric Data Price': metric.css('.price::text').extract_first(),
            }

EDIT:

I have altered my code to help debug. After running it, my CSV has a row for every URL, as it should, but only one row is filled in with data.

class HeatSpider(scrapy.Spider):
    name = "heat"

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 8},
            )

    def parse(self, response):
        yield {
            'City': response.css('title::text').extract_first(),
            'Metric Data Title': response.css('.matrix-data .title::text').extract(),
            'Metric Data Price': response.css('.matrix-data .price::text').extract(),
            'url': response.url,
        }

EDIT 2: Here is the full output: http://pastebin.com/cLM3T05P. On line 46 of the paste you can see the empty cells.


Answer 1:


What worked for me was adding a delay between the requests:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

DOWNLOAD_DELAY = 5
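
You can set this project-wide in settings.py, as above, or per spider. A minimal sketch of the per-spider variant, using Scrapy's standard custom_settings class attribute (the value 5 simply matches the setting above):

import scrapy

class HeatSpider(scrapy.Spider):
    name = "heat"

    # overrides the project settings for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 5,  # seconds to wait between requests to the same site
    }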

I tested it on these 4 URLs and got results for all of them:

start_urls = [
    'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=washington&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=philadelphia&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
]



Answer 2:


From the docs:

start_requests()

This method must return an iterable with the first Requests to crawl for this spider.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once by Scrapy, so it's safe to implement it as a generator.

You can either specify the URLs inside start_requests(), or override make_requests_from_url(url) to build the requests from start_urls.

Example 1

start_urls = []

def start_requests(self):
    urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]
    for url in urls:
        yield SplashRequest(url, self.parse,
            endpoint='render.html',
            args={'wait': 8},
            dont_filter=True,
        )

Example 2

start_urls = [
    'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
]

def make_requests_from_url(self, url):
    # return (not yield) a single request: Scrapy calls this once per URL
    return SplashRequest(url, self.parse,
        endpoint='render.html',
        args={'wait': 8},
        dont_filter=True,
    )



Answer 3:


Are you sure scrapy-splash is configured properly?

Scrapy's default dupefilter doesn't take URL fragments into account (i.e. the part of the URL after #) because that part is never sent to the server as part of the HTTP request. But the fragment matters if you render the page in a browser.

scrapy-splash provides a custom dupefilter that does take the fragment into account; to enable it, set DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'. Without it, both requests get the same fingerprint (the URLs are identical once the fragment is removed), so the second request is filtered out as a duplicate.
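
You can see the collision directly. A minimal sketch using Scrapy's request_fingerprint utility (current at the time of this question; newer Scrapy versions replace it with scrapy.utils.request.fingerprint):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# everything after '#' is a fragment, which the default fingerprint strips
ny = Request('https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2')
dallas = Request('https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2')

print(request_fingerprint(ny) == request_fingerprint(dallas))  # True: treated as duplicates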

Try checking that all other settings are also correct (see https://github.com/scrapy-plugins/scrapy-splash#configuration).
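
For reference, the full set of settings from that README looks like this (the SPLASH_URL value assumes a Splash instance running locally on the default port):

# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# the fragment-aware dupefilter discussed above
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'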



Source: https://stackoverflow.com/questions/40363175/scrapy-with-splash-only-scrapes-1-page
