scrapy-splash usage for rendering javascript

回眸只為那壹抹淺笑 提交于 2019-12-25 08:24:45

问题


This is a follow up of my previous quesion

I installed splash and scrapy-splash.

And also followed the instructions for scrapy-splash.

I edited my code as follows:

import scrapy
from scrapy_splash import SplashRequest

class CityDataSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
            ]
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

But still i get the same output. only one html file is generated and the result is only for http://www.city-data.com/advanced/search.php

is there anything wrong in the code or any other suggestions please.


回答1:


First off, I wanted to clear up some possible points of confusion from your last question which "@paul trmbrth" wrote:

The URL fragment (i.e. everything including and after #body) is not sent to the server and only http://www.city-data.com/advanced/search.php is fetched

So for Scrapy, the requests to [...] and [...] are the same resource, so it's only fetch once. They differ only in their URL fragments.

URI standards dictate that the number sign (#) be used to indicate the start of the fragment, which is the last part of the URL. In most/all browsers, nothing beyond the "#" is transmitted. However, it's fairly common for AJAX sites to utilize Javascript's window.location.hash grab the URL fragment, and use it to execute additional AJAX calls. I bring this up because city-data.com does exactly this, which may confuse you as it does in fact bring back two different sites for each of those URLs in a browser.

Scrapy does by default drop the URL fragment, so it will report both URLs as being just "http://www.city-data.com/advanced/search.php", and filter the second one.


With all of that out of the way, there will still be problem after you remove "#body" from the URLs caused by a combination of of page = response.url.split("/")[-2] and filename = 'citydata-%s.html' % page. Neither of your URL's redirect, so the URL provided is what will populate the response.url string.

Isolating that, we get the following:

>>> urls = [
>>>     'http://www.city-data.com/advanced/search.php?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
>>>     'http://www.city-data.com/advanced/search.php?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
>>> ]
>>> for url in urls:
...     print(url.split("/")[-2])

advanced
advanced

So, for both URL's, you're extracting the same piece of information, which means when you use filename = 'citydata-%s.html' % page you're going to get the same filename, which I assume would be 'citydata-advanced.html'. The second time it's called, you're overwriting the first file.

Depending on what you're doing with the data, you could either change this to append to the file, or modify your filename variable to something unique such as:

from urlparse import urlparse, parse_qs

import scrapy
from scrapy_splash import SplashRequest

class CityDataSpider(scrapy.Spider):

    [...]

    def parse(self, response):
        page = parse_qs(urlparse(response.url).query).get('p')
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)


来源:https://stackoverflow.com/questions/40262194/scrapy-splash-usage-for-rendering-javascript

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!