Question
I am new to Python and Scrapy. I followed the tutorial and tried to crawl a few webpages. I used the code from the tutorial and replaced the URLs with http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0
and http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1
respectively.
When the HTML file is generated, not all of the data is displayed; only the data up to this URL - http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=0&sd=0&states=ALL&near=&ps=20&p=0 - is shown.
Also, while the command runs, the second URL is dropped as a duplicate, and only one HTML file is created.
I want to know whether the webpage denies access to that specific data or whether I should change my code to get the precise data.
When I then run the shell command I get an error. The output of the crawl command and the shell command was:
C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline crawl citydata
2016-10-19 12:00:27 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:00:27 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-10-19 12:00:27 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:00:27 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:00:27 [scrapy] INFO: Spider opened
2016-10-19 12:00:27 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-19 12:00:27 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:00:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-10-19 12:00:28 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/robots.txt> (referer: None)
2016-10-19 12:00:29 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0> (referer: None)
2016-10-19 12:00:29 [citydata] DEBUG: Saved file citydata-advanced.html
2016-10-19 12:00:29 [scrapy] INFO: Closing spider (finished)
2016-10-19 12:00:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 459,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 44649,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 19, 6, 30, 29, 751000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 19, 6, 30, 27, 910000)}
2016-10-19 12:00:29 [scrapy] INFO: Spider closed (finished)
C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline shell 'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0'
2016-10-19 12:21:51 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:21:51 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial', 'LOGSTATS_INTERVAL': 0}
2016-10-19 12:21:51 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:21:51 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:21:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:21:51 [scrapy] INFO: Spider opened
2016-10-19 12:21:53 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (failed 1 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:21:56 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (failed 2 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:21:58 [scrapy] DEBUG: Gave up retrying <GET http://'http:/robots.txt> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:21:58 [scrapy] ERROR: Error downloading <GET http://'http:/robots.txt>: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:00 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.com/advanced/search.php#body?fips=0> (failed 1 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:03 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.com/advanced/search.php#body?fips=0> (failed 2 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:05 [scrapy] DEBUG: Gave up retrying <GET http://'http://www.city-data.com/advanced/search.php#body?fips=0> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 161, in <module>
    execute()
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy\commands\shell.py", line 71, in run
    shell.start(url=url)
  File "C:\Python27\lib\site-packages\scrapy\shell.py", line 47, in start
    self.fetch(url, spider)
  File "C:\Python27\lib\site-packages\scrapy\shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "C:\Python27\lib\site-packages\twisted\internet\threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
'csize' is not recognized as an internal or external command,
operable program or batch file.
'sc' is not recognized as an internal or external command,
operable program or batch file.
'sd' is not recognized as an internal or external command,
operable program or batch file.
'states' is not recognized as an internal or external command,
operable program or batch file.
'near' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit1' is not recognized as an internal or external command,
operable program or batch file.
'b6914' is not recognized as an internal or external command,
operable program or batch file.
'e6914' is not recognized as an internal or external command,
operable program or batch file.
'i6914' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit2' is not recognized as an internal or external command,
operable program or batch file.
'b6819' is not recognized as an internal or external command,
operable program or batch file.
'e6819' is not recognized as an internal or external command,
operable program or batch file.
'i6819' is not recognized as an internal or external command,
operable program or batch file.
'ps' is not recognized as an internal or external command,
operable program or batch file.
'p' is not recognized as an internal or external command,
operable program or batch file.
My code is:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Someone please guide me through this.
Answer 1:
First of all, this website looks like a JavaScript-heavy one. Scrapy itself only downloads HTML from servers but does not interpret JavaScript statements.
Second, the URL fragment (i.e. everything from #body onwards) is never sent to the server, so only http://www.city-data.com/advanced/search.php is actually fetched. Scrapy behaves the same way your browser does here; you can confirm that in your browser's dev tools network tab.
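You can also check it directly in Python. Here is a minimal sketch using only the standard library and the first URL from your question (on Python 2, urldefrag lives in the urlparse module instead of urllib.parse):

from urllib.parse import urldefrag  # Python 2: from urlparse import urldefrag

url = ('http://www.city-data.com/advanced/search.php'
       '#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near='
       '&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1'
       '&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0')

# urldefrag splits off everything after '#'; only `base` ever goes over the wire
base, fragment = urldefrag(url)
print(base)      # http://www.city-data.com/advanced/search.php
print(fragment)  # body?fips=0&csize=a&...&ps=20&p=0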
So for Scrapy, the requests to
http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0
and
http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1
point to the same resource and differ only in their URL fragments, so the page is only fetched once.
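That is also exactly why your crawl log shows "Filtered duplicate request": Scrapy's default request fingerprint canonicalizes the URL and drops the fragment, so both URLs hash to the same fingerprint. A quick check you could run (a sketch against Scrapy 1.2's default fingerprinting as seen in your log; the URLs are shortened here for readability):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# shortened versions of the two search URLs; only the fragment differs
r0 = Request('http://www.city-data.com/advanced/search.php#body?fips=0&ps=20&p=0')
r1 = Request('http://www.city-data.com/advanced/search.php#body?fips=0&ps=20&p=1')

# canonicalization strips the fragment, so both fingerprints are identical
# and the default RFPDupeFilter drops the second request as a duplicate
print(request_fingerprint(r0) == request_fingerprint(r1))  # True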
What you need is a JavaScript renderer. You could use Selenium, or something like Splash. I recommend the scrapy-splash plugin, which includes a duplicate filter that takes URL fragments into account.
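If you go with scrapy-splash, the wiring looks roughly like this. This is a sketch following the scrapy-splash README, not tested against this site, and it assumes a Splash instance is already running on localhost:8050 (e.g. via Docker):

# settings.py -- additions for scrapy-splash
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# the fragment-aware duplicate filter mentioned above
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

In the spider, swap scrapy.Request for SplashRequest so Splash loads the page in a headless browser and runs its JavaScript before your parse() callback sees the response:

import scrapy
from scrapy_splash import SplashRequest


class CityDataSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            # put the two search URLs from the question here
        ]
        for url in urls:
            # 'wait' gives the page's JavaScript a moment to fill in the results
            yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body is now the Splash-rendered HTML
        self.log('Rendered %s (%d bytes)' % (response.url, len(response.body)))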
Source: https://stackoverflow.com/questions/40125256/webpage-access-while-using-scrapy