Question
scrapy shell http://www.zara.com/us
This returns the correct 200 status code:
2017-01-05 18:34:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: zara)
2017-01-05 18:34:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'zara.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['zara.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'zara', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'zara (+http://www.yourdomain.com)'}
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-05 18:34:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-05 18:34:20 [scrapy.core.engine] INFO: Spider opened
2017-01-05 18:34:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zara.com/robots.txt> (referer: None) ['cached']
2017-01-05 18:34:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.zara.com/us/> from <GET http://www.zara.com/us>
2017-01-05 18:34:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zara.com/us/> (referer: None) ['cached']
But my actual Python script gets a 404 error when it requests www.zara.com/us. If I use www.zara.com, the page returns a 200, but when I try the country-specific site, it returns a 404:
class ZaraSpider(scrapy.Spider):
    name = "zara-us"
    allowed_domain = ['www.zara.com/us']
    start_urls = [
        "http://www.zara.com/us"
    ]
    handle_httpstatus_list = [404]

    # navigating main page
    def parse(self, response):
        # get 1st 2 category listing in navigation sidebar
        categories = response.xpath('//*[@id="menu"]/ul/li')
        collections = categories[0].xpath('a//text()').extract()
        yield ProductItem(collection=collections[0])
Typing in terminal:
scrapy crawl zara-us
2017-01-05 18:45:24 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: zara)
2017-01-05 18:45:24 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'zara.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['zara.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'zara', 'USER_AGENT': 'zara (+http://www.yourdomain.com)'}
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-05 18:45:24 [scrapy.core.engine] INFO: Spider opened
2017-01-05 18:45:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-05 18:45:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-05 18:45:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.zara.com/robots.txt> (referer: None) ['cached']
2017-01-05 18:45:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.zara.com/us> (referer: None) ['cached']
2017-01-05 18:45:25 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.zara.com/us> (referer: None)
Answer 1:
In every new project, Scrapy's generated settings.py sets ROBOTSTXT_OBEY to True, which means that before your spider scrapes anything, it checks the website's robots.txt file for what is allowed and disallowed to be scraped.
To disable this, simply delete the ROBOTSTXT_OBEY setting from the settings.py file.
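As an alternative to deleting the line, the setting can be switched off explicitly. The fragment below is only a sketch of what the relevant part of settings.py might look like. The project values are copied from the "Overridden settings" log line in the question, and ROBOTSTXT_OBEY is the only one changed.

```python
# settings.py (fragment) for the "zara" project from the question.
BOT_NAME = 'zara'
SPIDER_MODULES = ['zara.spiders']
NEWSPIDER_MODULE = 'zara.spiders'

# Scrapy's built-in default for this setting is False; the project
# template generated by `scrapy startproject` sets it to True.
# Setting it to False (or deleting the line) stops Scrapy from
# fetching and honoring robots.txt before it crawls.
ROBOTSTXT_OBEY = False
```

With this in place, RobotsTxtMiddleware should no longer show up under "Enabled downloader middlewares" when you run scrapy crawl zara-us, and the robots.txt request should disappear from the log.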
Source: https://stackoverflow.com/questions/41497697/scrapy-shell-works-but-actual-script-returns-404-error