Scrapy shell works but actual script returns 404 error

淺唱寂寞╮ submitted on 2021-01-29 04:17:54

Question


scrapy shell http://www.zara.com/us

Returns a correct 200 code

2017-01-05 18:34:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: zara)
2017-01-05 18:34:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'zara.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['zara.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'zara', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'zara (+http://www.yourdomain.com)'}
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-05 18:34:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-05 18:34:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-05 18:34:20 [scrapy.core.engine] INFO: Spider opened
2017-01-05 18:34:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zara.com/robots.txt> (referer: None) ['cached']
2017-01-05 18:34:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.zara.com/us/> from <GET http://www.zara.com/us>
2017-01-05 18:34:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zara.com/us/> (referer: None) ['cached']

But my actual .py script gets a 404 error when it requests www.zara.com/us. If I use www.zara.com, the page returns a 200, but when I request the country-specific site, it returns a 404:

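import scrapy

# ProductItem is assumed to come from the project's items module (its import is not shown in the question)


class ZaraSpider(scrapy.Spider):

    name = "zara-us"
    # the attribute is allowed_domains (plural) and takes bare domains, not URL paths
    allowed_domains = ['zara.com']
    start_urls = [
        "http://www.zara.com/us"
    ]
    # pass 404 responses to parse() instead of letting HttpErrorMiddleware drop them
    handle_httpstatus_list = [404]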

    # navigating main page
    def parse(self, response):

        # get 1st 2 category listing in navigation sidebar
        categories = response.xpath('//*[@id="menu"]/ul/li')
        collections = categories[0].xpath('a//text()').extract()
        yield ProductItem(collection=collections[0])

Typing in terminal: scrapy crawl zara-us

2017-01-05 18:45:24 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: zara)
2017-01-05 18:45:24 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'zara.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['zara.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'zara', 'USER_AGENT': 'zara (+http://www.yourdomain.com)'}
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-05 18:45:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-05 18:45:24 [scrapy.core.engine] INFO: Spider opened
2017-01-05 18:45:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-05 18:45:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-05 18:45:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.zara.com/robots.txt> (referer: None) ['cached']
2017-01-05 18:45:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.zara.com/us> (referer: None) ['cached']
2017-01-05 18:45:25 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.zara.com/us> (referer: None)

Answer 1:


Every new Scrapy project created with scrapy startproject has ROBOTSTXT_OBEY set to True in settings.py, which means that before your spider can scrape anything, Scrapy fetches the website's robots.txt file and checks what is allowed and disallowed to be scraped.
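
To make the allowed/disallowed check concrete, here is a minimal sketch that uses Python's standard-library urllib.robotparser. It only illustrates the kind of rule matching a robots.txt check performs and is not a copy of Scrapy's middleware. The robots.txt content and the is_allowed helper are invented for illustration; the real file served by www.zara.com may look entirely different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example only.
# It is NOT the real file served by www.zara.com.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one big string
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "zara" stands in for the bot name taken from the log output in the question
allowed_home = is_allowed(EXAMPLE_ROBOTS_TXT, "zara", "http://www.zara.com/us/")
allowed_private = is_allowed(EXAMPLE_ROBOTS_TXT, "zara", "http://www.zara.com/private/page")
print(allowed_home, allowed_private)
```

With these made-up rules, the /us/ page is permitted and anything under /private/ is refused.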

To disable this, delete the ROBOTSTXT_OBEY setting from the settings.py file (Scrapy's built-in default is False), or set it to False explicitly.
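
As a rough sketch, the relevant fragment of settings.py could look like the following. This assumes the project was generated by scrapy startproject, which matches the 'ROBOTSTXT_OBEY': True entry in the overridden settings logged in the question.

```python
# settings.py (fragment)
# The startproject template sets ROBOTSTXT_OBEY = True; Scrapy's own default is False.
# Deleting the line has the same effect as setting it to False explicitly.
ROBOTSTXT_OBEY = False
```

Note that the log in the question marks both 404 responses as ['cached'] and shows HTTPCACHE_ENABLED set to True, so stale cached responses may keep being replayed after this change. Clearing the HTTP cache, or temporarily turning HTTPCACHE_ENABLED off, may be worth trying as well.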

See more here



Source: https://stackoverflow.com/questions/41497697/scrapy-shell-works-but-actual-script-returns-404-error
