Question
I have a small problem with Scrapy when crawling a website. I followed the Scrapy tutorial to learn how to crawl a site and wanted to test it on 'https://www.leboncoin.fr', but the spider doesn't work. So I tried:
scrapy shell 'https://www.leboncoin.fr'
But I don't get a response from the site.
$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x1039fbd30>
[s] item {}
[s] request <GET https://www.leboncoin.fr>
[s] settings <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
If I run:
view(response)
an AttributeError is raised:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)
/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
67 from scrapy.http import HtmlResponse, TextResponse
68 # XXX: this implementation is a bit dirty and could be improved
---> 69 body = response.body
70 if isinstance(response, HtmlResponse):
71 if b'<base' not in body:
AttributeError: 'NoneType' object has no attribute 'body'
Edit 1:
To rrschmidt: the complete log has been updated, and when I run
fetch('https://www.leboncoin.fr')
I get this:
2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
So, how can I fix it?
Thanks for your answers,
Chris
Answer 1:
It looks like the website has restricted scraping via robots.txt. It is usually polite to respect that wish.
But if you really want to scrape the site, you can change Scrapy's default behaviour by setting ROBOTSTXT_OBEY to False in your settings.py:
ROBOTSTXT_OBEY = False
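For reference, here is a minimal sketch of what the relevant part of settings.py might look like after the change (assuming the standard layout generated by scrapy startproject for the all_cote project shown in the log; only the ROBOTSTXT_OBEY line actually needs to change):

# settings.py -- project 'all_cote' (assumed standard startproject layout)
BOT_NAME = 'all_cote'
SPIDER_MODULES = ['all_cote.spiders']
NEWSPIDER_MODULE = 'all_cote.spiders'

# Do not download and obey robots.txt before crawling.
# The site explicitly disallows crawling, so check its terms of use first.
ROBOTSTXT_OBEY = False

For a quick test without editing settings.py, the same setting can also be overridden on the command line when starting the shell:

scrapy shell 'https://www.leboncoin.fr' -s ROBOTSTXT_OBEY=False

With that setting, the request is no longer rejected by RobotsTxtMiddleware, so response gets populated and view(response) should no longer fail with the NoneType error (though the site may still apply other anti-bot measures).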
Source: https://stackoverflow.com/questions/43973898/scrapy-shell-return-without-response