webpage returns 405 status code error when accessed with scrapy

Submitted by 被刻印的时光 ゝ on 2019-12-21 06:29:09

Question


I am trying to scrape the URL below with Scrapy:

https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n

but it always ends up returning a 405 status code. I have searched for this error, and the answers always say that it occurs when the request method is incorrect, such as POST in place of GET. That is surely not the case here, since the request is a plain GET.

Here is my spider code:

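import scrapy


class SampleSpider(scrapy.Spider):
    # Note: Scrapy does not read settings from bare class attributes, so this
    # line has no effect; per-spider settings belong in custom_settings.
    AUTOTHROTTLE_ENABLED = True
    name = 'test'
    start_urls = ['https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n']

    def parse(self, response):
        # response.text is the decoded body (body_as_unicode() is deprecated).
        yield {
            'response': response.text,
        }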

And here is the log I get when I run the scraper:

PS D:\> scrapy runspider tst.py -o tst.csv
2017-06-26 19:20:49 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-06-26 19:20:49 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'tst.csv'}
2017-06-26 19:20:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-26 19:20:50 [scrapy.core.engine] INFO: Spider opened
2017-06-26 19:20:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-26 19:20:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-26 19:20:51 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n> (referer: None)
2017-06-26 19:20:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook>: HTTP status code is not handled or not allowed
2017-06-26 19:20:51 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-26 19:20:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 306,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 9360,
 'downloader/response_count': 1,
 'downloader/response_status_count/405': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 26, 13, 50, 51, 432000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 6, 26, 13, 50, 50, 104000)}
2017-06-26 19:20:51 [scrapy.core.engine] INFO: Spider closed (finished)

Any help will be very much appreciated. Thank you in advance.


Answer 1:


I encountered a similar problem trying to scrape www.funda.nl and solved it by

  1. changing the user agent (using https://pypi.org/project/scrapy-random-useragent/),
  2. using Scrapy Splash.

This may work for the website you're trying to scrape as well (although I haven't tested this).
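
As a rough illustration of the first step, here is a minimal sketch of a downloader middleware that rotates the User-Agent header. It is not the code of the scrapy-random-useragent package; the class name `RandomUserAgentMiddleware`, the `USER_AGENTS` list, and the `myproject.middlewares` module path are hypothetical names chosen for the example, and whether a different user agent is enough to get past the 405 on this particular site is untested.

```python
import random

# Hypothetical pool of browser-like User-Agent strings (assumption: any
# realistic, reasonably current browser strings can be substituted here).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 "
    "(KHTML, like Gecko) Version/10.1.1 Safari/603.2.4",
    "Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to continue processing the request.
        return None


# To enable it, something like the following would go in settings.py
# (the module path is a placeholder for wherever the class is saved):
#
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```

If the site still answers with 405 after this change, the block is probably based on more than the User-Agent header, which is where the second step (rendering through Scrapy Splash) would come in.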



Source: https://stackoverflow.com/questions/44761497/webpage-returns-405-status-code-error-when-accessed-with-scrapy
