Checking a URL for a 404 error in Scrapy

Backend · Open · 2 answers · 863 views
孤城傲影 2020-12-11 11:33

I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the url (e.g. "http://www.websi

2 Answers
  • 2020-12-11 12:05

    You need to yield (or return) the request in order for Scrapy to send it; creating a Request object does not actually send anything. Note also that Scrapy's HttpErrorMiddleware normally drops non-2xx responses before they reach your callback, so the 404 has to be allowed through explicitly with handle_httpstatus_list.

    from scrapy import Spider, Request

    class MySpider(Spider):
        name = 'website.com'
        baseUrl = "http://website.com/page/"
        # Allow 404 responses through to parse(); by default Scrapy
        # filters out non-2xx responses before the callback runs.
        handle_httpstatus_list = [404]

        def start_requests(self):
            yield Request(self.baseUrl + '0')

        def parse(self, response):
            if response.status != 404:
                page = response.meta.get('page', 0) + 1
                yield Request('%s%s' % (self.baseUrl, page),
                              meta={'page': page})
    
  • 2020-12-11 12:12

    You can do something like this:

    from __future__ import print_function
    import urllib2

    baseURL = "http://www.website.com/page/"

    for n in xrange(100):
        fullURL = baseURL + str(n)
        try:
            resp = urllib2.urlopen(urllib2.Request(fullURL))
            # Page exists -- do your normal processing here.
            print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
        except urllib2.HTTPError as e:
            # urlopen raises HTTPError for 4xx/5xx status codes, so a 404
            # never reaches resp.getcode() -- it has to be caught here.
            if e.code == 404:
                # Do whatever you want when a 404 is found.
                print("404 Found!")
            else:
                print("HTTP error {0} for URL: {1}".format(e.code, fullURL))
        except urllib2.URLError:
            print("Could not connect to URL: {0}".format(fullURL))
    

    This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.

    Note that many sites using this kind of URL format run a CMS that redirects non-existent pages to a custom "Not Found" page, which can still come back with an HTTP status code of 200. In that case, the best way to spot a page that loads successfully but is really just the custom 404 page is to do some screen scraping: look for text that would not appear on a "normal" page, such as "Page not found" or something similarly unique to the site's custom 404 page.
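    A minimal sketch of that "soft 404" check. The marker phrases below are assumptions for illustration -- inspect the target site's actual custom error page and pick text unique to it:

    ```python
    # Hypothetical marker phrases; replace them with text that is unique
    # to the target site's custom "Not Found" page.
    SOFT_404_MARKERS = ("page not found", "no longer exists", "error 404")

    def looks_like_soft_404(html):
        """Return True if the page body resembles a custom 'Not Found' page."""
        body = html.lower()
        # Case-insensitive substring match against each known marker.
        return any(marker in body for marker in SOFT_404_MARKERS)

    print(looks_like_soft_404("<h1>Page Not Found</h1>"))    # soft 404
    print(looks_like_soft_404("<h1>Welcome to page 3</h1>"))  # real page
    ```

    You would call this on response bodies that came back with status 200, and treat a match the same way as a real 404.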
