I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the URL (e.g. "http://www.website.com/page/1"). How can I tell when I've reached a page that doesn't exist (e.g. a 404) so I can stop?
You need to yield/return the request in order to check the status; creating a Request object does not actually send it.
from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'website.com'
    baseUrl = "http://website.com/page/"
    # Allow 404 responses to reach parse() instead of being dropped
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield Request(self.baseUrl + '0')

    def parse(self, response):
        if response.status != 404:
            # Process the page here, then request the next one
            page = response.meta.get('page', 0) + 1
            return Request('%s%s' % (self.baseUrl, page), meta=dict(page=page))
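Note that Scrapy's HTTP error middleware drops non-2xx responses by default before they reach your callback, which is why the spider above sets handle_httpstatus_list = [404]; without it, parse() would simply never see the 404.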
You can do something like this:
from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        req = urllib2.Request(fullURL)
        resp = urllib2.urlopen(req)
        # Do your normal stuff here if the page is found.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        if e.code == 404:
            # Do whatever you want if a 404 is found.
            print("404 Found!")
        else:
            print("URL: {0} Response: {1}".format(fullURL, e.code))
    except urllib2.URLError:
        print("Could not connect to URL: {0}".format(fullURL))
This iterates through the range and attempts to connect to each URL via urllib2. Note that urllib2.urlopen() raises an HTTPError for a 404 rather than returning the response, which is why the 404 check lives in the except block. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites that use this type of URL format are typically running a CMS that automatically redirects non-existent pages to a custom "404 - Not Found" page, which will still come back with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping: look for text that would not appear during a "normal" page load, such as "Page not found" or something similarly unique to that site's custom 404 page.
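As a rough sketch of that kind of check, built on the same urllib2 approach (the is_custom_404 helper and the "Page not found" marker string are assumptions; you would need to inspect the site's actual custom 404 page to pick a marker that reliably identifies it):

import urllib2

def is_custom_404(url, marker="Page not found"):
    # Hypothetical helper: returns True if the page comes back with a 200
    # status but its body contains text unique to the site's custom 404 page.
    # The marker string is an assumption; replace it with something you have
    # verified only appears on the "not found" page.
    try:
        resp = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        return e.code == 404  # a real 404 still counts as "page missing"
    return resp.getcode() == 200 and marker in resp.read()

# Example usage:
# if is_custom_404("http://www.website.com/page/57"):
#     print("No such page")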