Python check if website exists


You can use a HEAD request instead of GET. It downloads only the headers, not the content, and you can then check the response status.

import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '/')
if c.getresponse().status == 200:
    print('web site exists')
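
On Python 3, httplib is named http.client; a minimal equivalent sketch (www.example.com is just a placeholder host):

import http.client

c = http.client.HTTPSConnection('www.example.com', timeout=5)
c.request('HEAD', '/')           # HEAD returns headers only, no body
if c.getresponse().status == 200:
    print('web site exists')
c.close()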

or you can use urllib2

import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
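
On Python 3, urllib2 was split into urllib.request and urllib.error; an equivalent sketch:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen('http://www.example.com/some_page')
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.args)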

or you can use requests

import requests
response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
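
Note that requests.get raises a ConnectionError if the host cannot be reached at all, so a sketch that also treats unreachable hosts as non-existent might look like this (the URL and timeout are placeholders):

import requests

try:
    response = requests.get('http://www.example.com', timeout=5)
    exists = response.status_code == 200
except requests.exceptions.RequestException:
    # covers ConnectionError, Timeout, etc.
    exists = False
print('Web site exists' if exists else 'Web site does not exist')
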
alecxe

It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):

  • 1xx - informational
  • 2xx - success
  • 3xx - redirection
  • 4xx - client error
  • 5xx - server error

If you want to check whether a page exists without downloading the whole page, you should use a HEAD request:

import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400

taken from this answer.
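
The same < 400 check can be written with requests using a HEAD request, so only the headers are fetched (a sketch; allow_redirects=True makes the final status the one that is checked):

import requests

resp = requests.head("http://www.google.com", allow_redirects=True)
assert resp.status_code < 400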

If you want to download the whole page, just make a normal request and check the status code. Example using requests:

import requests

response = requests.get('http://google.com')
assert response.status_code < 400

Hope that helps.

from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com/"
req = Request(link, headers = headers)
try:
        page_open = urlopen(req)
except HTTPError, e:
        print e.code
except URLError, e:
        print e.reason
else:
        print 'ok'
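
On Python 3 the same approach works with urllib.request and urllib.error (a sketch, keeping the same placeholder link and User-Agent):

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)
try:
    page_open = urlopen(req)
except HTTPError as e:
    print(e.code)
except URLError as e:
    print(e.reason)
else:
    print('ok')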

To answer the comment of unutbu:

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. Source
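
For instance (a Python 3 sketch; it assumes http://github.com still redirects to https://github.com/):

import urllib.request

# The default handlers follow the 301 redirect transparently, so the
# observed status is the final 200 rather than the redirect itself.
response = urllib.request.urlopen('http://github.com')
print(response.getcode())   # 200
print(response.geturl())    # 'https://github.com/'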

Raj

code:

a="http://www.example.com"
try:    
    print urllib.urlopen(a)
except:
    print a+"  site does not exist"

There is an excellent answer provided by @Adem Öztaş above for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved for the case of large resources.

The previous answer for requests suggested something like the following:

import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False

requests.get pulls the entire resource at once, so for large media files the snippet above would load the whole file into memory. To avoid this, we can stream the response.

def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False

I ran the above snippets with timers attached against two web resources:

1) http://bbb3d.renderfarming.net/download.html, a very light html page

2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file

Timing results below:

uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239

uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007

uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224

uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
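
The exact timing harness isn't shown above; a minimal sketch of how these numbers could be reproduced (the timed helper is illustrative, not part of the original answer):

import datetime

def timed(fn, uri):
    start = datetime.datetime.now()
    result = fn(uri)
    print("# Completed in: {}".format(datetime.datetime.now() - start))
    return result

timed(uri_exists_get, "http://bbb3d.renderfarming.net/download.html")
timed(uri_exists_stream, "http://bbb3d.renderfarming.net/download.html")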

As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.

import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
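
Usage sketch (the URL is a placeholder):

if isok('http://www.example.com'):
    print('page is reachable')
else:
    print('page is not reachable')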

Try this one:

import urllib2  
website='https://www.allyourmusic.com'  
try:  
    response = urllib2.urlopen(website)  
    if response.code==200:  
        print("site exists!")  
    else:  
        print("site doesn't exists!")  
except urllib2.HTTPError, e:  
    print(e.code)  
except urllib2.URLError, e:  
    print(e.args)  