Python check if website exists

甜味超标 2020-11-27 12:51

I wanted to check if a certain website exists; this is what I'm doing:

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent': user_agent }
8 Answers
  • 2020-11-27 13:10

    You can simply pass stream=True so the full file isn't downloaded. In the latest Python 3 you won't get urllib2, so it's best to use the proven requests library. This simple function will solve your problem.

    import requests

    def uri_exists(uri):
        # stream=True keeps the response body from being downloaded up front
        r = requests.get(uri, stream=True)
        if r.status_code == 200:
            return True
        else:
            return False
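
    A quick usage sketch (www.example.com is just a stand-in URL):

    print(uri_exists("http://www.example.com"))       # True on a 200 response
    print(uri_exists("http://www.example.com/nope"))  # False on a 404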
    
  • 2020-11-27 13:11

    There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved in the case of large resources.

    The previous answer for requests suggested something like the following:

    import requests

    def uri_exists_get(uri: str) -> bool:
        try:
            response = requests.get(uri)
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
        except requests.exceptions.ConnectionError:
            return False
    

    requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.

    def uri_exists_stream(uri: str) -> bool:
        try:
            # stream=True defers the body download; the with-block closes the connection
            with requests.get(uri, stream=True) as response:
                try:
                    response.raise_for_status()
                    return True
                except requests.exceptions.HTTPError:
                    return False
        except requests.exceptions.ConnectionError:
            return False
    

    I ran the above snippets with timers attached against two web resources:

    1) http://bbb3d.renderfarming.net/download.html, a very light html page

    2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file

    Timing results below:

    uri_exists_get("http://bbb3d.renderfarming.net/download.html")
    # Completed in: 0:00:00.611239
    
    uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
    # Completed in: 0:00:00.000007
    
    uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
    # Completed in: 0:01:12.813224
    
    uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
    # Completed in: 0:00:00.000007
    

    As a last note: this function also works when the resource host doesn't exist. For example, "http://abcdefghblahblah.com/test.mp4" will return False.
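
    That nonexistent-host case exercises the ConnectionError branch:

    uri_exists_stream("http://abcdefghblahblah.com/test.mp4")
    # False: the failed DNS lookup raises ConnectionError, which the except clause catches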

  • 2020-11-27 13:13

    Code:

    import urllib

    a = "http://www.example.com"
    try:
        # In Python 2, urllib.urlopen raises IOError when the site can't be reached
        print urllib.urlopen(a)
    except IOError:
        print a + "  site does not exist"
    
  • 2020-11-27 13:15
    from urllib2 import Request, urlopen, HTTPError, URLError

    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent': user_agent }
    link = "http://www.abc.com/"
    req = Request(link, headers=headers)
    try:
        page_open = urlopen(req)
    except HTTPError, e:
        print e.code
    except URLError, e:
        print e.reason
    else:
        print 'ok'
    

    To answer the comment of unutbu:

    Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range (source: the urllib2 documentation).
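
    If you want to observe the 3xx codes yourself, one option (shown with requests rather than urllib2; the URL is hypothetical) is to turn off automatic redirect handling:

    import requests

    # allow_redirects=False returns the 301/302 response itself instead of following it
    response = requests.get("http://www.example.com/old-page", allow_redirects=False)
    print(response.status_code)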

  • 2020-11-27 13:15
    import urllib.request
    from urllib.error import HTTPError, URLError

    def isok(mypath):
        try:
            thepage = urllib.request.urlopen(mypath)
        except HTTPError:
            return 0
        except URLError:
            return 0
        else:
            return 1
    
  • 2020-11-27 13:21

    You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status from the headers.

    import httplib

    c = httplib.HTTPConnection('www.example.com')
    c.request("HEAD", '/')
    if c.getresponse().status == 200:
        print('web site exists')
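
    In Python 3, httplib was renamed to http.client; the equivalent would be roughly:

    import http.client

    c = http.client.HTTPConnection('www.example.com')
    c.request("HEAD", '/')
    if c.getresponse().status == 200:
        print('web site exists')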
    

    or you can use urllib2

    import urllib2
    try:
        urllib2.urlopen('http://www.example.com/some_page')
    except urllib2.HTTPError, e:
        print(e.code)
    except urllib2.URLError, e:
        print(e.args)
    

    or you can use requests

    import requests

    response = requests.get('http://www.example.com')
    if response.status_code == 200:
        print('Web site exists')
    else:
        print('Web site does not exist')
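
    Since the goal is only existence-checking, requests can also send the HEAD request directly so no body is downloaded; a minimal sketch:

    import requests

    # requests.head() returns only the status line and headers, no body
    response = requests.head('http://www.example.com')
    if response.status_code == 200:
        print('Web site exists')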
    