urlopen Returning Redirect Error for Valid Links

I'm building a broken link checker in Python, and it's becoming a chore to build the logic that correctly identifies links which do not resolve when visited with a browser.

3 Answers

囚心锁ツ

2021-01-06 04:37

You get the infinite loop error because the page you want to scrape uses cookies and redirects when the cookie isn't sent by the client. You'll get the same error with most other scraper tools and also with browsers when you disallow cookies.

You need an http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid the redirect loop:

import urllib.error
import urllib.request
from http.cookiejar import CookieJar

# Example URL taken from this thread; substitute the link being checked.
url = 'http://www.bad.org.uk'

# Browser-like request headers.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    # Advertising gzip/deflate means the body may arrive compressed;
    # urllib does not decompress it automatically.
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    # Store cookies and send them back on each redirect.
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    output = format(inst)
    print(output)
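
To see the cookie effect without hitting a real site, here is a minimal sketch that runs entirely on localhost. The `CookieGateHandler` server and the `fetch_status` helper are hypothetical names made up for this illustration; the handler only imitates a site that keeps redirecting until its cookie comes back, and is not how any particular site is implemented.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that redirects until a cookie is sent."""

    def do_GET(self):
        if "seen=1" in self.headers.get("Cookie", ""):
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No cookie yet: set one and redirect back to the same path.
            self.send_response(302)
            self.send_header("Set-Cookie", "seen=1")
            self.send_header("Location", self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status(opener, url):
    """Return the final HTTP status, or the HTTPError text on failure."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return str(err)


# Port 0 lets the OS pick a free port for the throwaway server.
server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# ProxyHandler({}) bypasses any system proxy for the local demo server.
no_proxy = urllib.request.ProxyHandler({})
plain = urllib.request.build_opener(no_proxy)
with_cookies = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(CookieJar())
)

plain_result = fetch_status(plain, url)          # redirect loop error
cookie_result = fetch_status(with_cookies, url)  # succeeds once the cookie is returned
print(plain_result)
print(cookie_result)

server.shutdown()
server.server_close()
```

The opener without a cookie jar gives up with the redirect error, while the one with `HTTPCookieProcessor` gets through, because the cookie from the first 302 response is sent back on the follow-up request.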

庸人自扰

2021-01-06 04:38
I concur with the comments on the first answer: it wasn't working for me either (I was getting encoded/compressed byte data, nothing readable).

The link mentioned used urllib2. It also works with urllib in Python 3.7, as follows:
```
from urllib.request import build_opener, HTTPCookieProcessor
opener = build_opener(HTTPCookieProcessor())
response = opener.open('http://www.bad.org.uk')
print(response.read())
```
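
As for the unreadable bytes: a likely cause is that the first answer's request advertises `Accept-Encoding: gzip, deflate, sdch`, and urllib does not decompress response bodies on its own. If you want to keep that header, here is a rough sketch of undoing the compression yourself. `decode_body` is a hypothetical helper written for this example, and it only covers gzip and deflate (not sdch).

```python
import gzip
import zlib


def decode_body(raw, content_encoding="", charset="utf-8"):
    """Undo gzip/deflate compression (if any) and decode the bytes to text."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw.decode(charset, errors="replace")


# Simulated compressed body, so the sketch runs without a network connection.
compressed = gzip.compress("<html>hello</html>".encode("utf-8"))
text = decode_body(compressed, "gzip")
print(text)
```

With a real response this would be called as `decode_body(response.read(), response.headers.get("Content-Encoding"))`. The simpler route is to not send `Accept-Encoding` at all, which is effectively what the shorter snippet above does.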
余生分开走

2021-01-06 04:46

I tried the solutions above without success.

It appears that this problem can occur when the URL you are trying to open is badly formed (or just not what the REST service is expecting). For example, I found my problem was because I requested https://host.com/users/4484486 where the host was expecting a slash at the end: https://host.com/users/4484486/ solved the problem.
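
For a link checker, one way to handle this case is to retry a failing URL once with a trailing slash added before reporting it as broken. A minimal sketch follows; `with_trailing_slash` is a hypothetical helper name, and appending the slash is only a heuristic (it will also alter paths that point at files).

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return url with '/' appended to its path, leaving query and fragment intact."""
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        return url  # already has the trailing slash
    return urlunsplit(parts._replace(path=parts.path + "/"))


print(with_trailing_slash("https://host.com/users/4484486"))
```

The retry itself would be: open the original URL, and if that raises `HTTPError`, try `with_trailing_slash(url)` once.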
