I'm building a broken link checker in Python, and it's becoming a chore to build the logic for correctly identifying links that do not resolve when visited with a browser.
You get the infinite loop error because the page you want to scrape uses cookies and redirects when the cookie isn't sent by the client. You'll get the same error with most other scraper tools and also with browsers when you disallow cookies.
You need an http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid the redirect loop:
import urllib.error
import urllib.request
from http.cookiejar import CookieJar

# Placeholder: replace with the link being checked.
url = 'http://example.com/'

# Browser-like headers. Accept-Encoding is left out because urllib does not
# decompress gzip/deflate bodies, so decoding a compressed body gives garbage.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    # The cookie jar keeps cookies set during a redirect and sends them back
    # on the follow-up request, which is what breaks the loop.
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    # HTTP error statuses and undetected redirect loops both end up here.
    output = format(inst)
    print(output)
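For a link checker you will probably want this wrapped in a function that reports whether a link is broken instead of printing. Below is a minimal sketch of that idea, not a complete checker. The names build_cookie_opener and check_link, the (ok, detail) return shape and the 10 second default timeout are my own assumptions for illustration.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return an opener that stores cookies and sends them back on redirects."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def check_link(url, opener=None, timeout=10):
    """Return (ok, detail); ok is False when the link looks broken.

    Hypothetical helper: detail is the HTTP status code when the server
    answered, otherwise a short description of the failure.
    """
    if opener is None:
        opener = build_cookie_opener()
    try:
        with opener.open(url, timeout=timeout) as response:
            return True, response.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status, or urllib gave up
        # on a redirect loop. HTTPError must be caught before URLError
        # because it is a subclass of it.
        return False, err.code
    except urllib.error.URLError as err:
        # No usable answer: DNS failure, refused connection, and so on.
        return False, str(err.reason)
```

Calling check_link(url) with no opener gives each link a fresh cookie jar. If you check many links on the same site, passing one opener from build_cookie_opener() to every call lets the cookies carry over. This sketch only catches urllib's own errors; a malformed URL or a timeout while reading the body can still raise other exceptions.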