问题
Language Ver: Python 3.6.3
IDE Ver: PyCharm 2017.2.3
I was trying to parse a weather website to print weather for a place. As I am learning Python, previously I used urllib.request.urlopen(url).read() and it worked. Now, I am modifying the code to BeautifulSoup4 and requests module. Below is my code:
from bs4 import *
import requests
url = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
data = requests.get(url)
soup = BeautifulSoup(data.text, "html.parser")
print(soup.find('div', {'class': 'info'}))
But each time I try to run the code it gives me following error:
Traceback (most recent call last):
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
response.begin()
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1009, in recv_into
return self.read(nbytes, buffer)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 871, in read
return self._sslobj.read(len, buffer)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
timeout=timeout
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 357, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
response.begin()
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1009, in recv_into
return self.read(nbytes, buffer)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 871, in read
return self._sslobj.read(len, buffer)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/Projects/Python/Practice/Practice1.py", line 5, in
data = requests.get(url)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 490, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))
Process finished with exit code 1
What is this error and how to correct it? And why it worked in urllib, but not in requests?
回答1:
I used your code straight up and I got the same error then I followed how the requests are sent in browser. Some servers don't respond if expected headers are not sent with request that they use as part of backend processing. Turns out the server was looking for a header called user-agent
usually used to determine what client the request is from. Now, amended code below which works!
from bs4 import *
import requests
url = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, "html.parser")
Now you can play with your soup!
You can in fact pass more headers like accept, dnt, pragma, accept-language, cache-control
etc. Explanation of these http headers are for another question, another time. Hope it helps :)
回答2:
Try increasing the timeout parameter of your requests.get method :
requests.get(url, headers=headers, timeout=5)
But if your script is being blocked by the server to prevent scrapping attempts . If this is the case you can try faking a web browser by setting appropriate headers .
{"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}
your final code
import requests
url = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}
data = requests.get(url,headers=headers,timeout=5)
来源:https://stackoverflow.com/questions/47375153/requests-get-in-python-giving-connection-timeout-error