Web scraping with urlopen in Python

小蘑菇 2021-01-06 07:09

I am trying to get the data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems like urlopen doesn't get the data.

3 Answers
    说谎 2021-01-06 07:30

    What I suspect is happening is that the server is sending compressed data without telling you that it's doing so. Python's standard HTTP library can't handle compressed formats.
    I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).

    import httplib2

    # Http('.cache') keeps an on-disk cache in a folder called .cache; httplib2
    # also transparently decompresses gzip/deflate responses.
    folder = httplib2.Http('.cache')
    response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")
    

    print(response) shows us the response from the server:
    {'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

    While this doesn't confirm that it was zipped (we're now telling the server that we can handle compression, after all), it does lend some weight to the theory.
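
    In fact, that '-content-encoding': 'gzip' entry is, as far as I can tell, httplib2's way of recording that it already decompressed the body for you. If you want to check for it in code, here's a small sketch (assuming the key is present, as in the dict above):

    # httplib2's response object behaves like a dict, so we can look for the
    # '-content-encoding' key it leaves behind after decompressing the body.
    if response.get('-content-encoding') == 'gzip':
        print("Server sent gzip; httplib2 already decompressed it for us.")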

    The actual content lives in, you guessed it, content. Looking at it briefly shows us that it's working (I'm just gonna paste a wee bit):
    b'
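
    Since content comes back as bytes, you'll probably want to decode it with the charset the server reported (ISO-8859-1, according to the content-type header above). A quick sketch:

    # Decode the raw bytes using the charset from the content-type header.
    html = content.decode('ISO-8859-1')
    print(html[:200])  # peek at the first couple of hundred characters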

    Edit: yes, this does create a folder named .cache; I've found that it's always better to work with a cache folder when it comes to httplib2, and you can always delete the folder afterwards.
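
    If you'd rather stay with the standard library, something along these lines should also work in Python 3 (a rough sketch, untested against that particular server): read the raw bytes with urllib.request and decompress them yourself whenever the body starts with the gzip magic bytes, which covers the case where the server compresses without declaring it.

    import gzip
    import urllib.request

    url = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()

    # Decompress manually if the body looks gzipped, even when the server
    # didn't send a Content-Encoding header.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)

    html = raw.decode("ISO-8859-1")  # charset reported by the server
    print(html[:200])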
