Web scraping with urlopen in Python

小蘑菇 2021-01-06 07:09

I am trying to get the data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems like urlopen doesn't get the data.

3 Answers
    说谎 2021-01-06 07:30

    What I suspect is happening is that the server is sending compressed data without telling you that it's doing so. Python's standard HTTP library can't handle compressed formats.
    I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).

    import httplib2

    # Http('.cache') keeps an on-disk cache in a folder called .cache; httplib2
    # also transparently decompresses gzip/deflate responses.
    folder = httplib2.Http('.cache')
    response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")
    

    print(response) shows us the response from the server:
    {'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

    While this doesn't confirm that it was zipped (we're now telling the server that we can handle compression, after all), it does lend some weight to the theory.
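
    In fact, that '-content-encoding': 'gzip' entry is, as far as I can tell, httplib2's way of recording that it already decompressed the body for you. If you want to check for it in code, here's a small sketch (assuming the key is present, as in the dict above):

    # httplib2's response object behaves like a dict, so we can look for the
    # '-content-encoding' key it leaves behind after decompressing the body.
    if response.get('-content-encoding') == 'gzip':
        print("Server sent gzip; httplib2 already decompressed it for us.")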

    The actual content lives in, you guessed it, content. Looking at it briefly shows us that it's working (I'm just gonna paste a wee bit):
    b'
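
    Since content comes back as bytes, you'll probably want to decode it with the charset the server reported (ISO-8859-1, according to the content-type header above). A quick sketch:

    # Decode the raw bytes using the charset from the content-type header.
    html = content.decode('ISO-8859-1')
    print(html[:200])  # peek at the first couple of hundred characters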

    Edit: yes, this does create a folder named .cache; I've found that it's always better to work with a cache folder when it comes to httplib2, and you can always delete the folder afterwards.
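
    If you'd rather stay with the standard library, something along these lines should also work in Python 3 (a rough sketch, untested against that particular server): read the raw bytes with urllib.request and decompress them yourself whenever the body starts with the gzip magic bytes, which covers the case where the server compresses without declaring it.

    import gzip
    import urllib.request

    url = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()

    # Decompress manually if the body looks gzipped, even when the server
    # didn't send a Content-Encoding header.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)

    html = raw.decode("ISO-8859-1")  # charset reported by the server
    print(html[:200])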
