Web scraping with urlopen in Python

Question

I am trying to get the data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems that urlopen doesn't get the HTML code, and I don't understand why. My code goes like this:

import urllib.request

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS").read()
print(html)

My code is right: I get the HTML source of other web pages with the same code, but it doesn't seem to recognise this address.

It prints: b''

Maybe another library is more appropriate? Why doesn't urlopen return the HTML code of the web page? Thanks for any help!

Answer 1:

Personally, I write:

# Python 2.7

import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)
content = sock.read() 
sock.close()

print content

And if you speak French... welcome to stackoverflow.com!

update 1

In fact, I now prefer the following code, because it is faster:

# Python 2.7

import httplib

conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

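try:
    conn.request('GET', req)
    content = conn.getresponse().read()
except Exception as e:
    # Report the failure and stop: there is no response to read
    print 'connection failed:', e
    raise
finally:
    conn.close()

print content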

Changing httplib to http.client, and the print statements to print() calls, should be enough to adapt this code to Python 3.

I confirm that, with the two Python 2.7 snippets above, I obtain the page source, which contains the data you are interested in:
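For reference, here is a minimal sketch of what the Python 3 adaptation might look like. The helper name fetch and its parameters are my own choices for illustration, not part of the original code, and the sketch has not been checked against the live site.

```python
import http.client


def fetch(host, path, timeout=30):
    """Send a plain HTTP GET and return the raw response body as bytes.

    `fetch` is a hypothetical helper name used only for this sketch.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().read()
    finally:
        # Always release the socket, even if the request failed
        conn.close()


# Example call (performs a real network request, so it is left commented out):
# content = fetch("www.boursorama.com",
#                 "/includes/cours/last_transactions.phtml?symbole=1xEURUS")
# print(content)
```

Note that in Python 3 the body comes back as bytes, not str, which matters for any regex applied to it afterwards.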

        <td class="L20" width="33%" align="center">11:57:44</td>

        <td class="L20" width="33%" align="center">1.4486</td>

        <td class="L20" width="33%" align="center">0</td>

</tr>

                                        <tr>

        <td  width="33%" align="center">11:57:43</td>

        <td  width="33%" align="center">1.4486</td>

        <td  width="33%" align="center">0</td>

</tr>

update 2

Adding the following snippet to the above code will allow you to extract the data I suppose you want:

for i,line in enumerate(content.splitlines(True)):
    print str(i)+' '+repr(line)

print '\n\n'


import re

regx = re.compile('\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')

print regx.findall(content)

result (only the end)

.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '</script>\n'
104 '<script type="text/javascript">\n'
105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
107 '\t\t\t\tvar sas_formatids = "8968";\n'
108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
111 "\twindow.addEvent('domready', function(){\r\n"
112 'sas_move(1,8968);\t});\r\n'
113 '</script>\n'
114 '<script type="text/javascript">\n'
115 'var _gaq = _gaq || [];\n'
116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
120 "_gaq.push(['_trackPageLoadTime']);\n"
121 "_gaq.push(['_trackPageview']);\n"
122 '(function() {\n'
123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
126 '})();\n'
127 '</script>\n'
128 '</body>\n'
129 '</html>'



[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]

I hope you don't plan to "play" at trading on the Forex: it's one of the best ways to lose money quickly.

update 3

Sorry! I forgot that you are using Python 3. So I think you must define the regex like this:

regx = re.compile(b'\t\t\t\t\t......)

That is, with a b prefix before the string literal. Otherwise the pattern is a str while the content is bytes, and you'll get a TypeError like the one in this question.
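To make the bytes-pattern point concrete, here is a small self-contained sketch for Python 3. It is not the exact regex from update 2: it replaces the literal tabs and \r\n with \s* (my assumption, to tolerate whitespace differences), and the names ROW and extract_rows are hypothetical. The sample input is modelled on the HTML fragment shown earlier.

```python
import re

# Bytes pattern (note the rb prefix): required because the response body
# read in Python 3 is bytes, not str.
# Assumption: \s* between cells stands in for the literal tabs/newlines.
ROW = re.compile(
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>'
)


def extract_rows(content):
    """Return (time, price, volume) tuples as str from a bytes page."""
    return [tuple(field.decode("ascii") for field in match)
            for match in ROW.findall(content)]


sample = (
    b'\t<td class="L20" width="33%" align="center">11:57:44</td>\r\n'
    b'\t<td class="L20" width="33%" align="center">1.4486</td>\r\n'
    b'\t<td class="L20" width="33%" align="center">0</td>\r\n'
)
print(extract_rows(sample))  # [('11:57:44', '1.4486', '0')]
```

Passing a str pattern here instead would raise TypeError, because re refuses to mix str patterns with bytes input.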

Answer 2:

What I suspect is happening is that the server is sending compressed data without telling you that it's doing so. Python's standard HTTP library (urllib.request / http.client) does not transparently decompress compressed responses.
I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).

import httplib2
folder = httplib2.Http('.cache')
response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")

print(response) shows us the response from the server:
{'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

While this doesn't confirm that it was zipped (we're now telling the server that we can handle compression, after all), it does lend some weight to the theory.

The actual content lives in, you guessed it, content. Looking at it briefly shows us that it's working (I'm just gonna paste a wee bit):
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n\t"http://

Edit: yes, this does create a folder named .cache; I've found that it's always better to work with folders when it comes to httplib2, and you can always delete the folder afterwards.
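If installing httplib2 is not an option, the compression theory can also be tested with the standard library alone. The following is only a sketch under that assumption (that the body is gzip-compressed); the helper names decode_body and fetch are hypothetical, and I have not verified it against this particular server.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding):
    """Decompress `raw` if it is labelled gzip or starts with the gzip magic bytes.

    Sniffing the magic bytes covers the suspected case of a server sending
    compressed data without announcing it in Content-Encoding.
    """
    labelled = (content_encoding or "").lower() == "gzip"
    if labelled or raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    return raw


def fetch(url, timeout=30):
    """GET `url`, advertising gzip support, and return the decoded body bytes."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Calling fetch with the URL from the question would then show whether explicitly accepting gzip changes the empty result.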

Answer 3:

I have tested your URL with httplib2 and, in the terminal, with curl. Both work fine:

import httplib2

URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
h = httplib2.Http()
resp, content = h.request(URL, "GET")
print(content)

So to me, either there is a bug in urllib.request or there is really weird client-server interaction happening.

Source: https://stackoverflow.com/questions/7158353/web-scraping-urlopen-in-python

Tags

python

urlopen