Web scraping with urlopen in Python

小蘑菇 2021-01-06 07:09

I am trying to get the data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems like urlopen doesn't get the

3 answers
  • 2021-01-06 07:14

    I have tested your URL with httplib2 and with curl in the terminal. Both work fine:

    import httplib2
    
    URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
    h = httplib2.Http()
    resp, content = h.request(URL, "GET")
    print(content)
    

    So to me, either there is a bug in urllib.request or there is really weird client-server interaction happening.

  • 2021-01-06 07:29

    Personally, I write:

    # Python 2.7
    
    import urllib
    
    url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    sock = urllib.urlopen(url)
    content = sock.read() 
    sock.close()
    
    print content
    

    And if you speak French... hello on stackoverflow.com!

    update 1

    In fact, I now prefer the following code, because it is faster:

    # Python 2.7
    
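    import httplib
    import socket
    
    conn = httplib.HTTPConnection(host='www.boursorama.com', timeout=30)
    
    req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    
    try:
        conn.request('GET', req)
        # Read the response only if the request itself succeeded
        content = conn.getresponse().read()
    except (httplib.HTTPException, socket.error):
        print 'connection failed'
        raise
    finally:
        conn.close()
    
    print content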
    

    Changing httplib to http.client in this code should be enough to adapt it to Python 3.
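
    For reference, a minimal Python 3 sketch of that adaptation. The helper name fetch and its parameters are illustrative, not part of the original code, and the site may no longer serve this path over plain HTTP:

```python
import http.client


def fetch(host, path, port=80, timeout=30):
    """Send a plain-HTTP GET request and return (status, body as bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        # In Python 3, read() returns bytes, not str
        return response.status, response.read()
    finally:
        # Always release the socket, even if the request failed
        conn.close()


# Example call (needs network access):
# status, content = fetch("www.boursorama.com",
#                         "/includes/cours/last_transactions.phtml?symbole=1xEURUS")
```

    Note that content is a bytes object here, which matters for any regex applied to it afterwards.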


    I confirm that both snippets return the page source, which contains the data you are interested in:

            <td class="L20" width="33%" align="center">11:57:44</td>
    
            <td class="L20" width="33%" align="center">1.4486</td>
    
            <td class="L20" width="33%" align="center">0</td>
    
    </tr>
    
                                            <tr>
    
            <td  width="33%" align="center">11:57:43</td>
    
            <td  width="33%" align="center">1.4486</td>
    
            <td  width="33%" align="center">0</td>
    
    </tr>
    

    update 2

    Adding the following snippet to the above code will allow you to extract the data I suppose you want:

    import re
    
    # Print every line with its index to inspect the exact whitespace (Python 2.7)
    for i, line in enumerate(content.splitlines(True)):
        print str(i) + ' ' + repr(line)
    
    print '\n\n'
    
    # Raw strings pass the escapes (\d, \t, \r, \n) to the re module unchanged
    regx = re.compile(r'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                      r'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                      r'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')
    
    print regx.findall(content)
    

    result (only the end)

    .......................................
    .......................................
    .......................................
    .......................................
    98 'window.config.graphics = {};\n'
    99 'window.config.accordions = {};\n'
    100 '\n'
    101 "window.addEvent('domready', function(){\n"
    102 '});\n'
    103 '</script>\n'
    104 '<script type="text/javascript">\n'
    105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
    106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
    107 '\t\t\t\tvar sas_formatids = "8968";\n'
    108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
    109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
    110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
    111 "\twindow.addEvent('domready', function(){\r\n"
    112 'sas_move(1,8968);\t});\r\n'
    113 '</script>\n'
    114 '<script type="text/javascript">\n'
    115 'var _gaq = _gaq || [];\n'
    116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
    117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
    118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
    119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
    120 "_gaq.push(['_trackPageLoadTime']);\n"
    121 "_gaq.push(['_trackPageview']);\n"
    122 '(function() {\n'
    123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
    124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
    125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
    126 '})();\n'
    127 '</script>\n'
    128 '</body>\n'
    129 '</html>'
    
    
    
    [('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]
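
    The regex above depends on the exact tabs and line endings of the page. As a less brittle alternative, here is a minimal Python 3 sketch that uses the standard library's html.parser instead. This is a different technique from the regex in this answer. The names TransactionTableParser and parse_transactions are illustrative, and the assumption that every data row has exactly three cells comes only from the HTML excerpt shown earlier:

```python
from html.parser import HTMLParser


class TransactionTableParser(HTMLParser):
    """Collect the text of every <tr> that contains exactly three <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, as tuples of strings
        self._row = None       # cells of the row being read, or None outside <tr>
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Assumption: a data row is (time, price, volume)
            if len(self._row) == 3:
                self.rows.append(tuple(self._row))
            self._row = None
            self._in_cell = False


def parse_transactions(html_text):
    """Return a list of (time, price, volume) string tuples found in html_text."""
    parser = TransactionTableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

    The parser expects str input, so a bytes response would first need decoding, for example with content.decode('ISO-8859-1') if the server declares that charset.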
    

    I hope you don't plan to "play" at trading on the Forex: it's one of the best ways to lose money rapidly.

    update 3

    Sorry! I forgot you are using Python 3. So I think you must define the regex like this:

    regx = re.compile(b'\t\t\t\t\t......)

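    That is, with a b prefix before the string literal: in Python 3 the response body is a bytes object, so the pattern must be bytes as well. Otherwise you'll get an error like the one in this question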
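
    To illustrate the bytes requirement, here is a small Python 3 sketch. The pattern is deliberately looser than the one above (it accepts any attributes and any whitespace between cells), and the names ROW_PATTERN and extract_rows are illustrative assumptions, not the original code:

```python
import re

# A bytes pattern (rb'...') is required because the response body is bytes.
# Assumption: each row is three consecutive <td> cells: time, price, volume.
ROW_PATTERN = re.compile(
    rb'<td[^>]*>(\d\d:\d\d:\d\d)</td>\s*'
    rb'<td[^>]*>([\d.]+)</td>\s*'
    rb'<td[^>]*>(\d+)</td>'
)


def extract_rows(content, encoding="ISO-8859-1"):
    """Return (time, price, volume) tuples of str from a bytes HTML body."""
    return [
        tuple(field.decode(encoding) for field in match)
        for match in ROW_PATTERN.findall(content)
    ]
```

    Applying a str pattern to the same bytes content raises a TypeError, which is the kind of error the b prefix avoids.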

  • 2021-01-06 07:30

    I suspect the server is sending compressed data without telling you that it's doing so. Python's standard HTTP library does not decompress response bodies for you.
    I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).

    import httplib2
    folder = httplib2.Http('.cache')
    response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")
    

    print(response) shows us the response from the server:
    {'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

    While this doesn't confirm that the original response was gzipped (we're now telling the server that we can handle compression, after all), it does lend some weight to the theory.

    The actual content lives in, you guessed it, content. Looking at it briefly shows us that it's working (I'm just gonna paste a wee bit):
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n\t"http://

    Edit: yes, this does create a folder named .cache; I've found that it's always better to work with folders when it comes to httplib2, and you can always delete the folder afterwards.
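
    If the compression theory is right, the standard library can still be used by decompressing the body yourself. The following is a minimal Python 3 sketch under that assumption. The function names decode_body and fetch_text are illustrative, and the ISO-8859-1 default is taken from the content-type header shown above:

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding=None, charset="ISO-8859-1"):
    """Decompress raw bytes if they are gzipped, then decode them to str."""
    declared = (content_encoding or "").lower() == "gzip"
    # Gzip streams start with the magic bytes 1f 8b, which lets us catch
    # a server that compresses without sending a Content-Encoding header.
    looks_gzipped = raw[:2] == b"\x1f\x8b"
    if declared or looks_gzipped:
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url, timeout=30):
    """Fetch url with urllib.request and return the decoded page text."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
        charset = response.headers.get_content_charset() or "ISO-8859-1"
    return decode_body(raw, encoding, charset)
```

    Calling fetch_text with the URL from the question needs network access, and the page may have changed since this answer was written.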
