Getting data from a chart that is displayed on a website

梦如初夏 2021-02-06 06:21

I was asked to draw a graph like this one

[image]

using Latex (more precisely, tik

1 Answer
  • 2021-02-06 06:55

    Well, it'd be great if Google provided an API for this data! That said, you can still scrape some data out of the site. Here's how to go about it...

    Install Firebug

    I prefer Firebug for Firefox, but Chrome's developer tools should also work.

    Investigate

    First things first: visit the URL in question and use Firebug to see what's going on. Activate Firebug with F12 or go to Tools->Firebug->Open Firebug. Click on the Net tab and reload the page. This shows all the requests made and gives you some insight into how the site works. Flash plugins usually load their data externally rather than embedding it in the plugin itself, and if you look at the requests you'll see a request labeled POST service. If you hover over it, Firebug shows the full URL, and you'll see that the page made a request to http://www.google.com/transparencyreport/traffic/service. You can click on the request to look at the headers sent, the post data, the response, and the cookies used to perform the request.

    Request detail

    If you look at the response, you'll see what appears to be malformed JSON. From what I can tell, it contains the list of normalized traffic data points. You could simply cut and paste the response out of Firebug, but since this IS a Python question, let's work a bit harder.

    Getting the data into Python

    To make the POST request successfully, we'll need to do (nearly) everything the browser does. We can cheat a bit and just copy the request headers and post data out of Firebug to spoof a real request.

    Headers & post data

    Use triple quotes to paste multi-line strings into the shell. Copy the request headers and paste them in.

    >>> headers = """ <paste headers> """
    

    Next, convert it to a dict for httplib2. I'm going to use a list comprehension that splits the string on newlines, then splits each line on the first : and strips the surrounding whitespace from both parts. That gives me a list of two-element lists, which dict can convert into a dictionary. You could do this however you want, including creating the dict manually; I just find this faster.

    >>> headers = dict([[s.strip() for s in line.split(':', 1)]
                                   for line in headers.strip().split('\n')])
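
The same parsing step can be sketched as a small standalone function. This is only an illustration: `parse_headers` is a hypothetical helper name, and the header lines below are made-up placeholders, not the headers the site actually sends. It also shows why splitting on the first colon only matters, since values such as URLs contain colons themselves.

```python
def parse_headers(raw):
    """Turn a block of 'Name: value' lines into a dict (illustrative helper)."""
    headers = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so colons inside the value survive.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


# Placeholder headers for demonstration; paste the real ones from the browser.
raw = """
Host: www.google.com
Content-Type: text/plain; charset=utf-8
Referer: http://www.google.com/transparencyreport/traffic/
"""

parsed = parse_headers(raw)
print(parsed)
```

Unlike the one-line comprehension, this version silently skips lines without a colon instead of raising an error when `dict` receives a one-element list.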
    

    Then copy in the post data used for the chart we are interested in.

    >>> body = """ <paste post data> """
    

    Make the request

    I'm going to use httplib2, but there are a few other HTTP clients and some nice tools for scraping the web, such as mechanize and scrapy. We'll make the POST request using the URL of the API, the headers we copied, and the post data we copied from Firebug. The request returns a tuple of response headers and content.

    >>> import httplib2 
    >>> h = httplib2.Http()
    >>> url = 'http://www.google.com/transparencyreport/traffic/service'
    >>> resp, content = h.request(url, 'POST', body=body, headers=headers)
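
If httplib2 is not available, the same POST can probably be built with the standard library's `urllib.request`. This is a sketch of an alternative, not what the answer above uses: the header dict and body here are placeholders, and the actual send is left commented out so the snippet runs without network access.

```python
import urllib.request

url = "http://www.google.com/transparencyreport/traffic/service"

# Placeholders: substitute the headers and post data copied from the browser.
headers = {"Content-Type": "text/plain; charset=utf-8"}
body = "<paste post data>"

# urllib expects the request body as bytes.
req = urllib.request.Request(
    url,
    data=body.encode("utf-8"),
    headers=headers,
    method="POST",
)
print(req.get_method(), req.full_url)

# Sending the request (uncomment to actually hit the network):
# with urllib.request.urlopen(req) as resp:
#     content = resp.read().decode("utf-8")
```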
    

    Massage Data

    The original format is really weird and only the top bit seems to contain the data points, so I'll ditch the rest.

    >>> cleaned = content.split("'")[0][4:-1] + ']' 
    

    Now that it's valid JSON, we can deserialize it into native Python data types.

    >>> import json
    >>> data = json.loads(cleaned)
    

    All of the points I'm interested in are floats, so I'll filter based on that.

    >>> data = [x for x in data if type(x) == float]
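
To make the three clean-up steps above concrete, here is a minimal sketch that runs them on an invented sample. The sample string is an assumption about the general shape of the response (a 4-character prefix, a bracketed list of numbers, then quoted strings); the real payload may differ. `extract_floats` is a hypothetical helper name. Under Python 3 the content may arrive as bytes, so the sketch decodes it first.

```python
import json


def extract_floats(content):
    """Apply the slicing, JSON parsing, and float filtering steps to a response."""
    if isinstance(content, bytes):
        content = content.decode("utf-8")
    # Keep everything before the first single quote (the numeric part).
    head = content.split("'")[0]
    # Drop the 4-character prefix and the trailing comma, then close the list.
    cleaned = head[4:-1] + "]"
    values = json.loads(cleaned)
    # Keep only floats; integers in the payload are treated as non-data.
    return [x for x in values if isinstance(x, float)]


# Invented sample, only shaped like the response described above.
sample = "//OK[44.5,45.25,3,'some.Type/123',0,7]"
points = extract_floats(sample)
print(points)
```

The slicing is brittle by design: it depends on the prefix being exactly four characters and on the first quoted string coming after all the data points, so it is worth re-checking against a fresh response before relying on it.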
    

    Process/Save Data

    Now that we have our data, inspect it, do additional processing, etc...

    >>> data[:5] 
    <<< 
    [44.73874282836914,
     45.4061279296875,
     47.5350456237793,
     44.56114196777344,
     46.08817672729492]
    

    ...or just save it.

    >>> with open('data.json', 'w') as f:
    ...:     f.write(json.dumps(data))
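
As a quick check that the saved file can be read back, the round trip can be sketched like this. The values are the first three points shown above, and the temporary directory is just a convenience for the example; `json.dump` writes straight to the file object and is equivalent to `f.write(json.dumps(data))`.

```python
import json
import os
import tempfile

data = [44.73874282836914, 45.4061279296875, 47.5350456237793]

# Write to a throwaway location so the example does not clobber real files.
path = os.path.join(tempfile.mkdtemp(), "data.json")

with open(path, "w") as f:
    json.dump(data, f)

with open(path) as f:
    restored = json.load(f)

print(restored == data)
```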
    

    We could also plot it out using pyplot from matplotlib (or some other graphing/plotting library).

    >>> import matplotlib.pyplot as plt
    >>> plt.plot(data)
    >>> plt.show()  # open the plot window if it does not appear on its own
    

    Pyplot

    Conclusion

    If you are just interested in a few things, you can adjust the chart to display what you want and then reuse the request headers and post data from the corresponding request to http://www.google.com/transparencyreport/traffic/service. You might want to inspect the actual response more closely than I did; I just discarded the parts that didn't make sense to me. Hopefully they'll expose a public API for this data.
