How to scrape charts from a website with Python?

我寻月下人不归 2021-01-07 15:10

EDIT:

So I have saved the script code below to a text file, but using re to extract the data still doesn't return anything. My code is:

2 answers
  • 2021-01-07 15:51

    Another way is to query the Highcharts JavaScript API, as you would in the browser console, and pull the result out with Selenium.

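    import time
    from selenium import webdriver
    
    website = ""  # URL of the page that renders the chart
    
    driver = webdriver.Firefox()
    driver.get(website)
    time.sleep(5)  # crude wait for the page's JavaScript to build the chart
    
    # Read the first series of the first chart straight from the Highcharts object
    temp = driver.execute_script('return window.Highcharts.charts[0]'
                                 '.series[0].options.data')
    # Each point is assumed to be an [x, y] pair; keep only the y values
    data = [item[1] for item in temp]
    print(data)
    driver.quit()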
    

    Depending on what chart and series you are trying to pull your case might be slightly different.
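
    One way the case can differ is the shape of the points: Highcharts accepts series data as bare numbers, as arrays such as [x, y], or as objects with a y key, and Selenium hands these back as Python numbers, lists, and dicts. A minimal sketch of normalizing them follows. The helper name point_values is hypothetical, and taking the last array element as the y value is an assumption that holds for simple [x, y] series only.

```python
def point_values(raw_points):
    """Return the y value of each point, whatever shape the point has."""
    values = []
    for point in raw_points:
        if isinstance(point, dict):
            # Object form, e.g. {"x": 1999, "y": 22}
            values.append(point.get("y"))
        elif isinstance(point, (list, tuple)):
            # Array form, e.g. [1999, 22]; assume the y value comes last
            values.append(point[-1])
        else:
            # Bare number, or None for a gap in the series
            values.append(point)
    return values


print(point_values([22, 1, None]))
print(point_values([[1999, 22], [2000, 1]]))
print(point_values([{"x": 1999, "y": 22}, {"y": 1}]))
```

    With this in place, the list comprehension in the snippet above could be replaced by point_values(temp).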

  • 2021-01-07 16:07

    I'd go with a combination of a regex and a YAML parser. Quick and dirty below; you may need to tweak the regex, but it works with your example:

    import re
    import sys
    import yaml
    
    # Capture the variable name and the object literal passed to Highcharts.Chart
    chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$',
            re.MULTILINE | re.DOTALL)
    
    # The contents of the <script> tag are read from standard input
    script = sys.stdin.read()
    
    m = chart_matcher.findall(script)
    
    for name, data in m:
        print(name)
        try:
            chart = yaml.safe_load(data)
            print("categories:", chart['xAxis'][0]['categories'])
            print("data:", chart['series'][0]['data'])
        except Exception as e:
            print(e)
    

    Requires the yaml library (pip install PyYAML) and you should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.
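
    To see what the regex captures, here is a small sketch that runs it on a made-up script. The sample input is an assumption about the page's layout, not the real source. Note that the non-greedy match stops at the first `});` that ends a line, so an object containing such a line inside it would be cut short.

```python
import re

chart_matcher = re.compile(
    r'^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$',
    re.MULTILINE | re.DOTALL)

# Made-up input, shaped like the declarations the regex expects
sample_script = """\
var chart1 = new Highcharts.Chart({
    chart: {renderTo: 'container1'},
    series: [{data: [22, 1, 0]}]
});
var chart2 = new Highcharts.Chart({
    series: [{data: [5, 6]}]
});
"""

# Each match is a (variable name, object literal) pair
matches = chart_matcher.findall(sample_script)
names = [name for name, _ in matches]
print(names)
print(matches[1][1])
```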

    EDIT - full example

    Sorry I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because it's not valid JSON, but plain JavaScript object declarations (i.e. with no functions) are a subset of YAML.
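
    The point about json is easy to check. In this sketch the object literal is made up and the helper is_valid_json is hypothetical. The JavaScript form, with unquoted keys and single-quoted strings, is rejected, while the same data written as strict JSON is accepted.

```python
import json


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# JavaScript object literal: unquoted keys, single-quoted string
js_literal = "{series: [{name: 'Cites', data: [22, 1, 0]}]}"
# The same data as strict JSON
json_text = '{"series": [{"name": "Cites", "data": [22, 1, 0]}]}'

print(is_valid_json(js_literal))  # False
print(is_valid_json(json_text))   # True
```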

    from bs4 import BeautifulSoup
    import yaml
    import re
    
    with open('source_test_script.txt', mode="r") as file_object:
        soup = BeautifulSoup(file_object, "html.parser")
    
    pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);", re.MULTILINE | re.DOTALL | re.UNICODE)
    
    charts = {}
    
    # find every <script> tag in the source using beautifulsoup
    for tag in soup.find_all('script'):
    
        # tabs are special in yaml so remove them first
        script = tag.text.replace('\t', '')
    
        # find each object declaration
        for name, obj_declaration in pattern.findall(script):
            try:
                # parse the javascript declaration
                charts[name] = yaml.safe_load(obj_declaration)
            except Exception as e:
                print("Failed to parse {0}: {1}".format(name, e))
    
    # extract the data you want
    for name in charts:
        print("## {0} ##".format(name))
        print("categories:", charts[name]['xAxis'][0]['categories'])
        print("data:", charts[name]['series'][0]['data'])
        print()
    

    Output:

    ## chart1 ##
    categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]
    

    Note that I had to tweak the regex to make it handle the unicode output and whitespace from BeautifulSoup; in my original example I just piped your source directly to the regex.
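
    Comparing the two patterns, one visible change is that the second drops the ^ and $ anchors. A small sketch, with a made-up one-line script, of why that matters once the declaration has whitespace around it:

```python
import re

FLAGS = re.MULTILINE | re.DOTALL
anchored = re.compile(
    r'^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$', FLAGS)
unanchored = re.compile(
    r'var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);', FLAGS)

# Made-up declaration with leading and trailing whitespace
indented_script = "    var chart1 = new Highcharts.Chart({series: [{data: [1, 2]}]});  \n"

anchored_hits = anchored.findall(indented_script)
unanchored_hits = unanchored.findall(indented_script)

print(anchored_hits)    # no match: "var" is not at the start of the line
print(unanchored_hits)  # the declaration is found anywhere in the line
```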

    EDIT 2 - no yaml

    Given that the JavaScript looks to be partially generated, the best you can hope for is to grab the relevant lines. It's not elegant, but it will probably work for you.

    from bs4 import BeautifulSoup
    import json
    import re
    
    with open('citec.repec.org_p_c_pcl20.html', mode="r") as file_object:
        soup = BeautifulSoup(file_object, "html.parser")
    
    pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);", re.MULTILINE | re.DOTALL | re.UNICODE)
    
    charts = {}
    
    for tag in soup.find_all('script'):
    
        script = tag.text
    
        # find each object declaration
        for name, obj_declaration in pattern.findall(script):
            # start a fresh dict so charts in the same script don't share values
            values = {}
            for line in obj_declaration.split('\n'):
                line = line.strip('\t\n ,;')
                for field in ('data', 'categories'):
                    if line.startswith(field + ":"):
                        data = line[len(field)+1:]
                        try:
                            # the value after "field:" must be valid JSON
                            values[field] = json.loads(data)
                        except ValueError:
                            print("Failed to parse %r for %s" % (data, name))
    
            charts[name] = values
    
    print(charts)
    

    Note that it fails for chart7 because that references another variable.
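
    The failure is easy to reproduce on a single line. In this sketch the helper parse_field and the variable name chartData7 are hypothetical. A literal array parses as JSON, while a bare identifier does not.

```python
import json


def parse_field(line, field):
    """Return the parsed value of a `field: value` line, or None."""
    line = line.strip('\t\n ,;')
    if not line.startswith(field + ":"):
        return None
    try:
        return json.loads(line[len(field) + 1:])
    except ValueError:
        # Not a JSON literal, e.g. a reference to another variable
        return None


print(parse_field("    data: [22, 1, 0],", "data"))
print(parse_field("    data: chartData7,", "data"))
```

    To handle such a chart you would have to find the referenced variable's own declaration in the script and parse that instead.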
