How to scrape charts from a website with python?

前端未结

关注

 2  1253

EDIT:

So I have save the script codes below to a text file but using re to extract the data still doesn\'t return me anything. My code is:

相关标签:

2条回答

旧巷少年郎

2021-01-07 15:51

Another way is to use Highcharts' JavaScript Library as one would in the console and pull that using Selenium.

import time
from selenium import webdriver

website = ""

driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
data = [item[1] for item in temp]
print(data)

Depending on what chart and series you are trying to pull your case might be slightly different.

0 讨论(0)

深忆病人

2021-01-07 16:07

I'd go a combination of regex and yaml parser. Quick and dirty below - you may need to tweek the regex but it works with example:

import re
import sys
import yaml

chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$',
        re.MULTILINE | re.DOTALL)

script = sys.stdin.read()

m = chart_matcher.findall(script)

for name, data in m:
    print name
    try:
        chart = yaml.safe_load(data)
        print "categories:", chart['xAxis'][0]['categories']
        print "data:", chart['series'][0]['data']
    except Exception, e:
        print e

Requires the yaml library (pip install PyYAML) and you should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.

EDIT - full example

Sorry I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the javascript object declaration. You can't use the built in json library because its not valid JSON but plain javascript object declarations (ie with no functions) are a subset of YAML.

from bs4 import BeautifulSoup
import yaml
import re

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception, e:
            print "Failed to parse {0}: {1}".format(name, e)

# extract the data you want
for name in charts:
    print "## {0} ##".format(name);
    print "categories:", charts[name]['xAxis'][0]['categories']
    print "data:", charts[name]['series'][0]['data']
    print

Output:

## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]

Note I had to tweek the regex to make it handle the unicode output and whitespace from BeautifulSoup - in my original example I just piped your source directly to the regex.

EDIT 2 - no yaml

Given that the javascript looks to be partially generated the best you can hope for is to grab the lines - not elegant but will probably work for you.

from bs4 import BeautifulSoup
import json
import re

file_object = open('citec.repec.org_p_c_pcl20.html', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text

    values = {}

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        for line in obj_declaration.split('\n'):
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        values[field] = json.loads(data)
                    except:
                        print "Failed to parse %r for %s" % (data, name)

        charts[name] = values

print charts

Note that it fails for chart7 because that references another variable.

0 讨论(0)