EDIT:
So I have saved the script code below to a text file, but using re to extract the data still doesn't return anything. My code is:
Another way is to query the Highcharts JavaScript object, as you would in the browser console, and pull the data out using Selenium.
import time
from selenium import webdriver
website = ""
driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)
temp = driver.execute_script('return window.Highcharts.charts[0]'
'.series[0].options.data')
data = [item[1] for item in temp]
print(data)
Depending on which chart and series you are trying to pull, your case might be slightly different.
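For instance, the item[1] indexing above assumes every point is an [x, y] pair. Highcharts also accepts bare numbers and point objects with a y key, so a small helper along these lines may save some trial and error. This is only a sketch: extract_y is a hypothetical name, the sample lists stand in for whatever execute_script returns, and series types with more than two values per point are not handled.

```python
def extract_y(points):
    """Return the y-values from a list of Highcharts-style points."""
    values = []
    for point in points:
        if isinstance(point, dict):
            # point object, e.g. {"x": 1999, "y": 22}
            values.append(point.get("y"))
        elif isinstance(point, (list, tuple)):
            # [x, y] pair, e.g. [1999, 22]
            values.append(point[1])
        else:
            # bare number (or None for a missing point)
            values.append(point)
    return values


print(extract_y([[1999, 22], [2000, 1]]))
print(extract_y([22, 1, 0]))
print(extract_y([{"x": 1999, "y": 22}, {"y": 1}]))
```

With Selenium you would call it as data = extract_y(temp) in place of the list comprehension.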
I'd go with a combination of regex and a YAML parser. Quick and dirty below - you may need to tweak the regex, but it works with the example:
import re
import sys
import yaml

chart_matcher = re.compile(
    r'^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$',
    re.MULTILINE | re.DOTALL)

# read the script source from stdin
script = sys.stdin.read()
for name, data in chart_matcher.findall(script):
    print(name)
    try:
        chart = yaml.safe_load(data)
        print("categories:", chart['xAxis'][0]['categories'])
        print("data:", chart['series'][0]['data'])
    except Exception as e:
        print(e)
Requires the yaml library (pip install PyYAML), and you should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.
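To see what the regex captures before involving a parser, here is a minimal sketch run against a made-up two-chart script. SAMPLE and its contents are invented for illustration; the real page will differ, and the pattern assumes each declaration ends with "});".

```python
import re

# Hypothetical script text in the same shape as the page source.
SAMPLE = """
var chart1 = new Highcharts.Chart({
    chart: {renderTo: 'container1'},
    series: [{name: 'Cites', data: [22, 1, 0]}]
});
var chart2 = new Highcharts.Chart({
    chart: {renderTo: 'container2'},
    series: [{name: 'Cites', data: [5, 6]}]
});
"""

# DOTALL lets ".*?" span newlines; the lazy quantifier stops at the
# first "}" that is followed by ");", i.e. the end of one declaration.
pattern = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);", re.DOTALL)

matches = pattern.findall(SAMPLE)
names = [name for name, _ in matches]
print(names)
print(matches[0][1])
```

Each match is a (name, object_text) pair, and object_text is what gets handed to the parser.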
EDIT - full example
Sorry, I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because it's not valid JSON, but plain JavaScript object declarations (i.e. with no functions) can generally be parsed as YAML.
from bs4 import BeautifulSoup
import yaml
import re

with open('source_test_script.txt', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);",
                     re.MULTILINE | re.DOTALL | re.UNICODE)
charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):
    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')
    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()
Output:
## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]
Note I had to tweak the regex to make it handle the Unicode output and whitespace from BeautifulSoup - in my original example I just piped your source directly to the regex.
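If you want to check for yourself why the json module is not enough here, a short sketch like the following may help. The js_object string is a made-up fragment in the style of a Highcharts options object, and try_json is a hypothetical helper; the YAML half only runs if PyYAML is installed.

```python
import json

# Hypothetical fragment: unquoted keys and single-quoted strings.
js_object = "{chart: {renderTo: 'container'}, series: [{data: [22, 1, 0]}]}"


def try_json(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        return None


# JSON requires double-quoted keys, so this is rejected.
print(try_json(js_object))
# The same kind of structure written as strict JSON parses fine.
print(try_json('{"series": [{"data": [22, 1, 0]}]}'))

try:
    import yaml  # PyYAML, optional for this sketch
    print(yaml.safe_load(js_object))
except ImportError:
    print("PyYAML not installed; skipping the YAML half")
```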
EDIT 2 - no yaml
Given that the JavaScript looks to be partially generated, the best you can hope for is to grab the individual lines - not elegant, but it will probably work for you.
from bs4 import BeautifulSoup
import json
import re

with open('citec.repec.org_p_c_pcl20.html', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);",
                     re.MULTILINE | re.DOTALL | re.UNICODE)
charts = {}

for tag in soup.find_all('script'):
    script = tag.text
    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        # start a fresh dict per chart so charts don't share values
        values = {}
        for line in obj_declaration.split('\n'):
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))
        charts[name] = values

print(charts)
Note that it fails for chart7 because that references another variable.
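One possible way around that, sketched below, is to fall back to looking up the variable when the value is a bare identifier. This assumes the variable is declared in the same script as a JSON-compatible array literal ("var name = [...];"); SCRIPT and resolve_value are made up for illustration, and I don't know how chart7's variable is actually declared on the real page.

```python
import json
import re

# Hypothetical script: the chart's data points at a variable, not a literal.
SCRIPT = """
var cites = [3, 1, 4, 1, 5];
var chart7 = new Highcharts.Chart({
    series: [{
        data: cites
    }]
});
"""


def resolve_value(raw, script):
    """Parse raw as JSON; if it is a bare identifier, look up its declaration."""
    raw = raw.strip()
    try:
        return json.loads(raw)
    except ValueError:
        pass
    # Only try the lookup for something that looks like a JS identifier.
    if not re.fullmatch(r"[A-Za-z_$][\w$]*", raw):
        return None
    # Look for "var <name> = [...];" anywhere in the same script.
    decl = re.search(
        r"var\s+" + re.escape(raw) + r"\s*=\s*(\[.*?\])\s*;", script, re.DOTALL)
    if decl is None:
        return None
    try:
        return json.loads(decl.group(1))
    except ValueError:
        return None


print(resolve_value("cites", SCRIPT))
print(resolve_value("[1, 2]", SCRIPT))
print(resolve_value("missing", SCRIPT))
```

In the loop above you could call resolve_value(data, script) instead of json.loads(data) and treat a None result as a parse failure.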