When screen-scraping some website, I extract data from tags.
The data I get is not in standard JSON
format. I cannot use j
import demjson
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = demjson.decode(js_obj)
import json, _jsonnet
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = json.loads(_jsonnet.evaluate_snippet('snippet', js_obj))
import ast
# from
js_obj = "{'x':1, 'y':2, 'z':3}"
# to
py_obj = ast.literal_eval(js_obj)
If you have node
available on the system, you can ask it to evaluate the javascript expression for you, and print the stringified result. The resulting JSON can then be fed to json.loads
:
def evaluate_javascript(s):
"""Evaluate and stringify a javascript expression in node.js, and convert the
resulting JSON to a Python object"""
node = Popen(['node', '-'], stdin=PIPE, stdout=PIPE)
stdout, _ = node.communicate(f'console.log(JSON.stringify({s}))'.encode('utf8'))
return json.loads(stdout.decode('utf8'))
This will likely not work everywhere, but as a start, here's a simple regex that should convert the keys into quoted strings so you can pass into json.loads. Or is this what you're already doing?
In[70] : quote_keys_regex = r'([\{\s,])(\w+)(:)'
In[71] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj)
Out[71]: '{"x":1, "y":2, "z":3}'
In[72] : js_obj_2 = '{x:1, y:2, z:{k:3,j:2}}'
Int[73]: re.sub(quote_keys_regex, r'\1"\2"\3', js_obj_2)
Out[73]: '{"x":1, "y":2, "z":{"k":3,"j":2}}'
Not including objects
json.loads()
doesn't accept undefined, you have to change to nulljson.loads()
only accept double quotes
{"foo": 1, "bar": null}
Use this if you are sure that your javascript code only have double quotes on key names.
import json
json_text = """{"foo": 1, "bar": undefined}"""
json_text = re.sub(r'("\s*:\s*)undefined(\s*[,}])', '\\1null\\2', json_text)
py_obj = json.loads(json_text)
ast.literal_eval()
doesn't accept undefined, you have to change to Noneast.literal_eval()
doesn't accept null, you have to change to Noneast.literal_eval()
doesn't accept true, you have to change to Trueast.literal_eval()
doesn't accept false, you have to change to Falseast.literal_eval()
accept single and double quotes
{"foo": 1, "bar": None}
or {'foo': 1, 'bar': None}
import ast
js_obj = """{'foo': 1, 'bar': undefined}"""
js_obj = re.sub(r'([\'\"]\s*:\s*)undefined(\s*[,}])', '\\1None\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)null(\s*[,}])', '\\1None\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)NaN(\s*[,}])', '\\1None\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)true(\s*[,}])', '\\1True\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)false(\s*[,}])', '\\1False\\2', js_obj)
py_obj = ast.literal_eval(js_obj)