How to execute OpenRefine JSON on CSV in Python?

Submitted by 北战南征 on 2019-12-06 09:47:34

Question


I am trying to find a Python solution that can execute the following OpenRefine JSON operations (including their embedded Python/Jython commands) without an OpenRefine server running. My OpenRefine JSON contains mappings and custom Python commands for each field of a properly formatted CSV file, so this is not basic JSON reading. Here is one example of an OpenRefine JSON script containing only regex mappings:

[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Sleep using expression jython:import re\n\nvalue = re.sub(\"h0\", \"h\",value)\n\nvalue = re.sub(\"h\",\"*60+\", value)\n\nreturn eval(value)\n\n \nreturn eval(value.replace(\"h\", \"*60+\"));",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Sleep",
    "expression": "jython:import re\n\nvalue = re.sub(\"h0\", \"h\",value)\n\nvalue = re.sub(\"h\",\"*60+\", value)\n\nreturn eval(value)\n\n \nreturn eval(value.replace(\"h\", \"*60+\"));",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]

One solution would be to process the JSON element by element, handling each type of operation separately (a minimal sketch of this approach follows the environment details below), but there may be an easier solution using an existing package.

Python: 3.5.2
OS: Debian 9
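For what it is worth, here is a minimal hand-rolled sketch of that element-by-element idea. The file names and the sleep_to_minutes handler are only illustrative (not part of any library), and every operation type you actually use would need its own handler:

import json
import re

import pandas as pd

def sleep_to_minutes(value):
    # Hand-written equivalent of the Jython expression above:
    # "7h05" -> "7h5" -> "7*60+5" -> 425, "7h30" -> "7*60+30" -> 450
    value = re.sub("h0", "h", value)
    value = re.sub("h", "*60+", value)
    return eval(value)

# One handler per (op, columnName) pair found in the exported JSON.
HANDLERS = {("core/text-transform", "Sleep"): sleep_to_minutes}

df = pd.read_csv("input.csv")
with open("transform.json") as f:
    operations = json.load(f)

for op in operations:
    handler = HANDLERS.get((op["op"], op.get("columnName")))
    if handler is not None:
        df[op["columnName"]] = df[op["columnName"]].apply(handler)

df.to_csv("output.csv", index=False)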


Answer 1:


The pyrefine project aims to do exactly that, but it is still a work in progress and very few operations are supported so far. Contributors are welcome!




Answer 2:


I do not know of a ready-made solution that does this directly. A workaround might be to transform the Jython scripts into functions and then apply them to your CSV using pandas.

import json
import re

# Load the exported OpenRefine operation history.
with open("your open refine json", "r") as infile:
    data = json.load(infile)

with open("result.py", "w") as outfile:
    count = 1  # counter kept outside the loop so each operation gets its own function name
    for el in data:
        column = el['columnName']
        # Strip the "jython:" prefix and reindent the expression as a function body.
        expression = el['expression'].replace("jython:", "")
        expression = re.sub(r"\n\n\s?\n?", "\n    ", expression)
        expression = re.sub(r";$", "", expression)
        result = "def function%s(value):\n    %s" % (count, expression)
        outfile.write(result +
                      "\n\n" +
                      "mycsv['%s'] = mycsv['%s'].apply(lambda x: function%s(x))\n" % (column, column, count))
        count += 1

The result on your JSON:

def function1(value):
    import re
    value = re.sub("h0", "h",value)
    value = re.sub("h","*60+", value)
    return eval(value)
    return eval(value.replace("h", "*60+")) # is this second return an error ?

mycsv['Sleep'] = mycsv['Sleep'].apply(lambda x: function1(x))
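(In plain Python the second return is indeed unreachable dead code: the function exits at the first return, so function1 behaves the same with or without it.)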

Finally, you can add these lines at the top of result.py, launch the script, and pray...

import pandas as pd 

mycsv = pd.read_csv("your csv", encoding = "utf8")
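If the transforms run through, you will presumably also want to write the frame back out, for example (the output file name is illustrative):

mycsv.to_csv("output.csv", index=False, encoding="utf8")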



Answer 3:


I have found a variety of alternatives and compared them with each other below. Pyrefine is so far the only genuine Python solution.

Alternatives

I. A partial solution: use Python to generate a dictionary-style R function that performs the conversions. This does not implement GREL edits, Jython/Python edits or Clojure edits.

#!/usr/bin/env python2
#
# Description
#      This builds a dictionary-style R function with Python
#      to apply the JSON edits to other data; only cluster edits are supported.
#
#      The original source is (1), on which I have made some remarks.
#
# Further reading
#
# (1) Original source of the code, https://medium.com/optima-blog/semi-automated-text-cleaning-in-r-68054a9491da
import json
import sys
import os

if len(sys.argv) < 2:
    print "USAGE: ./utils/open_refine_to_R.py [edits.json] > r_file.R"
    exit(1)

json_file = sys.argv[-1]
# conversions = json.load(open("state_clustering.json"))
conversions = json.load(open(json_file))
function_name = os.path.splitext(os.path.basename(json_file))[0]

print "%s = function(df) {" % function_name
for conv in conversions:
    # This will fire errors without try-catch, e.g. with regexes
    edits = conv['edits']
    columnName = str(conv['columnName'])
    for edit in edits:
        froms = edit['from']
        to = edit['to']
        for source in froms:
            source = str(source)
            to = str(to)
            print " df[df[, %s] == %s, %s] = %s" % (repr(columnName),
                repr(source), repr(columnName), repr(to))
print " df"
print "}"

The generated R code could then be adapted to Python.
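For comparison, the same cluster-edit mappings can also be applied directly in pandas without the detour through R. This is only a sketch: it assumes the usual mass-edit layout with 'edits', 'from' and 'to' keys, and the file names are placeholders:

import json
import pandas as pd

with open("edits.json") as f:
    conversions = json.load(f)

df = pd.read_csv("input.csv")

for conv in conversions:
    if "edits" not in conv:  # skip anything that is not a cluster edit
        continue
    column = conv["columnName"]
    mapping = {}
    for edit in conv["edits"]:
        for source in edit["from"]:
            mapping[source] = edit["to"]
    df[column] = df[column].replace(mapping)

df.to_csv("output.csv", index=False)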

II. P3-batchrefine is coded mostly in Java, with some Python. It lets you run the transformations in the following way (not a genuine Python solution unless you are fine with calling an external Java tool):

./bin/batchrefine remote input.csv transform.json > output.csv
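If you prefer to drive that command from Python rather than from the shell, a plain subprocess call is enough (assuming the batchrefine executable sits at that path; the file names are illustrative):

import subprocess

with open("output.csv", "wb") as out:
    subprocess.check_call(
        ["./bin/batchrefine", "remote", "input.csv", "transform.json"],
        stdout=out)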

III. Pyrefine is a genuine Python solution, and it aims to work in the following way (copied from its docs):

import pyrefine
import pandas as pd

with open('script.json') as script_file:
    script = pyrefine.parse(script_file)

input_data = pd.read_csv('input.csv')
output_data = script.execute(input_data)

Further Information on parsing OpenRefine JSON

  1. Trying to parse a Json with Open Refine GREL

  2. Parse JSON in Google Refine

  3. Best way to parse a big and intricated Json file with OpenRefine (or R)




Answer 4:


Here are some additional projects:

  • https://github.com/PaulMakepeace/refine-client-py : this one is a little old, though
  • https://github.com/opencultureconsulting/openrefine-client : a fork of the previous project that extends the command-line interface (CLI) and supports communication between Docker containers
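Note that both of these are clients for the OpenRefine HTTP API, so unlike the approaches above they still require a running OpenRefine server.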


Source: https://stackoverflow.com/questions/45388524/how-to-execute-openrefine-json-on-csv-in-python
