Parsing a JSON string which was loaded from a CSV using Pandas

野的像风 2020-11-27 03:24

I am working with CSV files in which several columns each hold a simple JSON object (a few key-value pairs), while the other columns hold ordinary values. Here is an example:



        
5 answers
  • 2020-11-27 04:04

    Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)

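    We can fix this by sorting the list of dict keys on the LHS. This works because the apply on the RHS sorts the resulting columns by name, so both sides end up in the same order. Note that this depends on the pandas version: since pandas 0.23 (on Python 3.6+), a Series built from a dict keeps the dict's insertion order instead of sorting, so check the resulting column order before relying on this.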

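    import json
    import pandas

    def CustomParser(data):
        # Parse one cell of the 'stats' column from JSON text into a dict.
        return json.loads(data)

    # f1 is the path to the CSV file.
    df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
    # Sorted key names on the LHS are meant to match the column order on the RHS.
    df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)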
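
The ordering concern above can be sidestepped entirely by letting pandas take the column labels from the dict keys and joining on the index. The following is only a sketch under assumptions: the inline CSV, the `name` column, and the values are hypothetical sample data, with keys deliberately not in alphabetical order.

```python
import io
import json

import pandas as pd

# Hypothetical sample data; the keys are deliberately not alphabetical.
csv_text = '''name,stats
Alice,"{""weight"": 60, ""height"": 165, ""age"": 30}"
Bob,"{""weight"": 80, ""height"": 180, ""age"": 41}"
'''

df = pd.read_csv(io.StringIO(csv_text), converters={'stats': json.loads})

# Build the expanded frame first, then join on the index. Column labels
# come from the dict keys themselves, so no ordering assumption is needed.
expanded = pd.DataFrame(df['stats'].tolist(), index=df.index)
result = df.drop(columns=['stats']).join(expanded)
```

Because the labels travel with the values, a mismatch like the swapped height and weight columns cannot occur with this variant.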
    
  • 2020-11-27 04:06

    The json_normalize function (in the pandas.io.json package; also available as pandas.json_normalize since pandas 1.0) does this without a custom function.

    (assuming you are loading the data from a file)

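    import json
    import pandas as pd

    # Assumes the CSV has a header row with a 'stats' column of JSON text.
    df = pd.read_csv(file_path)
    # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize
    stats_df = pd.json_normalize(df['stats'].apply(json.loads).tolist())
    stats_df.index = df.index
    # join returns a new frame, so keep the result and drop the raw column.
    df = df.join(stats_df).drop(columns=['stats'])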
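
Where json_normalize goes beyond a plain DataFrame constructor is with nested objects. The question describes flat key-value pairs, so the nested `size` object below is purely an assumption for illustration, as are the sample values.

```python
import json

import pandas as pd

# Hypothetical JSON strings as they might appear in a 'stats' column.
raw = pd.Series([
    '{"age": 30, "size": {"height": 165, "weight": 60}}',
    '{"age": 41, "size": {"height": 180, "weight": 80}}',
])

# json_normalize flattens nested objects into dot-separated column names.
flat = pd.json_normalize(raw.apply(json.loads).tolist())
```

For flat objects the result is the same as building a DataFrame from the list of dicts, so the extra function only pays off when nesting is present.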
    
  • 2020-11-27 04:08

    I think applying json.loads is a good idea, but from there you can convert the result directly to DataFrame columns instead of writing and loading it again:

    stdf = df['stats'].apply(json.loads)
    pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
    

    or alternatively in one step:

    df.join(df['stats'].apply(json.loads).apply(pd.Series))
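
One detail worth checking when choosing between the two variants: a frame built from `tolist()` gets a fresh 0..n-1 index unless you pass the original one, which matters for the join. A small sketch, assuming a hypothetical frame with a non-default index:

```python
import json

import pandas as pd

# Hypothetical frame with a non-default index and a JSON string column.
df = pd.DataFrame(
    {
        'name': ['Alice', 'Bob'],
        'stats': ['{"age": 30, "height": 165}', '{"age": 41, "height": 180}'],
    },
    index=[10, 20],
)

parsed = df['stats'].apply(json.loads)

# Variant 1: build the frame from a list of dicts, reusing the original index.
via_list = pd.DataFrame(parsed.tolist(), index=df.index)

# Variant 2: expand each dict with pd.Series (keeps the index automatically).
via_apply = parsed.apply(pd.Series)

joined = df.join(via_list)
```

Both variants produce the same values here; the list-based one avoids creating a Series per row, which tends to be noticeably faster on large frames.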
    
  • 2020-11-27 04:12

    There is a slightly easier way, but ultimately you'll have to call json.loads. pandas.read_csv has the notion of a converter:

    converters : dict, optional
    
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
    

    So first define your custom parser. In this case the below should work:

    import json

    def CustomParser(data):
        # Convert the JSON text of one cell into a Python dict.
        return json.loads(data)
    

    In your case you'll have something like:

    import pandas
    df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
    

    We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. Each value in the stats column then becomes a dict.

    From here, we can use a little hack to append these columns in one step with the appropriate column names. This only works for regular data: every JSON object needs to have the same 3 keys, or missing values need to be handled in our CustomParser.

    df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
    

    On the left-hand side, we get the new column names from the keys of the first element of the stats column (each element is a dictionary), so this is a bulk assignment. On the right-hand side, we expand the stats column with apply, turning each dictionary into a row of a DataFrame with one column per key.
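
For irregular data, one way to handle missing values inside the parser is to normalize every cell to a fixed set of keys. This is a sketch, not the answer's own code: `parse_stats`, `EXPECTED_KEYS`, and the sample rows are hypothetical names and data introduced for illustration.

```python
import io
import json

import pandas as pd

# Assumed key set; adjust to the keys your JSON objects actually use.
EXPECTED_KEYS = ('age', 'height', 'weight')

def parse_stats(text):
    """Parse one cell and return a dict that always has EXPECTED_KEYS."""
    parsed = json.loads(text) if text else {}
    # Missing keys become None, which pandas treats as a missing value.
    return {key: parsed.get(key) for key in EXPECTED_KEYS}

# Hypothetical sample data: one complete row, one partial, one empty cell.
csv_text = '''name,stats
Alice,"{""age"": 30, ""height"": 165, ""weight"": 60}"
Bob,"{""age"": 41}"
Carol,
'''

df = pd.read_csv(io.StringIO(csv_text), converters={'stats': parse_stats})
stats = pd.DataFrame(df['stats'].tolist(), index=df.index)
```

Every row now yields the same columns, so the expansion step no longer depends on which keys happen to be present in the first row.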

  • 2020-11-27 04:25

    Option 1

    If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:

    import json
    import pandas as pd
    
    df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
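
To make the round trip concrete, here is a minimal sketch that writes a dict column with json.dumps and reads it back. It uses an in-memory buffer instead of a file, and the frame contents and the `stats` column name are hypothetical.

```python
import io
import json

import pandas as pd

# Hypothetical frame whose 'stats' column holds dicts.
original = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'stats': [{'age': 30, 'height': 165}, {'age': 41, 'height': 180}],
})

# Serialize the dict column to JSON text before writing.
to_write = original.assign(stats=original['stats'].apply(json.dumps))
buffer = io.StringIO()
to_write.to_csv(buffer, index=False)

# Read it back, parsing the JSON text into dicts again.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'stats': json.loads})
```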
    

    Option 2

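    If you didn't, the column most likely holds Python dict literals (single-quoted strings), which json.loads rejects. eval can parse those, but it will execute any code found in the file, so for untrusted data ast.literal_eval is the safer choice because it only accepts Python literals:

    import ast
    import pandas as pd
    
    df = pd.read_csv('data/file.csv', converters={'json_column_name': ast.literal_eval})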
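
A short sketch of what goes wrong in this case, assuming the column was written with str() rather than json.dumps. It also shows ast.literal_eval, a stricter standard-library stand-in for eval that parses the same text without executing code; the sample cell is hypothetical.

```python
import ast
import json

# A dict written with str() uses single quotes, which is not valid JSON.
cell = str({'age': 30, 'height': 165})

try:
    json.loads(cell)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval accepts only Python literals (dicts, lists, numbers, ...).
parsed = ast.literal_eval(cell)

# Anything that is not a literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```

Plain eval would have executed the call in the last example, which is why it is best kept away from files you do not fully control.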
    

    Option 3

    For more complicated situations you can write a custom converter like this:

    import json
    import pandas as pd
    
    def parse_column(data):
        try:
            return json.loads(data)
        except Exception as e:
            print(e)
            return None
    
    
    df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
    