How to build a JSON file with nested records from a flat data table?

清歌不尽 2021-01-12 10:49

I'm looking for a Python technique to build a nested JSON file from a flat table in a pandas data frame. For example, how could a pandas data frame table such as:

         


        
2 answers
  • 2021-01-12 11:30

    With some input from @root I used a different tack and came up with the following code, which seems to get most of the way there:

    import pandas
    import json
    from collections import defaultdict
    
    inputExcel = 'E:\\teamsMM.xlsx'
    exportJson = 'E:\\teamsMM.json'
    
    # Current pandas spells the keyword 'sheet_name' and no longer accepts 'encoding' here
    data = pandas.read_excel(inputExcel, sheet_name = 'SCAT Teams')
    
    grouped = data.groupby(['teamname', 'members']).first()
    
    results = defaultdict(lambda: defaultdict(dict))
    
    for t in grouped.itertuples():
        for i, key in enumerate(t.Index):
            if i ==0:
                nested = results[key]
            elif i == len(t.Index) -1:
                nested[key] = t
            else:
                nested = nested[key]
    
    
    formattedJson = json.dumps(results, indent = 4)
    
    formattedJson = '{\n"teams": [\n' + formattedJson +'\n]\n }'
    
    # The 'with' block closes the file once the JSON has been written
    with open(exportJson, "w") as parsed:
        parsed.write(formattedJson)
    

    The resulting JSON file is this:

    {
    "teams": [
    {
        "1": {
            "0": [
                [
                    1, 
                    0
                ], 
                "John", 
                "Doe", 
                "Anon", 
                "916-555-1234", 
                "none", 
                "john.doe@wildlife.net"
            ], 
            "1": [
                [
                    1, 
                    1
                ], 
                "Jane", 
                "Doe", 
                "Anon", 
                "916-555-4321", 
                "916-555-7890", 
                "jane.doe@wildlife.net"
            ]
        }, 
        "2": {
            "0": [
                [
                    2, 
                    0
                ], 
                "Mickey", 
                "Moose", 
                "Moosers", 
                "916-555-0000", 
                "916-555-1111", 
                "mickey.moose@wildlife.net"
            ], 
            "1": [
                [
                    2, 
                    1
                ], 
                "Minny", 
                "Moose", 
                "Moosers", 
                "916-555-2222", 
                "none", 
                "minny.moose@wildlife.net"
            ]
        }
    }
    ]
     }
    

    This format is very close to the desired end product. Two issues remain: removing the redundant array [1, 0] that appears just above each first name, and getting each nest to use the headers "teamname": "1", "members": rather than "1": "0":

    Also, I do not know why each record is stripped of its headings during the conversion. For instance, why is the dictionary entry "firstname": "John" exported as just "John"?
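
    A likely explanation for the missing headings: itertuples() yields named tuples, and json.dumps writes any tuple, named or not, as a JSON array, so the field names never reach the output. The sketch below reproduces this with a hypothetical row (the Row type and its two fields are stand-ins, not the real spreadsheet columns) and shows one possible way to keep the names by converting each row to a dict first.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for one row yielded by DataFrame.itertuples()
Row = namedtuple("Row", ["Index", "firstname", "lastname"])
row = Row(Index=(1, 0), firstname="John", lastname="Doe")

# A tuple (named or not) is serialized as a JSON array, so field names are lost
as_array = json.dumps(row)

# Converting to a dict keeps the names; dropping "Index" removes the [1, 0] pair
record = row._asdict()
record.pop("Index")
as_object = json.dumps(record)

print(as_array)
print(as_object)
```

    Storing the dict rather than the raw tuple in the nested results would address both the stripped headings and the redundant index array, assuming the rest of the loop stays as it is.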

  • 2021-01-12 11:51

    This is a solution that works and creates the desired JSON format. First, I grouped my dataframe by the appropriate columns. Then, instead of creating a dictionary (and losing data order) for each column heading/record pair, I created them as lists of tuples and converted each list into an OrderedDict. Another OrderedDict was created for the two columns that everything else was grouped by. Precise layering between lists and OrderedDicts was necessary for the JSON conversion to produce the correct format. Also note that when dumping to JSON, sort_keys must be left as False (its default), or all your OrderedDicts will be rearranged into alphabetical order.
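
    A minimal illustration of that layering, using made-up values rather than the spreadsheet (the "email" field name is an assumption). It shows the list-inside-OrderedDict structure and what sort_keys does to it:

```python
import json
from collections import OrderedDict

# Made-up member record; the field names are illustrative only
member = OrderedDict([
    ("firstname", "John"),
    ("lastname", "Doe"),
    ("email", "john.doe@wildlife.net"),
])

# One team: 'teamname' first, then the list of member records
team = OrderedDict([("teamname", "1"), ("members", [member])])
teams = {"teams": [team]}

in_order = json.dumps(teams, sort_keys=False)    # keeps insertion order
sorted_keys = json.dumps(teams, sort_keys=True)  # alphabetizes every level

print(in_order)
print(sorted_keys)
```

    With sort_keys=True, "members" moves ahead of "teamname" and "email" ahead of "firstname", which is the reordering to avoid. The full script follows.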

    import pandas
    import json
    from collections import OrderedDict
    
    inputExcel = 'E:\\teams.xlsx'
    exportJson = 'E:\\teams.json'
    
    # Current pandas spells the keyword 'sheet_name' and no longer accepts 'encoding' here
    data = pandas.read_excel(inputExcel, sheet_name = 'SCAT Teams')
    
    # Tuple of column headings, used later to pair each heading with its column data
    columnList = tuple(str(col) for col in data.columns)
    
    #This groups the dataframe by the 'teamname' and 'members' columns
    grouped = data.groupby(['teamname', 'members']).first()
    
    # Unique team names, taken from the first level of the grouped index
    tm = grouped.index.levels[0]
    
    #Create a list to add team records to at the end of the first 'for' loop
    teamsList = []
    
    for teamN in tm:
        teamN = int(teamN)  # convert from a numpy integer to avoid "TypeError: 1 is not JSON serializable"
        tempList = []   # temporary list that collects this team's member records
        for index, row in grouped.iterrows():
            if index[0] == teamN:  # keep only rows whose first index level matches the team number
    
                # Pair each heading with its value and build an OrderedDict from the pairs,
                # so the JSON fields come out in the same order as the columns
                rowDict = OrderedDict(zip(columnList[2:8], row.iloc[:6]))
                tempList.append(rowDict)
        # Another OrderedDict keeps 'teamname' ahead of the list of members
        t = OrderedDict([('teamname', str(teamN)), ('members', tempList)])
    
        # Append this team to the empty list of teams created earlier
        teamsList.append(t)
    
    
    #Create a final dictionary with a single item: the list of teams
    teams = {"teams":teamsList} 
    
    #Dump to JSON format
    formattedJson = json.dumps(teams, indent = 1, sort_keys = False) # sort_keys must stay False (the default), or every dictionary will be alphabetized
    formattedJson = formattedJson.replace("NaN", '"NULL"') # missing pandas values are written as NaN, which is not valid JSON, so replace them with the string "NULL"
    print(formattedJson)
    
    #Export to JSON file
    with open(exportJson, "w") as parsed:
        parsed.write(formattedJson)
    
    print("\n\nExport to JSON Complete")
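
    For comparison, a shorter sketch of the same grouping idea, not the code above. It assumes Python 3.7 or later, where plain dicts keep insertion order, and uses made-up rows; the "phone2" column name and the clean helper are hypothetical. Missing values are turned into None so that json.dumps writes null, which avoids the string replacement on "NaN".

```python
import json
import pandas as pd

# Made-up rows standing in for the spreadsheet; column names other than
# 'teamname' and 'members' are assumptions
data = pd.DataFrame({
    "teamname": [1, 1, 2],
    "members": [0, 1, 0],
    "firstname": ["John", "Jane", "Mickey"],
    "phone2": [None, "916-555-7890", "916-555-1111"],
})


def clean(record):
    # Swap pandas missing values for None, which json.dumps writes as null
    return {key: (None if pd.isna(value) else value) for key, value in record.items()}


teams_list = []
for team, group in data.groupby("teamname"):
    # Drop the grouping columns so only the member fields remain
    fields = group.drop(columns=["teamname", "members"])
    records = [clean(rec) for rec in fields.to_dict(orient="records")]
    teams_list.append({"teamname": str(team), "members": records})

formatted = json.dumps({"teams": teams_list}, indent=1)
print(formatted)
```

    Each team becomes one dict holding its name and a list of member records, the same shape as the OrderedDict version produces.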
    