I\'m looking for a Python technique to build a nested JSON file from a flat table in a pandas data frame. For example how could a pandas data frame table such as:
With some input from @root I used a different tack and came up with the following code, which seems to get most of the way there:
import pandas
import json
from collections import defaultdict
inputExcel = 'E:\\teamsMM.xlsx'
exportJson = 'E:\\teamsMM.json'
data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')
grouped = data.groupby(['teamname', 'members']).first()
results = defaultdict(lambda: defaultdict(dict))
for t in grouped.itertuples():
for i, key in enumerate(t.Index):
if i ==0:
nested = results[key]
elif i == len(t.Index) -1:
nested[key] = t
else:
nested = nested[key]
formattedJson = json.dumps(results, indent = 4)
formattedJson = '{\n"teams": [\n' + formattedJson +'\n]\n }'
parsed = open(exportJson, "w")
parsed.write(formattedJson)
The resulting JSON file is this:
{
"teams": [
{
"1": {
"0": [
[
1,
0
],
"John",
"Doe",
"Anon",
"916-555-1234",
"none",
"john.doe@wildlife.net"
],
"1": [
[
1,
1
],
"Jane",
"Doe",
"Anon",
"916-555-4321",
"916-555-7890",
"jane.doe@wildlife.net"
]
},
"2": {
"0": [
[
2,
0
],
"Mickey",
"Moose",
"Moosers",
"916-555-0000",
"916-555-1111",
"mickey.moose@wildlife.net"
],
"1": [
[
2,
1
],
"Minny",
"Moose",
"Moosers",
"916-555-2222",
"none",
"minny.moose@wildlife.net"
]
}
}
]
}
This format is very close to the desired end product. Remaining issues are: removing the redundant array [1, 0] that appears just above each firstname, and getting the headers for each nest to be "teamname": "1", "members": rather than "1": "0":
Also, I do not know why each record is being stripped of its heading on the conversion. For instance why is dictionary entry "firstname":"John" exported as "John".
This is the a solution that works and creates the desired JSON format. First, I grouped my dataframe by the appropriate columns, then instead of creating a dictionary (and losing data order) for each column heading/record pair, I created them as lists of tuples, then transformed the list into an Ordered Dict. Another Ordered Dict was created for the two columns that everything else was grouped by. Precise layering between lists and ordered dicts was necessary to for the JSON conversion to produce the correct format. Also note that when dumping to JSON, sort_keys must be set to false, or all your Ordered Dicts will be rearranged into alphabetical order.
import pandas
import json
from collections import OrderedDict
inputExcel = 'E:\\teams.xlsx'
exportJson = 'E:\\teams.json'
data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')
# This creates a tuple of column headings for later use matching them with column data
cols = []
columnList = list(data[0:])
for col in columnList:
cols.append(str(col))
columnList = tuple(cols)
#This groups the dataframe by the 'teamname' and 'members' columns
grouped = data.groupby(['teamname', 'members']).first()
#This creates a reference to the index level of the groups
groupnames = data.groupby(["teamname", "members"]).grouper.levels
tm = (groupnames[0])
#Create a list to add team records to at the end of the first 'for' loop
teamsList = []
for teamN in tm:
teamN = int(teamN) #added this in to prevent TypeError: 1 is not JSON serializable
tempList = [] #Create an temporary list to add each record to
for index, row in grouped.iterrows():
dataRow = row
if index[0] == teamN: #Select the record in each row of the grouped dataframe if its index matches the team number
#In order to have the JSON records come out in the same order, I had to first create a list of tuples, then convert to and Ordered Dict
rowDict = ([(columnList[2], dataRow[0]), (columnList[3], dataRow[1]), (columnList[4], dataRow[2]), (columnList[5], dataRow[3]), (columnList[6], dataRow[4]), (columnList[7], dataRow[5])])
rowDict = OrderedDict(rowDict)
tempList.append(rowDict)
#Create another Ordered Dict to keep 'teamname' and the list of members from the temporary list sorted
t = ([('teamname', str(teamN)), ('members', tempList)])
t= OrderedDict(t)
#Append the Ordered Dict to the emepty list of teams created earlier
ListX = t
teamsList.append(ListX)
#Create a final dictionary with a single item: the list of teams
teams = {"teams":teamsList}
#Dump to JSON format
formattedJson = json.dumps(teams, indent = 1, sort_keys = False) #sort_keys MUST be set to False, or all dictionaries will be alphebetized
formattedJson = formattedJson.replace("NaN", '"NULL"') #"NaN" is the NULL format in pandas dataframes - must be replaced with "NULL" to be a valid JSON file
print formattedJson
#Export to JSON file
parsed = open(exportJson, "w")
parsed.write(formattedJson)
print"\n\nExport to JSON Complete"