Question
I have a large nested JSON document for each year (say 2018, 2017), with data aggregated by month (Jan-Dec) and by day (1-31).
{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
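For reference, the call above assumes the usual Glue job boilerplate around it; a minimal sketch (the S3 input path below is a placeholder):

from awsglue.transforms import Relationalize
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

## read the nested JSON into a DynamicFrame (placeholder path)
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://my-input-bucket/yearly-json/"]},
    format = "json")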
This gives me a table with a column for each JSON leaf element, as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, the table ends up with a great many columns, one pair per day per month. I want to simplify it by converting the columns into rows, to get the table below:
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
My searching has not turned up the right answer. Is there a solution in AWS Glue/PySpark, or any other way, to accomplish an unpivot and get a row-based table from the column-based one? Can it be done in Athena?
Answer 1:
I implemented a solution similar to the snippet below:
from pyspark.sql import Row
from awsglue.dynamicframe import DynamicFrame

dataFrame = datasource0.toDF()
tableDataArray = [] ## to hold rows
for row in dataFrame.rdd.toLocalIterator():
    for colName in dataFrame.schema.names:
        ## emit one output row per day: key on the ".u" columns and look up
        ## the matching ".n" column in the same source row
        if not colName.endswith('.u'):
            continue
        keyArray = colName.split('.') ## e.g. ['2018', 'Jan', '1', 'u']
        rowDataArray = []
        rowDataArray.insert(0,str(keyArray[0])) ## year
        rowDataArray.insert(1,str(keyArray[1])) ## month
        rowDataArray.insert(2,str(keyArray[2])) ## dd
        rowDataArray.insert(3,str(row[colName])) ## u
        rowDataArray.insert(4,str(row['.'.join(keyArray[:3]) + '.n'])) ## n
        tableDataArray.append(rowDataArray)

unpivotDF = None
for rowDataArray in tableDataArray:
    newRowDF = sc.parallelize([Row(year=rowDataArray[0],month=rowDataArray[1],dd=rowDataArray[2],u=rowDataArray[3],n=rowDataArray[4])]).toDF()
    if unpivotDF is None:
        unpivotDF = newRowDF
    else:
        unpivotDF = unpivotDF.union(newRowDF)

datasource0 = DynamicFrame.fromDF(unpivotDF, glueContext, "datasource0")
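On the sample document this yields the four year/month/dd/u/n rows shown in the question, which you can verify with unpivotDF.show().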
In the above, newRowDF can also be created as below if the data types have to be enforced:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

columns = [StructField('year', StringType(), True), StructField('month', StringType(), True),
           StructField('dd', IntegerType(), True), StructField('u', IntegerType(), True), StructField('n', IntegerType(), True)]
schema = StructType(columns)
unpivotDF = spark.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
    ## wrap the list so createDataFrame treats it as a single row; cast the numeric fields
    newRowDF = spark.createDataFrame([(rowDataArray[0], rowDataArray[1], int(rowDataArray[2]), int(rowDataArray[3]), int(rowDataArray[4]))], schema)
    unpivotDF = unpivotDF.union(newRowDF)
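Note: a more scalable variant of the same idea (a sketch, assuming the same dotted 2018.Jan.1.u column layout) builds all output rows in one distributed pass with flatMap, instead of collecting to the driver and unioning one-row DataFrames, which gets slow as the number of day columns grows:

from pyspark.sql import Row

def explode_day_columns(row, colNames):
    ## yield one Row per day, pairing each ".u" column with its ".n" sibling
    for colName in colNames:
        if colName.endswith('.u'):
            year, month, dd, _ = colName.split('.')
            yield Row(year=year, month=month, dd=dd,
                      u=row[colName], n=row[year + '.' + month + '.' + dd + '.n'])

colNames = dataFrame.schema.names
unpivotDF = dataFrame.rdd.flatMap(lambda r: explode_day_columns(r, colNames)).toDF()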
Source: https://stackoverflow.com/questions/54030601/how-to-unpivot-columns-into-rows-in-aws-glue-py-spark-script