Question
I have a large nested JSON document for each year (say 2018, 2017), with data aggregated by month (Jan-Dec) and by day (1-31).
{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}
I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:
dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, transformation_ctx = "dfc")
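For reference, the call above assumes the usual Glue job boilerplate around it; a minimal sketch (the S3 input path below is a placeholder):

from awsglue.transforms import Relationalize
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

## read the nested JSON into a DynamicFrame (placeholder path)
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://my-input-bucket/yearly-json/"]},
    format = "json")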
This gives me a table with a column for each JSON leaf element, as below:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, the table ends up with a great many columns, one pair per day per month. I want to simplify it by converting the columns into rows, to get the table below:
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
My searching has not turned up the right answer. Is there a solution in AWS Glue/PySpark, or any other way, to accomplish an unpivot and get a row-based table from the column-based one? Can it be done in Athena?
Answer 1:
I implemented a solution similar to the snippet below:
from pyspark.sql import Row
from awsglue.dynamicframe import DynamicFrame

dataFrame = datasource0.toDF()
tableDataArray = [] ## to hold rows
for row in dataFrame.rdd.toLocalIterator():
    for colName in dataFrame.schema.names:
        ## emit one output row per day: key on the ".u" columns and look up
        ## the matching ".n" column in the same source row
        if not colName.endswith('.u'):
            continue
        keyArray = colName.split('.') ## e.g. ['2018', 'Jan', '1', 'u']
        rowDataArray = []
        rowDataArray.insert(0,str(keyArray[0])) ## year
        rowDataArray.insert(1,str(keyArray[1])) ## month
        rowDataArray.insert(2,str(keyArray[2])) ## dd
        rowDataArray.insert(3,str(row[colName])) ## u
        rowDataArray.insert(4,str(row['.'.join(keyArray[:3]) + '.n'])) ## n
        tableDataArray.append(rowDataArray)

unpivotDF = None
for rowDataArray in tableDataArray:
    newRowDF = sc.parallelize([Row(year=rowDataArray[0],month=rowDataArray[1],dd=rowDataArray[2],u=rowDataArray[3],n=rowDataArray[4])]).toDF()
    if unpivotDF is None:
        unpivotDF = newRowDF
    else:
        unpivotDF = unpivotDF.union(newRowDF)

datasource0 = DynamicFrame.fromDF(unpivotDF, glueContext, "datasource0")
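On the sample document this yields the four year/month/dd/u/n rows shown in the question, which you can verify with unpivotDF.show().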
In the above, newRowDF can also be created as below if the data types have to be enforced:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

columns = [StructField('year', StringType(), True), StructField('month', StringType(), True),
           StructField('dd', IntegerType(), True), StructField('u', IntegerType(), True), StructField('n', IntegerType(), True)]
schema = StructType(columns)
unpivotDF = spark.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
    ## wrap the list so createDataFrame treats it as a single row; cast the numeric fields
    newRowDF = spark.createDataFrame([(rowDataArray[0], rowDataArray[1], int(rowDataArray[2]), int(rowDataArray[3]), int(rowDataArray[4]))], schema)
    unpivotDF = unpivotDF.union(newRowDF)
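Note: a more scalable variant of the same idea (a sketch, assuming the same dotted 2018.Jan.1.u column layout) builds all output rows in one distributed pass with flatMap, instead of collecting to the driver and unioning one-row DataFrames, which gets slow as the number of day columns grows:

from pyspark.sql import Row

def explode_day_columns(row, colNames):
    ## yield one Row per day, pairing each ".u" column with its ".n" sibling
    for colName in colNames:
        if colName.endswith('.u'):
            year, month, dd, _ = colName.split('.')
            yield Row(year=year, month=month, dd=dd,
                      u=row[colName], n=row[year + '.' + month + '.' + dd + '.n'])

colNames = dataFrame.schema.names
unpivotDF = dataFrame.rdd.flatMap(lambda r: explode_day_columns(r, colNames)).toDF()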
Source: https://stackoverflow.com/questions/54030601/how-to-unpivot-columns-into-rows-in-aws-glue-py-spark-script