PySpark - Explode columns into rows based on the type of the column

情歌与酒 2021-01-27 06:12

Given a Dataframe:

+---+-----------+---------+-------+------------+
| id|      score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+------------+


        
2 Answers
  • 2021-01-27 06:36

    You can try several unions:

    
    from pyspark.sql import functions as F

    # One select per source column. union() matches columns by position,
    # so every select must list the same columns in the same order.
    df = df.select(
        "id",
        F.col("score").cast("string").alias("col_value"),
        F.lit("Y").alias("is_score"),
        F.lit("N").alias("is_amount"),
        F.lit("N").alias("is_boolean"),
        F.lit("N").alias("is_text"),
    ).union(df.select(
        "id",
        F.col("tx_amount").cast("string").alias("col_value"),
        F.lit("N").alias("is_score"),
        F.lit("Y").alias("is_amount"),
        F.lit("N").alias("is_boolean"),
        F.lit("N").alias("is_text"),
    )).union(...) # etc
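
If writing every union by hand gets tedious, the same idea can probably be generated in a loop. This is only a sketch: `FLAG_FOR_COLUMN`, `flag_values` and `unpivot_with_flags` are hypothetical names, and the pairing of `isValid` with `is_boolean` and `greeting` with `is_text` is a guess based on the column names in the question.

```python
from functools import reduce

# Assumed mapping of source column -> flag column. The first two pairs come from
# the answer above; the last two are a guess based on the question's column names.
FLAG_FOR_COLUMN = {
    "score": "is_score",
    "tx_amount": "is_amount",
    "isValid": "is_boolean",
    "greeting": "is_text",
}


def flag_values(column):
    """Return {flag_name: "Y" or "N"} for one source column, in a fixed order."""
    return {
        flag: ("Y" if source == column else "N")
        for source, flag in FLAG_FOR_COLUMN.items()
    }


def unpivot_with_flags(df):
    """Build one select per source column and union them all together."""
    # Imported here so that flag_values can be used without Spark installed.
    from pyspark.sql import functions as F

    def one_select(column):
        flags = [F.lit(value).alias(name) for name, value in flag_values(column).items()]
        return df.select("id", F.col(column).cast("string").alias("col_value"), *flags)

    # union() is positional, so it relies on flag_values keeping the same key order.
    return reduce(lambda left, right: left.union(right), map(one_select, FLAG_FOR_COLUMN))
```

Calling `unpivot_with_flags(df)` should then give one row per (id, source column) pair, with exactly one flag set to "Y" in each row.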
    
    
  • 2021-01-27 06:41

    Sample DataFrame:

    df.show()
    df.printSchema()
    
    +---+-----------+---------+-------+------------+
    | id|model_score|tx_amount|isValid|    greeting|
    +---+-----------+---------+-------+------------+
    |  1|        0.2|    23.78|   true| hello_world|
    |  2|        0.6|    12.41|  false|byebye_world|
    +---+-----------+---------+-------+------------+
    
    root
     |-- id: integer (nullable = true)
     |-- model_score: double (nullable = true)
     |-- tx_amount: double (nullable = true)
     |-- isValid: boolean (nullable = true)
     |-- greeting: string (nullable = true)
    

    I tried to keep it dynamic for any set of input columns. It takes the column types from df.dtypes[1:]; the [1:] slice skips id, because id is not included in col_value. An array can only hold elements of a single type, so all columns are cast to string before applying the logic. I think it should work for your use case. You can build your Y/N columns from here.

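    from pyspark.sql import functions as F

    # Per column, pack [value, type, name] into an array; collect those into an array of arrays.
    triples = F.array([F.array(x[0], F.lit(x[1]), F.lit(x[0])) for x in df.dtypes[1:]])
    (df.select([F.col(c).cast("string") for c in df.columns])
       .withColumn("cols", F.explode(F.arrays_zip(triples)))  # one row per source column
       .select("id", F.col("cols.*"))  # the zipped struct's only field is named "0"
       .withColumn("col_value", F.element_at("0", 1))
       .withColumn("col_type", F.element_at("0", 2))
       .withColumn("col_name", F.element_at("0", 3))
       .drop("0")
       .show())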
    
    +---+------------+--------+-----------+
    | id|   col_value|col_type|   col_name|
    +---+------------+--------+-----------+
    |  1|         0.2|  double|model_score|
    |  1|       23.78|  double|  tx_amount|
    |  1|        true| boolean|    isValid|
    |  1| hello_world|  string|   greeting|
    |  2|         0.6|  double|model_score|
    |  2|       12.41|  double|  tx_amount|
    |  2|       false| boolean|    isValid|
    |  2|byebye_world|  string|   greeting|
    +---+------------+--------+-----------+
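
A variation that might be easier to read, offered as a sketch rather than a drop-in replacement: explode an array of structs instead of zipping arrays, and skip the id column by name instead of by position. `value_columns` and `explode_by_type` are hypothetical helper names, and the `id_col` parameter is an assumption for illustration.

```python
def value_columns(dtypes, id_col="id"):
    """Return the (name, type) pairs from df.dtypes, skipping the id column by name."""
    return [(name, dtype) for name, dtype in dtypes if name != id_col]


def explode_by_type(df, id_col="id"):
    """Turn every non-id column into its own row holding value, type and name."""
    # Imported here so that value_columns can be used without Spark installed.
    from pyspark.sql import functions as F

    # One struct per source column; values are cast to string so that all
    # structs share the same type and can live in a single array.
    structs = [
        F.struct(
            F.col(name).cast("string").alias("col_value"),
            F.lit(dtype).alias("col_type"),
            F.lit(name).alias("col_name"),
        )
        for name, dtype in value_columns(df.dtypes, id_col)
    ]
    return (
        df.select(id_col, F.explode(F.array(*structs)).alias("cols"))
          .select(id_col, "cols.*")
    )
```

With the sample DataFrame, `explode_by_type(df).show()` should list the same id, col_value, col_type and col_name columns as the table above. The Y/N columns can then be derived from col_name or col_type, for example with `F.when`.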
    