PySpark - Explode columns into rows based on the type of the column

情歌与酒 2021-01-27 06:12

Given a DataFrame:

+---+-----------+---------+-------+------------+
| id|      score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+---------         


        
2 answers
  •  不知归路
    2021-01-27 06:41

    Sample DataFrame:

    df.show()
    df.printSchema()
    
    +---+-----------+---------+-------+------------+
    | id|model_score|tx_amount|isValid|    greeting|
    +---+-----------+---------+-------+------------+
    |  1|        0.2|    23.78|   true| hello_world|
    |  2|        0.6|    12.41|  false|byebye_world|
    +---+-----------+---------+-------+------------+
    
    root
     |-- id: integer (nullable = true)
     |-- model_score: double (nullable = true)
     |-- tx_amount: double (nullable = true)
     |-- isValid: boolean (nullable = true)
     |-- greeting: string (nullable = true)
    

    I tried to keep this dynamic so that it works for any set of input columns. The column names and types are taken from df.dtypes[1:]; the [1:] slice skips id, which stays as the key and is not exploded into col_value. A Spark array can only hold elements of a single type, so every column is cast to string before the logic is applied. I think this should work for your use case, and you can build your Y/N columns from here.

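    from pyspark.sql import functions as F

    # Cast every column to string, build one [value, type, name] array per
    # non-id column, explode those arrays into rows, then split each array
    # into the col_value, col_type and col_name columns.
    df.select([F.col(c).cast("string") for c in df.columns])\
            .withColumn("cols", F.explode(F.arrays_zip(F.array([F.array(x[0],F.lit(x[1]),F.lit(x[0]))\
                                                        for x in df.dtypes[1:]]))))\
            .select("id", F.col("cols.*")).withColumn("col_value", F.element_at("0",1))\
                                          .withColumn("col_type", F.element_at("0",2))\
                                          .withColumn("col_name", F.element_at("0",3)).drop("0").show()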
    
    +---+------------+--------+-----------+
    | id|   col_value|col_type|   col_name|
    +---+------------+--------+-----------+
    |  1|         0.2|  double|model_score|
    |  1|       23.78|  double|  tx_amount|
    |  1|        true| boolean|    isValid|
    |  1| hello_world|  string|   greeting|
    |  2|         0.6|  double|model_score|
    |  2|       12.41|  double|  tx_amount|
    |  2|       false| boolean|    isValid|
    |  2|byebye_world|  string|   greeting|
    +---+------------+--------+-----------+
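
    To make the reshaping logic easier to follow, here is a minimal plain-Python sketch of the same wide-to-long step, with no Spark session involved. It is only an illustration, not the Spark code above: `rows`, `dtypes`, `to_spark_string` and `explode_columns` are hypothetical names, where `rows` stands in for the collected DataFrame and `dtypes` mirrors what `df.dtypes` would return, assuming id is the first column.

```python
# Plain-Python trace of the wide-to-long reshape; no Spark session needed.
# `rows` and `dtypes` are hypothetical stand-ins for the DataFrame's rows
# and for df.dtypes (a list of (column name, type name) pairs).
rows = [
    {"id": 1, "model_score": 0.2, "tx_amount": 23.78,
     "isValid": True, "greeting": "hello_world"},
    {"id": 2, "model_score": 0.6, "tx_amount": 12.41,
     "isValid": False, "greeting": "byebye_world"},
]
dtypes = [
    ("id", "int"),
    ("model_score", "double"),
    ("tx_amount", "double"),
    ("isValid", "boolean"),
    ("greeting", "string"),
]


def to_spark_string(value):
    """Approximate a cast to string for the types used here.

    Booleans are rendered in lowercase to match the output table above.
    """
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def explode_columns(rows, dtypes, key="id"):
    """Return one (key, col_value, col_type, col_name) tuple per non-key column.

    Assumes the key column is the first entry of `dtypes`, which is why
    the loop iterates over dtypes[1:].
    """
    long_rows = []
    for row in rows:
        for name, col_type in dtypes[1:]:
            long_rows.append(
                (row[key], to_spark_string(row[name]), col_type, name)
            )
    return long_rows


long_rows = explode_columns(rows, dtypes)
for long_row in long_rows:
    print(long_row)
```

    Each input row yields one output row per non-id column, and the values end up in a single string-typed col_value field, which is the reason for the cast to string in the Spark version.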
    
