Sample DataFrame:
df.show()
df.printSchema()
+---+-----------+---------+-------+------------+
| id|model_score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+------------+
|  1|        0.2|    23.78|   true| hello_world|
|  2|        0.6|    12.41|  false|byebye_world|
+---+-----------+---------+-------+------------+

root
 |-- id: integer (nullable = true)
 |-- model_score: double (nullable = true)
 |-- tx_amount: double (nullable = true)
 |-- isValid: boolean (nullable = true)
 |-- greeting: string (nullable = true)
I tried to keep this dynamic so that it works for any set of input columns. The column types are taken from df.dtypes[1:]; the slice skips the first entry because id is not one of the columns that ends up in col_value (this assumes id is the first column). A Spark array can only hold elements of a single type, which is why every column is cast to string before the logic is applied. I think this should work for your use case, and you can build your Y/N columns from the result.
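from pyspark.sql import functions as F

# Cast everything to string first: all elements of an array must share one type.
df.select([F.col(c).cast("string") for c in df.columns])\
    .withColumn("cols", F.explode(F.arrays_zip(F.array(
        # One [value, type, name] array per non-id column.
        [F.array(F.col(x[0]), F.lit(x[1]), F.lit(x[0])) for x in df.dtypes[1:]]))))\
    .select("id", F.col("cols.*"))\
    .withColumn("col_value", F.element_at("0", 1))\
    .withColumn("col_type", F.element_at("0", 2))\
    .withColumn("col_name", F.element_at("0", 3))\
    .drop("0").show()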
+---+------------+--------+-----------+
| id|   col_value|col_type|   col_name|
+---+------------+--------+-----------+
|  1|         0.2|  double|model_score|
|  1|       23.78|  double|  tx_amount|
|  1|        true| boolean|    isValid|
|  1| hello_world|  string|   greeting|
|  2|         0.6|  double|model_score|
|  2|       12.41|  double|  tx_amount|
|  2|       false| boolean|    isValid|
|  2|byebye_world|  string|   greeting|
+---+------------+--------+-----------+
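
If it helps to see what the pipeline does row by row, here is a rough plain-Python model of the same reshaping, with no Spark involved. It is only a sketch for illustration: rows, dtypes, to_string and melt are made-up names, dtypes mimics the (name, type) pairs that df.dtypes returns, and to_string only approximates Spark's string cast for the values in this sample (booleans become lowercase).

```python
# Plain-Python model of the reshaping; illustrative only, not Spark code.
rows = [
    {"id": 1, "model_score": 0.2, "tx_amount": 23.78,
     "isValid": True, "greeting": "hello_world"},
    {"id": 2, "model_score": 0.6, "tx_amount": 12.41,
     "isValid": False, "greeting": "byebye_world"},
]

# Same shape as df.dtypes: a list of (column name, type name) pairs.
dtypes = [
    ("id", "int"),
    ("model_score", "double"),
    ("tx_amount", "double"),
    ("isValid", "boolean"),
    ("greeting", "string"),
]


def to_string(value):
    """Approximate Spark's cast to string for the sample values."""
    if isinstance(value, bool):
        return str(value).lower()  # Spark renders booleans as true/false
    return str(value)


def melt(rows, dtypes):
    """Return one (id, col_value, col_type, col_name) tuple per row and non-id column."""
    out = []
    for row in rows:
        for name, type_name in dtypes[1:]:  # skip the leading id column
            out.append((row["id"], to_string(row[name]), type_name, name))
    return out


melted = melt(rows, dtypes)
for record in melted:
    print(record)
```

Each input row yields one output tuple per non-id column, which is why the two sample rows turn into eight rows in the result above.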