I would like to create a function in PySpark that gets a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns for the values of those categorical features.
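That is, something with an interface roughly like this (the function and argument names below are just placeholders, not an existing API):
# hypothetical interface -- names are placeholders
def add_dummy_columns(df, categorical_cols):
    # return df with one extra 0/1 column per distinct value of each column in categorical_cols
    ...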
The solutions provided by Freek Wiemkeijer and Rakesh Kumar are perfectly adequate; however, since I had already coded it up, I thought it was worth posting this generic solution, as it doesn't require hard-coding the column names.
pivot_cols = ['TYPE','CODE']
keys = ['ID','TYPE','CODE']

before = sc.parallelize([(1,'A','X1'),
                         (2,'B','X2'),
                         (3,'B','X3'),
                         (1,'B','X3'),
                         (2,'C','X2'),
                         (3,'C','X2'),
                         (1,'C','X1'),
                         (1,'B','X1')]).toDF(['ID','TYPE','CODE'])

# Helper function to recursively join a list of dataframes
# Can be simplified if you only need two columns
def join_all(dfs, keys):
    if len(dfs) > 1:
        return dfs[0].join(join_all(dfs[1:], keys), on=keys, how='inner')
    else:
        return dfs[0]

combined = []
for pivot_col in pivot_cols:
    pivotDF = before.groupBy(keys).pivot(pivot_col).count()
    new_names = pivotDF.columns[:len(keys)] + ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]
    df = pivotDF.toDF(*new_names).fillna(0)
    combined.append(df)

join_all(combined, keys).show()
This gives as output:
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
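Since the question asks for a function that takes a DataFrame and a list of columns, the same logic can be wrapped up roughly like this (dummy_encode is just an illustrative name; it reuses the join_all helper from above):
def dummy_encode(df, pivot_cols, keys, prefix='e'):
    # Build one pivoted frame per categorical column, then join them back together on the keys
    encoded = []
    for pivot_col in pivot_cols:
        pivotDF = df.groupBy(keys).pivot(pivot_col).count()
        new_names = pivotDF.columns[:len(keys)] + \
                    ["{0}_{1}_{2}".format(prefix, pivot_col, c) for c in pivotDF.columns[len(keys):]]
        encoded.append(pivotDF.toDF(*new_names).fillna(0))
    return join_all(encoded, keys)

dummy_encode(before, pivot_cols, keys).show()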
First you need to collect the distinct values of TYPE and CODE. Then either add a column for each value using withColumn, or build an expression for each value and use select. Here is sample code using the select approach:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
OUTPUT
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
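If you need this as a reusable function over an arbitrary list of categorical columns, as the question asks, the same when/select pattern can be generalized along these lines (with_dummy_columns is just an illustrative name; apply it to the original, un-encoded DataFrame):
def with_dummy_columns(df, categorical_cols, prefix="e"):
    # Build one 0/1 expression per distinct value of every requested column
    exprs = []
    for col_name in categorical_cols:
        values = [row[0] for row in df.select(col_name).distinct().collect()]
        exprs += [F.when(F.col(col_name) == v, 1).otherwise(0).alias("{0}_{1}_{2}".format(prefix, col_name, v))
                  for v in values]
    return df.select("*", *exprs)

# e.g. with_dummy_columns(df, ["TYPE", "CODE"]).show()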
The first step is to make a DataFrame from your CSV file. See Get CSV to Spark dataframe; the first answer gives a line-by-line example.
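With a reasonably recent Spark version that boils down to something like this (the file name and options are assumptions about your data):
# read the CSV into a DataFrame; header/inferSchema depend on your file
df = spark.read.csv("your_data.csv", header=True, inferSchema=True)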
Then you can add the columns. Assume you have a DataFrame object called df, and the columns are [ID, TYPE, CODE]. The rest can be fixed with DataFrame.withColumn() and pyspark.sql.functions.when:
from pyspark.sql.functions import when
df_with_extra_columns = df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0)) \
                          .withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0))
(This adds the first two columns; you get the point.)
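If you'd rather not spell out every value by hand, the same withColumn calls can be generated in a loop over the distinct values, e.g. (a rough sketch reusing df from above):
for column in ["TYPE", "CODE"]:
    for value in [row[0] for row in df.select(column).distinct().collect()]:
        df = df.withColumn("e_{0}_{1}".format(column, value),
                           when(df[column] == value, 1).otherwise(0))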
I was looking for the same solution but in Scala; maybe this will help someone:
val list = df.select("category").distinct().rdd.map(r => r(0)).collect()
val oneHotDf = list.foldLeft(df)((acc, category) => acc.withColumn("category_" + category, when(col("category") === category, 1).otherwise(0)))
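For anyone who wants the same foldLeft pattern back in PySpark, functools.reduce does the equivalent (a sketch, assuming a DataFrame df with a category column as above):
from functools import reduce
from pyspark.sql.functions import col, when

categories = [row[0] for row in df.select("category").distinct().collect()]
# fold over the distinct values, adding one 0/1 column per value
one_hot_df = reduce(
    lambda acc, category: acc.withColumn("category_" + category,
                                         when(col("category") == category, 1).otherwise(0)),
    categories,
    df,
)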