Is it possible to factorize a Spark DataFrame column? By factorizing I mean mapping each unique value in the column to an ID, so that identical values always get the same ID.
Example, the orig
You can use a user-defined function (UDF).
First you create the mapping you need:
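import org.apache.spark.sql.functions.udf
// Known values get fixed IDs; anything else maps to 3.
val updateFunction = udf { (x: String) =>
  x match { case "A" => 0; case "B" => 1; case "C" => 2; case _ => 3 }
}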
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))
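The hard-coded match above only works when the values are known in advance. If they are not, one possible approach is to derive the mapping from the data itself. The following is a minimal sketch, not a definitive implementation. It assumes the column holds strings and that the number of distinct values is small enough to collect to the driver. The names `Factorize` and `factorizeColumn` are hypothetical, and the choice of -1 for null or unseen values is an assumption of this sketch.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object Factorize {
  // Hypothetical helper: replaces each distinct string in `column` with an Int ID.
  def factorizeColumn(df: DataFrame, column: String): DataFrame = {
    // Collect the distinct non-null values to the driver.
    // Assumption: there are few enough of them to fit in driver memory.
    val distinctValues: Array[String] =
      df.select(column).na.drop().distinct().collect().map(_.getString(0))

    // Sort first so the assigned IDs are deterministic across runs.
    val mapping: Map[String, Int] = distinctValues.sorted.zipWithIndex.toMap

    // Null or unseen values map to -1 (a choice made for this sketch).
    val toId = udf((x: String) => mapping.getOrElse(x, -1))

    df.withColumn(column, toId(col(column)))
  }
}
```

With a DataFrame `df` that has a string column `col3`, calling `Factorize.factorizeColumn(df, "col3")` returns a new DataFrame in which equal values of `col3` share the same integer ID. For a large number of distinct values, joining against a DataFrame of (value, ID) pairs would avoid shipping the whole map inside the UDF closure.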