Factorize Spark column

前端未结


Is it possible to factorize a Spark DataFrame column? By factorizing I mean mapping each unique value in the column to an integer ID, so that equal values always get the same ID.

Example, the orig


2 Answers

被撕碎了的回忆

2021-01-07 02:51
You can use a user-defined function (UDF).

First you create the mapping you need:
```
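import org.apache.spark.sql.functions.udf

// Map each known value to a fixed ID; any other value falls through to 3.
val updateFunction = udf { (x: String) =>
  x match {
    case "A" => 0
    case "B" => 1
    case "C" => 2
    case _ => 3
  }
}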
```
Then apply it to your DataFrame (reusing the name `col3` overwrites that column with the IDs):
```
df.withColumn("col3", updateFunction(df.col("col3")))
```
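If the set of values is not known up front, the hard-coded `match` can be replaced by a mapping built from the data itself. The following is only a sketch under a few assumptions: the column has no nulls, the number of distinct values is small enough to collect to the driver, and the names `mapping`, `factorize` and `col3Id` are made up for illustration.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// In spark-shell a session named `spark` already exists; this line is for standalone use.
val spark = SparkSession.builder().master("local[*]").appName("factorize-sketch").getOrCreate()

// Small stand-in for the real DataFrame (assumption: a string column named col3).
val df = spark.createDataFrame(Seq(
  Tuple1("A"), Tuple1("A"), Tuple1("B"), Tuple1("C"), Tuple1("A"))).toDF("col3")

// Collect the distinct values to the driver and number them.
// Sorting first makes the assignment deterministic across runs.
val mapping: Map[String, Int] =
  df.select("col3").distinct().collect().map(_.getString(0)).sorted.zipWithIndex.toMap

// Values missing from the mapping (including null) get -1.
val factorize = udf((x: String) => mapping.getOrElse(x, -1))

val factorized = df.withColumn("col3Id", factorize(col("col3")))
factorized.show()
```
Writing to a new column (`col3Id`) keeps the original values around, which makes it easy to check the mapping. For columns with many distinct values, collecting to the driver may not be practical.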

心在旅途

2021-01-07 03:03

You can use StringIndexer to encode letters into indices:

```
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")

val indexed = indexer.fit(df).transform(df)
indexed.show()
```

```
+----------+----------------+----+---------+
|      col1|            col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370|   A|      0.0|
|1473492972|4060600988513370|   A|      0.0|
|1473509764|4060600988513370|   B|      1.0|
|1473513432|4060600988513370|   C|      2.0|
|1473513432|4060600988513370|   A|      0.0|
+----------+----------------+----+---------+
```

Data:

```
val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")
```
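
Two details of `StringIndexer` are worth knowing here: by default it orders labels by descending frequency (so the most common value gets index 0), and the output column is a `Double`. If you would rather have alphabetical, integer IDs, something along these lines should work. This is a sketch, assuming Spark 2.3 or later (where `setStringOrderType` is available) and a cut-down version of the sample data.
```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell a session named `spark` already exists; this line is for standalone use.
val spark = SparkSession.builder().master("local[*]").appName("indexer-sketch").getOrCreate()

val df = spark.createDataFrame(Seq(
  (1473490929, "A"),
  (1473492972, "A"),
  (1473509764, "B"),
  (1473513432, "C"),
  (1473513432, "A"))).toDF("col1", "col3")

val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")
  .setStringOrderType("alphabetAsc") // order labels alphabetically instead of by frequency

// Cast the Double index produced by StringIndexer to an integer column.
val indexed = indexer.fit(df).transform(df)
  .withColumn("col3Index", col("col3Index").cast("int"))

indexed.show()
```
With alphabetical ordering the IDs no longer depend on how often each value occurs, so `A`, `B` and `C` map to 0, 1 and 2 regardless of their counts.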
