Factorize Spark column

抹茶落季 2021-01-07 02:31

Is it possible to factorize a Spark dataframe column? By factorizing I mean mapping each unique value in the column to a consistent ID.

For example, if the original column contains the values A, A, B, C, A, the factorized column would be 0, 0, 1, 2, 0.

2 Answers
  • 2021-01-07 02:51

    You can use a user-defined function.

    First you create the mapping you need:

    // Map each known value to its ID; any other value falls into a catch-all bucket
    val updateFunction = udf { (x: String) =>
      x match {
        case "A" => 0
        case "B" => 1
        case "C" => 2
        case _   => 3
      }
    }
    

    And now you only have to apply it to your DataFrame:

    df.withColumn("col3", updateFunction(df.col("col3")))
    
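    Hard-coding the cases only works when the values are known up front. If they are not, the mapping can be built from the data itself before defining the UDF — a sketch, assuming the number of distinct values is small enough to collect to the driver:

    import org.apache.spark.sql.functions.udf

    // Collect the distinct values and assign each one a sequential ID
    val mapping = df.select("col3").distinct()
      .collect()
      .map(_.getString(0))
      .sorted
      .zipWithIndex
      .toMap

    // Look each value up in the precomputed map
    val factorize = udf { (x: String) => mapping(x) }
    df.withColumn("col3", factorize(df.col("col3")))

    Sorting before zipping makes the assigned IDs deterministic across runs, which `distinct()` alone does not guarantee.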
  • 2021-01-07 03:03

    You can use StringIndexer to encode letters into indices:

    import org.apache.spark.ml.feature.StringIndexer
    
    val indexer = new StringIndexer()
      .setInputCol("col3")
      .setOutputCol("col3Index")
    
    val indexed = indexer.fit(df).transform(df)
    indexed.show()
    
    +----------+----------------+----+---------+
    |      col1|            col2|col3|col3Index|
    +----------+----------------+----+---------+
    |1473490929|4060600988513370|   A|      0.0|
    |1473492972|4060600988513370|   A|      0.0|
    |1473509764|4060600988513370|   B|      1.0|
    |1473513432|4060600988513370|   C|      2.0|
    |1473513432|4060600988513370|   A|      0.0|
    +----------+----------------+----+---------+
    

    Data:

    val df = spark.createDataFrame(Seq(
                  (1473490929, "4060600988513370", "A"),
                  (1473492972, "4060600988513370", "A"),  
                  (1473509764, "4060600988513370", "B"),
                  (1473513432, "4060600988513370", "C"),
                  (1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")
    
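    Note that StringIndexer assigns indices by descending label frequency by default, so the most frequent value gets 0.0. If you later need to recover the original labels from the indices, IndexToString reverses the mapping — a sketch, reusing the `indexed` DataFrame from above:

    import org.apache.spark.ml.feature.IndexToString

    // Recover the original labels from the numeric indices;
    // the labels are read from the metadata StringIndexer attached to col3Index
    val converter = new IndexToString()
      .setInputCol("col3Index")
      .setOutputCol("col3Original")

    val restored = converter.transform(indexed)
    restored.select("col3", "col3Index", "col3Original").show()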