Spark DataFrame handling empty String in OneHotEncoder

醉梦人生 2021-01-18 03:54

I am importing a CSV file (using spark-csv) into a DataFrame, and the file contains empty String values. When I apply the OneHotEncoder, the applic

3 Answers
  • 2021-01-18 04:22

    Yep, it's a little thorny, but you can replace the empty string with a value that is guaranteed to differ from every other value. Note that I am using the PySpark DataFrameNaFunctions API, but Scala's should be similar.

    df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
    df = df.na.replace('', 'EMPTY', 'category')
    df.show()
    
    +---+--------+
    | id|category|
    +---+--------+
    |  0|       a|
    |  1|       b|
    |  2|       c|
    |  3|   EMPTY|
    |  4|       a|
    |  5|       c|
    +---+--------+
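
The advice above depends on the placeholder not colliding with a real category. A minimal plain-Python sketch of one way to pick such a value follows; `pick_placeholder` and its underscore-suffix rule are hypothetical helpers for illustration, not part of any Spark API.

```python
def pick_placeholder(values, base="EMPTY"):
    """Return a label derived from `base` that does not occur in `values`."""
    existing = set(values)
    candidate = base
    # Append underscores until the candidate cannot be mistaken for real data.
    while candidate in existing:
        candidate += "_"
    return candidate


categories = ["a", "b", "c", "", "a", "c"]
placeholder = pick_placeholder(categories)
# Swap every empty string for the collision-free placeholder.
cleaned = [placeholder if c == "" else c for c in categories]
print(placeholder)
print(cleaned)
```

In a Spark job the distinct values of the column would have to be collected first; for small category sets that is usually cheap, and the resulting placeholder can then be passed to `df.na.replace`.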
    
  • 2021-01-18 04:41

    The OneHotEncoder/OneHotEncoderEstimator does not accept an empty string as a name; if the column contains one, you get the following error:

    java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
        at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
        [...]

    This is how I would do it (there is another way, cf. @Anthony's answer):

    I'll create a UDF to process the empty category:

    import org.apache.spark.sql.functions._
    
    def processMissingCategory = udf[String, String] { s => if (s == "") "NA"  else s }
    

    Then, I'll apply the UDF to the column:

    val df = sqlContext.createDataFrame(Seq(
       (0, "a"),
       (1, "b"),
       (2, "c"),
       (3, ""),         //<- original example has "a" here
       (4, "a"),
       (5, "c")
    )).toDF("id", "category")
      // col(...) comes from org.apache.spark.sql.functions._ and needs no extra implicits
      .withColumn("category", processMissingCategory(col("category")))
    
    df.show
    // +---+--------+
    // | id|category|
    // +---+--------+
    // |  0|       a|
    // |  1|       b|
    // |  2|       c|
    // |  3|      NA|
    // |  4|       a|
    // |  5|       c|
    // +---+--------+
    

    Now you can go back to your transformations:

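    import org.apache.spark.ml.feature.StringIndexer

    // Map each distinct category string to a numeric index
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
    val indexed = indexer.transform(df)
    indexed.show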
    // +---+--------+-------------+
    // | id|category|categoryIndex|
    // +---+--------+-------------+
    // |  0|       a|          0.0|
    // |  1|       b|          2.0|
    // |  2|       c|          1.0|
    // |  3|      NA|          3.0|
    // |  4|       a|          0.0|
    // |  5|       c|          1.0|
    // +---+--------+-------------+
    
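    // Spark < 2.3
    // val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    // Spark 2.3 and 2.4 (the class was renamed to OneHotEncoder in Spark 3.0)
    import org.apache.spark.ml.feature.OneHotEncoderEstimator
    val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("categoryVec"))
    // OneHotEncoderEstimator is an Estimator, so it must be fitted before it can transform
    val encoded = encoder.fit(indexed).transform(indexed)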
    
    encoded.show
    // +---+--------+-------------+-------------+
    // | id|category|categoryIndex|  categoryVec|
    // +---+--------+-------------+-------------+
    // |  0|       a|          0.0|(3,[0],[1.0])|
    // |  1|       b|          2.0|(3,[2],[1.0])|
    // |  2|       c|          1.0|(3,[1],[1.0])|
    // |  3|      NA|          3.0|    (3,[],[])|
    // |  4|       a|          0.0|(3,[0],[1.0])|
    // |  5|       c|          1.0|(3,[1],[1.0])|
    // +---+--------+-------------+-------------+
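
The `(3,[],[])` entry for `NA` can look like a failure, but it follows from the encoder dropping the last category by default: four labels are encoded into three slots, and the highest index becomes the all-zero vector. The plain-Python sketch below reproduces the two tables under stated assumptions. `index_labels` and `one_hot` are hypothetical stand-ins rather than Spark's implementation, and breaking frequency ties by first appearance is an assumption chosen to match the output shown; Spark's own tie-breaking may differ between versions.

```python
from collections import Counter


def index_labels(labels):
    """Map each label to an index, most frequent label first.

    Ties are broken by first appearance (an assumption for this sketch).
    """
    counts = Counter(labels)
    first_seen = {}
    for pos, label in enumerate(labels):
        first_seen.setdefault(label, pos)
    ordered = sorted(counts, key=lambda l: (-counts[l], first_seen[l]))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a sparse vector as (size, active_indices, values)."""
    size = num_categories - 1 if drop_last else num_categories
    if index >= size:
        # The dropped last category has no slot of its own: all zeros.
        return (size, [], [])
    return (size, [index], [1.0])


labels = ["a", "b", "c", "NA", "a", "c"]
mapping = index_labels(labels)
for label in labels:
    print(label, mapping[label], one_hot(mapping[label], len(mapping)))
```

Passing `drop_last=False` gives every label its own slot, which corresponds to calling `setDropLast(false)` on the encoder.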
    

    EDIT:

    @Anthony's solution in Scala:

    df.na.replace("category", Map( "" -> "NA")).show
    // +---+--------+
    // | id|category|
    // +---+--------+
    // |  0|       a|
    // |  1|       b|
    // |  2|       c|
    // |  3|      NA|
    // |  4|       a|
    // |  5|       c|
    // +---+--------+
    

    I hope this helps!

  • 2021-01-18 04:47

    If the column contains null, the OneHotEncoder fails with a NullPointerException, so I extended the UDF to translate null values as well:

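    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    import org.apache.spark.sql.SQLContext

    object OneHotEncoderExample {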
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
    
        // $example on$
        val df1 = sqlContext.createDataFrame(Seq(
          (0.0, "a"),
          (1.0, "b"),
          (2.0, "c"),
          (3.0, ""),
          (4.0, null),
          (5.0, "c")
        )).toDF("id", "category")
    
    
        import org.apache.spark.sql.functions.udf
        def emptyValueSubstitution = udf[String, String] {
          case "" => "NA"
          case null => "null"
          case value => value
        }
        val df = df1.withColumn("category", emptyValueSubstitution( df1("category")) )
    
    
        val indexer = new StringIndexer()
          .setInputCol("category")
          .setOutputCol("categoryIndex")
          .fit(df)
        val indexed = indexer.transform(df)
        indexed.show()
    
        val encoder = new OneHotEncoder()
          .setInputCol("categoryIndex")
          .setOutputCol("categoryVec")
          .setDropLast(false)
        val encoded = encoder.transform(indexed)
        encoded.show()
        // $example off$
        sc.stop()
      }
    }
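
For readers following the PySpark answer, the same substitution rule can be written as a plain Python function. This is only a sketch: `substitute_missing` and its parameter names are hypothetical, and the labels mirror the ones used in the Scala UDF above.

```python
def substitute_missing(value, empty_label="NA", null_label="null"):
    """Replace None and empty strings with explicit category labels."""
    if value is None:
        return null_label
    if value == "":
        return empty_label
    # Any other value is a real category and passes through unchanged.
    return value


categories = ["a", "b", "c", "", None, "c"]
cleaned = [substitute_missing(c) for c in categories]
print(cleaned)
```

A function like this could be wrapped with `pyspark.sql.functions.udf` and applied through `withColumn`. Note that a real category literally named "NA" or "null" would be merged with the substituted rows, so the labels should be chosen with the data in mind.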
    