Spark: OneHot encoder and storing Pipeline (feature dimension issue)

孤城傲影 2020-12-21 16:01

We have a Spark (2.0.1) pipeline consisting of multiple feature transformation stages.

Some of these stages are OneHot encoders. Idea: classify an integer-based category i…

1 Answer
  • 2020-12-21 16:48

    Spark >= 2.3

    Spark 2.3 introduces OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0), which can be used directly and supports multiple input columns.
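
    For example, a minimal sketch (the DataFrame df and the column names are assumptions, not from the question):

    import org.apache.spark.ml.feature.OneHotEncoderEstimator

    val encoder = new OneHotEncoderEstimator()
      .setInputCols(Array("class", "group"))
      .setOutputCols(Array("class_vec", "group_vec"))

    // fit() learns the number of levels from the data, so the resulting
    // model carries its output dimensions and survives saving / loading.
    val model = encoder.fit(df)
    model.transform(df)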

    Spark < 2.3

    You are not using OneHotEncoder as it is intended to be used. OneHotEncoder is a Transformer, not an Estimator. It doesn't store any information about the levels but depends on the Column metadata to determine the output dimensions. If the metadata is missing, as in your case, it uses a fallback strategy and assumes there are max(input_column) levels. Serialization is irrelevant here.
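
    A minimal sketch of that fallback (the DataFrames df1 and df2 and the column name class are assumptions): with no metadata attached, the dimension is derived from whatever values the encoder happens to see.

    import org.apache.spark.ml.feature.OneHotEncoder

    val encoder = new OneHotEncoder()
      .setInputCol("class")
      .setOutputCol("class_vec")

    // Without metadata on "class", the output size is inferred from the
    // maximum value in each DataFrame, so df1 and df2 can yield vectors
    // of different lengths from the very same encoder.
    encoder.transform(df1)
    encoder.transform(df2)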

    Typical usage involves Transformers in the upstream Pipeline, which set metadata for you. One common example is StringIndexer.
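
    For example, a sketch of the usual setup (df, the column name class, and the save path are assumptions): StringIndexer attaches nominal metadata to its output column, so the downstream encoder keeps a stable dimension after the pipeline is saved and reloaded.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("class_idx")
    val encoder = new OneHotEncoder()
      .setInputCol("class_idx")
      .setOutputCol("class_vec")

    // Fitting records the levels in the indexed column's metadata, so the
    // encoder's output dimension no longer depends on the transformed data.
    val model = new Pipeline().setStages(Array(indexer, encoder)).fit(df)
    model.write.overwrite().save("/tmp/onehot-pipeline")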

    It is still possible to set metadata manually, but it is more involved:

    import org.apache.spark.ml.attribute.NominalAttribute

    // Build nominal metadata declaring the six levels "0" through "5".
    val meta = NominalAttribute.defaultAttr
      .withName("class")
      .withValues("0", (1 to 5).map(_.toString): _*)
      .toMetadata

    // Attach the metadata to the column before transforming, so the loaded
    // model can determine its output dimension.
    loadedModel.transform(df2.select($"class".as("class", meta), $"output"))
    

    Similarly in Python (needs Spark >= 2.2):

    from pyspark.sql.functions import col

    # Equivalent nominal metadata expressed as a plain dict.
    meta = {"ml_attr": {
        "vals": [str(x) for x in range(6)],   # provide the full set of levels
        "type": "nominal",
        "name": "class"}}

    # Attach the metadata via Column.alias (supported since Spark 2.2).
    loaded.transform(
        df.withColumn("class", col("class").alias("class", metadata=meta))
    )
    

    Metadata can also be attached using a number of different methods; see How to change column metadata in pyspark?.
