Spark: OneHot encoder and storing Pipeline (feature dimension issue)

孤城傲影 2020-12-21 16:01

We have a Spark (2.0.1) pipeline consisting of multiple feature transformation stages.

Some of these stages are OneHot encoders. Idea: classify an integer-based category i…

1 Answer
  • 2020-12-21 16:48

    Spark >= 2.3

    Spark 2.3 introduces OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0), which can be used directly and supports multiple input columns.
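
    For example, a minimal sketch (the DataFrame df and the column names are assumptions, not from the question):

    import org.apache.spark.ml.feature.OneHotEncoderEstimator

    val encoder = new OneHotEncoderEstimator()
      .setInputCols(Array("class", "group"))
      .setOutputCols(Array("class_vec", "group_vec"))

    // fit() learns the number of levels from the data, so the resulting
    // model carries its output dimensions and survives saving / loading.
    val model = encoder.fit(df)
    model.transform(df)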

    Spark < 2.3

    You are not using OneHotEncoder as it is intended to be used. OneHotEncoder is a Transformer, not an Estimator. It doesn't store any information about the levels but depends on the Column metadata to determine the output dimensions. If the metadata is missing, as in your case, it uses a fallback strategy and assumes there are max(input_column) levels. Serialization is irrelevant here.
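
    A minimal sketch of that fallback (the DataFrames df1 and df2 and the column name class are assumptions): with no metadata attached, the dimension is derived from whatever values the encoder happens to see.

    import org.apache.spark.ml.feature.OneHotEncoder

    val encoder = new OneHotEncoder()
      .setInputCol("class")
      .setOutputCol("class_vec")

    // Without metadata on "class", the output size is inferred from the
    // maximum value in each DataFrame, so df1 and df2 can yield vectors
    // of different lengths from the very same encoder.
    encoder.transform(df1)
    encoder.transform(df2)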

    Typical usage involves Transformers in the upstream Pipeline, which set metadata for you. One common example is StringIndexer.
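
    For example, a sketch of the usual setup (df, the column name class, and the save path are assumptions): StringIndexer attaches nominal metadata to its output column, so the downstream encoder keeps a stable dimension after the pipeline is saved and reloaded.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("class_idx")
    val encoder = new OneHotEncoder()
      .setInputCol("class_idx")
      .setOutputCol("class_vec")

    // Fitting records the levels in the indexed column's metadata, so the
    // encoder's output dimension no longer depends on the transformed data.
    val model = new Pipeline().setStages(Array(indexer, encoder)).fit(df)
    model.write.overwrite().save("/tmp/onehot-pipeline")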

    It is still possible to set metadata manually, but it is more involved:

    import org.apache.spark.ml.attribute.NominalAttribute

    // Build nominal metadata declaring the six levels "0" through "5".
    val meta = NominalAttribute.defaultAttr
      .withName("class")
      .withValues("0", (1 to 5).map(_.toString): _*)
      .toMetadata

    // Attach the metadata to the column before transforming, so the loaded
    // model can determine its output dimension.
    loadedModel.transform(df2.select($"class".as("class", meta), $"output"))
    

    Similarly in Python (needs Spark >= 2.2):

    from pyspark.sql.functions import col

    # Equivalent nominal metadata expressed as a plain dict.
    meta = {"ml_attr": {
        "vals": [str(x) for x in range(6)],   # provide the full set of levels
        "type": "nominal",
        "name": "class"}}

    # Attach the metadata via Column.alias (supported since Spark 2.2).
    loaded.transform(
        df.withColumn("class", col("class").alias("class", metadata=meta))
    )
    

    Metadata can also be attached using a number of different methods; see How to change column metadata in pyspark?.
