We have a pipeline (Spark 2.0.1) consisting of multiple feature transformation stages.
Some of these stages are OneHot encoders. Idea: classify an integer-based category i
Spark >= 2.3
Spark 2.3 introduces OneHotEncoderEstimator (to be renamed OneHotEncoder in Spark 3.0), which can be used directly and supports multiple input columns.
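For example (a minimal sketch; df and the column names are assumptions, not part of the question):

import org.apache.spark.ml.feature.OneHotEncoderEstimator

// Being an Estimator, it learns the number of levels during fit
// and stores them in the resulting model, so no column metadata
// is required at transform time.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("class", "status"))
  .setOutputCols(Array("class_vec", "status_vec"))

val model = encoder.fit(df)
val encoded = model.transform(df)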
Spark < 2.3
You don't use OneHotEncoder as it is intended to be used. OneHotEncoder is a Transformer, not an Estimator. It doesn't store any information about the levels but depends on the Column metadata to determine the output dimensions. If metadata is missing, as in your case, it uses a fallback strategy and assumes there are max(input_column) levels. Serialization is irrelevant here.
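You can observe the consequence of that fallback directly (a minimal sketch for spark-shell; the column name and values are illustrative):

import org.apache.spark.ml.feature.OneHotEncoder
import spark.implicits._  // assumes an active SparkSession named spark

val encoder = new OneHotEncoder()
  .setInputCol("class")
  .setOutputCol("class_vec")

// Without metadata the output size is derived from max("class"),
// so the same transformer yields different dimensions per dataset:
encoder.transform(Seq(0.0, 1.0, 2.0).toDF("class")).show()  // size 2 vectors
encoder.transform(Seq(0.0, 1.0, 5.0).toDF("class")).show()  // size 5 vectors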
Typical usage involves Transformers in the upstream Pipeline, which set the metadata for you. One common example is StringIndexer.
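For instance (a sketch, assuming a DataFrame df with a "class" column):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("class")
  .setOutputCol("class_idx")

val encoder = new OneHotEncoder()
  .setInputCol("class_idx")
  .setOutputCol("class_vec")

// The fitted StringIndexerModel attaches nominal metadata to "class_idx",
// so the downstream OneHotEncoder can determine the vector size from it.
val model = new Pipeline().setStages(Array(indexer, encoder)).fit(df)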
It is still possible to set metadata manually, but it is more involved:
import org.apache.spark.ml.attribute.NominalAttribute

// Build nominal metadata listing all six levels ("0" through "5"):
val meta = NominalAttribute.defaultAttr
  .withName("class")
  .withValues("0", (1 to 5).map(_.toString): _*)
  .toMetadata

// Attach the metadata to the column before transforming:
loadedModel.transform(df2.select($"class".as("class", meta), $"output"))
Similarly in Python (needs Spark >= 2.2):
from pyspark.sql.functions import col

# Provide the full set of levels as nominal attribute metadata:
meta = {"ml_attr": {
    "vals": [str(x) for x in range(6)],
    "type": "nominal",
    "name": "class"}}

loaded.transform(
    df.withColumn("class", col("class").alias("class", metadata=meta))
)
Metadata can also be attached using a number of different methods, as described in How to change column metadata in pyspark?.