I have several categorical features and would like to transform them all using OneHotEncoder. However, when I tried to apply the StringIndexer, there I
Spark >= 3.0:

In Spark 3.0, OneHotEncoderEstimator has been renamed to OneHotEncoder, so replace:

    from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel

    encoder = OneHotEncoderEstimator(...)

with:

    from pyspark.ml.feature import OneHotEncoder, OneHotEncoderModel

    encoder = OneHotEncoder(...)
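
Spark 3.0 also added multi-column support to StringIndexer (the inputCols / outputCols parameters), so on 3.0+ the whole transformation can probably be written with one indexer and one encoder. The following is only a sketch under that assumption: the helper names indexed_name, encoded_name and build_pipeline are hypothetical, and the column naming simply mirrors the "_indexed" / "_encoded" suffixes used elsewhere in this answer.

```python
def indexed_name(col):
    """Output column name for the indexed version of `col` (naming is an assumption)."""
    return "{0}_indexed".format(col)


def encoded_name(col):
    """Output column name for the one-hot encoded version of `col`."""
    return "{0}_encoded".format(indexed_name(col))


def build_pipeline(cols):
    """Build an index -> encode -> assemble pipeline for Spark >= 3.0.

    Needs an active SparkSession, because PySpark ML stages create JVM objects.
    """
    # Imported here so the naming helpers above stay usable without Spark
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # A single StringIndexer handles every column (multi-column support, Spark 3.0+)
    indexer = StringIndexer(
        inputCols=cols,
        outputCols=[indexed_name(c) for c in cols]
    )
    # A single OneHotEncoder encodes all indexed columns
    encoder = OneHotEncoder(
        inputCols=indexer.getOutputCols(),
        outputCols=[encoded_name(c) for c in cols]
    )
    assembler = VectorAssembler(
        inputCols=encoder.getOutputCols(),
        outputCol="features"
    )
    return Pipeline(stages=[indexer, encoder, assembler])
```

With a DataFrame df that has string columns a, b, c and d, usage would look like build_pipeline(['a', 'b', 'c', 'd']).fit(df).transform(df).
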
Spark >= 2.3:

You can use the newly added OneHotEncoderEstimator, which encodes multiple columns in a single stage:

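    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

    # One StringIndexer per column (df is assumed to have string columns a-d)
    cols = ['a', 'b', 'c', 'd']
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in cols
    ]
    # A single estimator encodes all indexed columns
    encoder = OneHotEncoderEstimator(
        inputCols=[indexer.getOutputCol() for indexer in indexers],
        outputCols=[
            "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
    )
    assembler = VectorAssembler(
        inputCols=encoder.getOutputCols(),
        outputCol="features"
    )
    pipeline = Pipeline(stages=indexers + [encoder, assembler])
    pipeline.fit(df).transform(df)
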
Spark < 2.3:

It is not possible. The StringIndexer transformer operates on only a single column at a time, so you'll need a separate indexer and a separate encoder for each column you want to transform:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    cols = ['a', 'b', 'c', 'd']

    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in cols
    ]

    encoders = [
        OneHotEncoder(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    pipeline.fit(df).transform(df).show()
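
To make it clearer what each indexer/encoder pair computes for one column, here is a plain-Python sketch of the idea. It is not Spark's implementation. It assumes the default settings (labels ordered by descending frequency, last category dropped by the encoder), breaks frequency ties alphabetically as a simplification, and uses dense lists where Spark would produce sparse vectors. The function names string_index and one_hot are made up for illustration.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for `index`.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


column = ["x", "y", "x", "z", "x", "y"]
mapping = string_index(column)  # {'x': 0, 'y': 1, 'z': 2}
encoded = [one_hot(mapping[v], len(mapping)) for v in column]
# 'x' -> [1.0, 0.0], 'y' -> [0.0, 1.0], 'z' -> [0.0, 0.0]
```

The VectorAssembler step then concatenates the encoded vectors of all columns into the single "features" vector.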