Apply OneHotEncoder to several categorical columns in Spark MLlib

情话喂你 2021-02-03 13:16

I have several categorical features and would like to transform them all using OneHotEncoder. However, when I tried to apply the StringIndexer, there I

2 answers
  • 2021-02-03 13:38

    Spark >= 3.0:

    In Spark 3.0, OneHotEncoderEstimator was renamed to OneHotEncoder. Replace:

    from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel
    
    encoder = OneHotEncoderEstimator(...)
    

    with

    from pyspark.ml.feature import OneHotEncoder, OneHotEncoderModel
    
    encoder = OneHotEncoder(...)
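
For Spark 3.0 or later, the whole chain can be sketched with one indexer and one encoder, on the assumption that StringIndexer also accepts inputCols/outputCols there (multi-column support was added to it in 3.0). The helper names encoded_column_names and build_pipeline are illustrative and not part of PySpark; the naming scheme follows the "{0}_indexed" / "{0}_encoded" pattern used in the other examples in this answer.

```python
def encoded_column_names(cols):
    """Return (indexed, encoded) output column names for the given input columns."""
    indexed = ["{0}_indexed".format(c) for c in cols]
    encoded = ["{0}_encoded".format(c) for c in indexed]
    return indexed, encoded


def build_pipeline(cols):
    """Build an index -> one-hot -> assemble pipeline (assumes Spark >= 3.0)."""
    # Imported here so the naming helper above works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexed, encoded = encoded_column_names(cols)
    # A single StringIndexer and a single OneHotEncoder cover every column.
    indexer = StringIndexer(inputCols=cols, outputCols=indexed)
    encoder = OneHotEncoder(inputCols=indexed, outputCols=encoded)
    assembler = VectorAssembler(inputCols=encoded, outputCol="features")
    return Pipeline(stages=[indexer, encoder, assembler])
```

With an active SparkSession and a DataFrame df, this would be used as build_pipeline(['a', 'b', 'c', 'd']).fit(df).transform(df). The stages need a running Spark context when they are constructed, so call build_pipeline only after the session exists.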
    

    Spark >= 2.3

    You can use the newly added OneHotEncoderEstimator, which encodes multiple columns in a single stage:

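    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, VectorAssembler

    # `indexers` is the list of StringIndexer stages built in the Spark < 2.3 example below.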
    
    encoder = OneHotEncoderEstimator(
        inputCols=[indexer.getOutputCol() for indexer in indexers],
        outputCols=[
            "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
    )
    
    assembler = VectorAssembler(
        inputCols=encoder.getOutputCols(),
        outputCol="features"
    )
    
    pipeline = Pipeline(stages=indexers + [encoder, assembler])
    pipeline.fit(df).transform(df)
    

    Spark < 2.3

    It is not possible to do this in a single stage. The StringIndexer transformer operates on only one column at a time, so you need one indexer and one encoder for each column you want to transform.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    
    cols = ['a', 'b', 'c', 'd']
    
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in cols
    ]
    
    encoders = [
        OneHotEncoder(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol())) 
        for indexer in indexers
    ]
    
    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )
    
    
    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    pipeline.fit(df).transform(df).show()
    
  • 2021-02-03 13:42

    I think the code above will not produce the required result. The encoders section needs a small modification: it applies StringIndexer a second time to the indexers' output, so the result is again only index values rather than one-hot vectors.

    # In the following section:
    encoders = [
        StringIndexer(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # replace StringIndexer with OneHotEncoder as follows:
    encoders = [
        OneHotEncoder(
            dropLast=False,
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]
    

    The complete code now looks like this:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    
    categorical_columns = ['Gender', 'Age', 'Occupation', 'City_Category', 'Marital_Status']

    # Index the string values of each column
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in categorical_columns
    ]

    # One-hot encode the indexed values of each column
    encoders = [
        OneHotEncoder(
            dropLast=False,
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # Assemble the encoded columns into a single feature vector
    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )

    # data_df is the input DataFrame containing the columns listed above
    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    model = pipeline.fit(data_df)
    transformed = model.transform(data_df)
    transformed.show(5)
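
The dropLast=False argument used in this answer changes the length of each encoded vector. By default OneHotEncoder drops the last category, so a column with n categories yields vectors of length n - 1 and the last category is encoded as all zeros; with dropLast=False every category gets its own slot. The plain-Python function below only illustrates that layout. one_hot is a hypothetical helper, not a Spark API, and Spark itself emits sparse vectors rather than lists.

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector for a category index, mimicking the dropLast option."""
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    # With drop_last, the final category has no slot of its own.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Three categories (indices 0, 1, 2), as StringIndexer would produce them.
for i in range(3):
    print(i, one_hot(i, 3), one_hot(i, 3, drop_last=False))
```

Dropping the last category avoids a linearly dependent column, which matters for some linear models; keeping it (dropLast=False) makes each vector easier to read back to a category.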
    

    For more details, see:

    [1] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer
    [2] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder
