Pyspark string array of dynamic length in dataframe column to onehot-encoded

没有蜡笔的小新 2021-01-23 17:36

I would like to convert a column which contains strings like:

 ["ABC","def","ghi"]
 ["Jkl","ABC","def"]
 ["Xyz","ABC"]

Into a

2 Answers
  •  清歌不尽
    2021-01-23 18:22

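    You will have to expand the list held in a single column into n separate columns (where n is the number of items in the given list). You can then use the OneHotEncoderEstimator class to convert those columns into one-hot encoded features. Because the encoder expects numeric category indices, string values must be indexed first, for example with StringIndexer.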

    Please follow the example in the documentation:

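    from pyspark.ml.feature import OneHotEncoderEstimator
    from pyspark.sql import SparkSession

    # Reuse the active session, or start one if none exists.
    spark = SparkSession.builder.getOrCreate()

    # Both input columns hold numeric category indices.
    df = spark.createDataFrame([
        (0.0, 1.0),
        (1.0, 0.0),
        (2.0, 1.0),
        (0.0, 2.0),
        (0.0, 1.0),
        (2.0, 0.0)
    ], ["categoryIndex1", "categoryIndex2"])

    # fit() learns the number of categories in each input column;
    # transform() adds one encoded vector column per input column.
    encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
                                     outputCols=["categoryVec1", "categoryVec2"])
    model = encoder.fit(df)
    encoded = model.transform(df)
    encoded.show()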
    

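    The OneHotEncoder class has been deprecated since Spark 2.3: because it is a stateless transformer, it cannot be used on new data where the number of categories may differ from the training data. (In Spark 3.0 and later, the old stateless class was removed and OneHotEncoderEstimator was renamed to OneHotEncoder.)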

    This question will help you split the list: "How to split a list to multiple columns in Pyspark?"
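
As a rough illustration of the expand-then-encode flow described in this answer, the plain-Python sketch below (no Spark involved) pads the three sample rows from the question to n columns, taking n as the length of the longest list, and then one-hot encodes a single column. The helper names `expand_to_columns` and `one_hot_column` are hypothetical, and padding shorter lists with `None` is an assumption about how the dynamic length should be handled. The vectors are only meant to show the shape of the data; Spark's encoder returns sparse vectors and drops the last category by default, so its output will look different.

```python
# Plain-Python illustration only; this is not PySpark code.
rows = [
    ["ABC", "def", "ghi"],
    ["Jkl", "ABC", "def"],
    ["Xyz", "ABC"],
]


def expand_to_columns(rows):
    """Pad every list to the length of the longest one (assumption: pad with None)."""
    n = max(len(r) for r in rows)
    return [r + [None] * (n - len(r)) for r in rows]


def one_hot_column(values):
    """One-hot encode one column; a None entry becomes an all-zero vector."""
    categories = sorted({v for v in values if v is not None})
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        vec = [0] * len(categories)
        if v is not None:
            vec[index[v]] = 1
        encoded.append(vec)
    return categories, encoded


expanded = expand_to_columns(rows)
first_col = [r[0] for r in expanded]
categories, encoded = one_hot_column(first_col)
print(expanded)
print(categories, encoded)
```

Note that encoding each position separately treats "ABC" in the first column and "ABC" in the second column as unrelated categories, which is a consequence of the column-per-item layout.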
