Pyspark string array of dynamic length in dataframe column to onehot-encoded

没有蜡笔的小新 2021-01-23 17:36

I would like to convert a column which contains strings like:

 ["ABC","def","ghi"]
 ["Jkl","ABC","def"]
 ["Xyz","ABC"]

Into a

2 Answers

清歌不尽 (OP)

2021-01-23 18:22
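You will have to expand the list in the single column into n separate columns, where n is the number of items in the list. Because the lists vary in length, n has to be the length of the longest list, and shorter rows will have nulls in the trailing columns. You can then use the OneHotEncoderEstimator class to convert the columns into one-hot encoded features. Note that it expects numeric category indices as input, so string values need to be indexed first (for example with StringIndexer).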

Please follow the example in the documentation:
```
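from pyspark.ml.feature import OneHotEncoderEstimator

# Assumes an active SparkSession named `spark`.
# Each column holds numeric category indices, not raw strings.
df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

# fit() learns the number of categories in each input column.
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
                                 outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
# transform() reuses the learned category sizes and outputs sparse vectors.
encoded = model.transform(df)
encoded.show()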
```
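The older OneHotEncoder class has been deprecated since v2.3 because it is a stateless transformer: it is not usable on new data where the number of categories may differ from the training data. In Spark 3.0 and later, OneHotEncoderEstimator was renamed to OneHotEncoder, so on those versions the import above becomes `from pyspark.ml.feature import OneHotEncoder`.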
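
To illustrate why a stateless transformer breaks on new data, here is a minimal pure-Python sketch. It is not Spark's implementation: `stateless_one_hot` and `FittedOneHot` are hypothetical helpers, and the sketch ignores details such as Spark's `dropLast` option. It only shows how the vector width can drift when it is inferred from each dataset separately, and how a fitted model avoids that.

```python
# Pure-Python illustration only; these helpers are hypothetical, not Spark APIs.

def stateless_one_hot(indices):
    """Infer the vector width from whatever data is passed in."""
    size = int(max(indices)) + 1
    return [[1.0 if i == int(idx) else 0.0 for i in range(size)] for idx in indices]


class FittedOneHot:
    """Remember the number of categories seen at fit time."""

    def fit(self, indices):
        self.size = int(max(indices)) + 1
        return self

    def transform(self, indices):
        return [[1.0 if i == int(idx) else 0.0 for i in range(self.size)]
                for idx in indices]


train = [0.0, 1.0, 2.0]
new = [0.0, 1.0]  # category 2 happens to be absent from the new data

# Stateless: the width follows the data, so the two results disagree (3 vs 2).
train_vecs = stateless_one_hot(train)
new_vecs = stateless_one_hot(new)
print(len(train_vecs[0]), len(new_vecs[0]))

# Fitted: the width learned from the training data is reused (3 and 3).
model = FittedOneHot().fit(train)
print(len(model.transform(train)[0]), len(model.transform(new)[0]))
```

With the stateless version, a downstream model trained on 3-wide vectors would receive 2-wide vectors at prediction time, which is the mismatch the fitted estimator is meant to prevent.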

For splitting the list, see the related question: How to split a list to multiple columns in Pyspark?