Pyspark string array of dynamic length in dataframe column to onehot-encoded

前端未结

I would like to convert a column which contains strings like:

 ["ABC","def","ghi"]
 ["Jkl","ABC","def"]
 ["Xyz","ABC"]

Into a

2 Answers

清歌不尽

2021-01-23 18:22
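You will have to expand the list in the single column into n columns (where n is the number of items in the given list). Then you can use the OneHotEncoderEstimator class to convert them into one-hot encoded features. Note that it expects numeric category indices, so string values need to be indexed first (for example with StringIndexer).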

Please follow the example in the documentation:
```
from pyspark.ml.feature import OneHotEncoderEstimator

df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
                                 outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()
```
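The OneHotEncoder class has been deprecated since v2.3 because it is a stateless transformer: it is not usable on new data where the number of categories may differ from the training data. (In Spark 3.0 and later, OneHotEncoderEstimator was itself renamed to OneHotEncoder, so adjust the import if you are on a newer version.)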
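To make the "stateless" problem concrete, here is a minimal plain-Python sketch (no Spark involved). Both `stateless_one_hot` and `FittedOneHot` are hypothetical helpers written only for illustration, and the sketch ignores details of the real encoder such as dropping the last category. It only shows why an encoder that learns the number of categories once, at fit time, keeps the vector width stable on new data.

```python
def stateless_one_hot(indices):
    """Infer the vector size from whatever batch is being encoded."""
    size = max(indices) + 1
    return [[1 if i == idx else 0 for i in range(size)] for idx in indices]


class FittedOneHot:
    """Learn the vector size once, then reuse it for every later batch."""

    def fit(self, indices):
        self.size = max(indices) + 1
        return self

    def transform(self, indices):
        return [[1 if i == idx else 0 for i in range(self.size)] for idx in indices]


train = [0, 1, 2, 0]
new = [0, 1]  # category 2 does not occur in the new data

stateless_train = stateless_one_hot(train)
stateless_new = stateless_one_hot(new)

fitted = FittedOneHot().fit(train)
fitted_new = fitted.transform(new)

# The stateless widths differ between batches; the fitted width does not.
print(len(stateless_train[0]), len(stateless_new[0]))
print(len(fitted_new[0]))
```

With the stateless version, features computed on the new data no longer line up with the ones a model was trained on, which is the situation the fit/transform design avoids.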

This will help you to split the list: How to split a list to multiple columns in Pyspark?

天命终不由人

2021-01-23 18:26

You can probably use CountVectorizer. Below is an example.

Update: removed the step that dropped duplicates from the arrays; you can instead set binary=True when setting up CountVectorizer:

```
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, col

df = spark.createDataFrame([
    (["ABC", "def", "ghi"],),
    (["Jkl", "ABC", "def"],),
    (["Xyz", "ABC"],),
], ["arr"])
```

Create the CountVectorizer model:

```
cv = CountVectorizer(inputCol='arr', outputCol='c1', binary=True)

model = cv.fit(df)

vocabulary = model.vocabulary
# [u'ABC', u'def', u'Xyz', u'ghi', u'Jkl']
```
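As a rough illustration of what `binary=True` means here, the following plain-Python sketch (no Spark) marks each vocabulary term as present or absent in an array, regardless of how often it occurs. The helper `multi_hot` is hypothetical, and the vocabulary order is simply copied from the comment above rather than computed, since the order CountVectorizer assigns to equally frequent terms is not something this sketch tries to reproduce.

```python
def multi_hot(tokens, vocabulary):
    """Return one 0/1 flag per vocabulary term: 1 if it occurs in tokens."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Vocabulary order copied from the fitted model's output shown in the answer.
vocabulary = ["ABC", "def", "Xyz", "ghi", "Jkl"]

rows = [
    ["ABC", "def", "ghi"],
    ["Jkl", "ABC", "def"],
    ["Xyz", "ABC"],
]

encoded = [multi_hot(row, vocabulary) for row in rows]
for row, flags in zip(rows, encoded):
    print(row, flags)

# A repeated term still yields a single 1, which is why no de-duplication
# step is needed beforehand.
repeated = multi_hot(["ABC", "ABC"], vocabulary)
```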

Create a UDF to convert a vector into an array:

```
udf_to_array = udf(lambda v: v.toArray().tolist(), 'array<double>')
```

Get the vector and check the content:

```
df1 = model.transform(df)

df1.withColumn('c2', udf_to_array('c1')) \
   .select('*', *[col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
   .show(3, 0)
```

```
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|arr            |c1                       |c2                       |ABC|def|Xyz|ghi|Jkl|
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|[ABC, def, ghi]|(5,[0,1,3],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 1.0, 0.0]|1  |1  |0  |1  |0  |
|[Jkl, ABC, def]|(5,[0,1,4],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 0.0, 1.0]|1  |1  |0  |0  |1  |
|[Xyz, ABC]     |(5,[0,2],[1.0,1.0])      |[1.0, 0.0, 1.0, 0.0, 0.0]|1  |0  |1  |0  |0  |
+---------------+-------------------------+-------------------------+---+---+---+---+---+
```
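For readers unfamiliar with the `c1` notation: a value such as `(5,[0,1,3],[1.0,1.0,1.0])` is a sparse vector written as (size, indices, values). The plain-Python sketch below (no Spark; `sparse_to_dense` is a hypothetical helper) shows how that triple expands into the dense array seen in `c2`, which is roughly what the UDF's `toArray()` call does.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# First and third rows of the c1 column from the output above.
row1 = sparse_to_dense(5, [0, 1, 3], [1.0, 1.0, 1.0])
row3 = sparse_to_dense(5, [0, 2], [1.0, 1.0])

print(row1)
print(row3)
```

On Spark 3.0 or later, `pyspark.ml.functions.vector_to_array` should be able to replace the hand-written UDF. That is an assumption about the Spark version in use, not something the answer itself relied on.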
