I would like to convert a column which contains strings like:
["ABC","def","ghi"]
["Jkl","ABC","def"]
["Xyz","ABC"]
Into a
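You will have to expand the list held in the single column into n separate columns (where n is the number of items in the given list). Then you can use the OneHotEncoderEstimator class to convert them into one-hot encoded features. Note that the encoder expects numeric category indices, so string values would first need to be indexed, for example with StringIndexer.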
Please follow the example in the documentation:
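from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoderEstimator

# Assumes Spark 2.3 or 2.4; reuses the active session if one already exists.
spark = SparkSession.builder.getOrCreate()

# Each column holds numeric category indices, not raw strings.
df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
                                 outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)        # learns the number of categories per column
encoded = model.transform(df)  # appends the one-hot vector columns
encoded.show()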
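The older OneHotEncoder class has been deprecated since v2.3. Because it is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. (In Spark 3.0 the old class was removed and OneHotEncoderEstimator was renamed to OneHotEncoder.)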
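To illustrate why fitting matters, here is a toy plain-Python sketch, not Spark code. The names FittedOneHot and fit are hypothetical, and the sketch ignores Spark details such as sparse vectors and the dropLast option. It only shows that a fitted model keeps the category count it learned from the training data, whereas sizing the vector from the new data alone can give a different width.

```python
class FittedOneHot:
    """Toy stand-in for a fitted model: the category count is fixed at fit time."""

    def __init__(self, num_categories):
        self.num_categories = num_categories

    def transform(self, index):
        # Reject indices that were never seen during fitting.
        if not 0 <= index < self.num_categories:
            raise ValueError(f"unseen category index: {index}")
        vec = [0.0] * self.num_categories
        vec[index] = 1.0
        return vec


def fit(indices):
    """Learn the number of categories from the training indices."""
    return FittedOneHot(max(indices) + 1)


train = [0, 1, 2, 0]
new = [1, 0]  # category 2 never appears in the new data

model = fit(train)
encoded_new = [model.transform(i) for i in new]

# Width a stateless approach would infer from the new data alone.
stateless_width = max(new) + 1

print(model.num_categories, stateless_width)
```

With the fitted model, vectors for the new data keep the training width, so they stay compatible with anything trained on the original features.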
This will help you to split the list: How to split a list to multiple columns in Pyspark?
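As a rough illustration of the splitting step, here is a plain-Python sketch that needs no Spark session. It assumes the strings are valid JSON arrays, as in the question, and pads shorter lists with None so every row ends up with the same number of columns. The helper name expand_to_columns and the sample rows are made up for this example; the linked question covers doing the equivalent on a Spark DataFrame.

```python
import json

# Hypothetical sample rows, mirroring the strings in the question.
rows = ['["ABC","def","ghi"]', '["Jkl","ABC","def"]', '["Xyz","ABC"]']


def expand_to_columns(raw_rows):
    """Parse each list string and pad it to the length of the longest list."""
    parsed = [json.loads(r) for r in raw_rows]
    n = max((len(items) for items in parsed), default=0)
    return [items + [None] * (n - len(items)) for items in parsed]


expanded = expand_to_columns(rows)
for row in expanded:
    print(row)
```

Each position in the padded rows corresponds to one of the n columns that would then be indexed and one-hot encoded.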