I have a PySpark dataframe

    +-------+--------------+----+----+
    |address|          date|name|food|
    +-------+--------------+----+----+
    |1111111|20151122045510| Yin|gre |
    |1111111|20151122045501| Yin|gre |
    |1111111|20151122045500| Yln|gra |
    |1111112|20151122065832| Yun|ddd |
    |1111113|20160101003221| Yan|fdf |
    |1111111|20160703045231| Yin|gre |
    |1111114|20150419134543| Yin|fdf |
    |1111115|20151123174302| Yen|ddd |
    |2111115|      20123192| Yen|gre |
    +-------+--------------+----+----+

that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
    df_ind = indexer.transform(df)
    df_ind.show()
    +-------+--------------+----+----------+----+
    |address|          date|name|name_index|food|
    +-------+--------------+----+----------+----+
    |1111111|20151122045510| Yin|       0.0|gre |
    |1111111|20151122045501| Yin|       0.0|gre |
    |1111111|20151122045500| Yln|       2.0|gra |
    |1111112|20151122065832| Yun|       4.0|ddd |
    |1111113|20160101003221| Yan|       3.0|fdf |
    |1111111|20160703045231| Yin|       0.0|gre |
    |1111114|20150419134543| Yin|       0.0|fdf |
    |1111115|20151123174302| Yen|       1.0|ddd |
    |2111115|      20123192| Yen|       1.0|gre |
    +-------+--------------+----+----------+----+

How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a StringIndexer for each column?
**EDIT**: This is not a dupe because I need to do this programmatically for several data frames with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numerical.
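
To make the "programmatically, for different column names" requirement concrete, here is a minimal sketch of how the non-numerical columns might be picked out of any dataframe. It assumes the input is the list of `(name, type)` pairs that `DataFrame.dtypes` returns; the helper name `string_columns` and the example schema are hypothetical, since the actual column types are not shown above.

```python
def string_columns(dtypes, exclude=()):
    """Return the names of columns whose Spark type is 'string'.

    dtypes  -- list of (column_name, type_name) pairs, as given by DataFrame.dtypes
    exclude -- column names to skip even if they are strings
    """
    return [name for name, dtype in dtypes
            if dtype == "string" and name not in exclude]


# Assumed schema for illustration only; with a real dataframe, pass df.dtypes.
example_dtypes = [
    ("address", "bigint"),
    ("date", "bigint"),
    ("name", "string"),
    ("food", "string"),
]

print(string_columns(example_dtypes))  # ['name', 'food']
```

The resulting list could then drive the creation of one StringIndexer per column, whatever the columns happen to be called in each dataframe.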
**EDIT 2**: A tentative solution is

    indexers = [
        StringIndexer(inputCol=column, outputCol=column + "_index").fit(df).transform(df)
        for column in df.columns
    ]

where I now create a list with three dataframes, each identical to the original plus one transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.
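
One possible way to avoid the joins, sketched here under assumptions rather than offered as a confirmed solution, is to keep the indexers unfitted and chain them as stages of a single `pyspark.ml.Pipeline`, followed by a VectorAssembler over the indexed columns. The helper names `indexed_name` and `build_indexing_pipeline` and the `features` output column are hypothetical; the column list is passed in explicitly so the same function can be reused across dataframes.

```python
def indexed_name(column):
    """Output column name for an indexed column (same column + "_index" naming as above)."""
    return column + "_index"


def build_indexing_pipeline(columns, features_col="features"):
    """Build one Pipeline: a StringIndexer per column, then a VectorAssembler.

    Requires an active SparkSession when called.
    """
    # Imported here so that indexed_name stays usable without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # One unfitted indexer per column; the Pipeline fits them in order.
    indexers = [
        StringIndexer(inputCol=c, outputCol=indexed_name(c)) for c in columns
    ]
    # Collect all indexed columns into a single feature vector.
    assembler = VectorAssembler(
        inputCols=[indexed_name(c) for c in columns],
        outputCol=features_col,
    )
    return Pipeline(stages=indexers + [assembler])


# Usage, assuming the dataframe `df` from the question:
# model = build_indexing_pipeline(["name", "food"]).fit(df)
# df_features = model.transform(df)
```

With this shape, `fit` and `transform` each run once over the dataframe and every indexed column ends up in the same result, so no list of intermediate dataframes has to be joined.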