I am trying to run the FPGrowth algorithm in PySpark on my dataset.
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
Split the name column by comma for each row of your DataFrame, e.g.:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_comma(v):
    # v is a pandas Series of strings: drop the surrounding
    # brackets, then split each string on commas
    return v.str[1:-1].str.split(',')
df = df.withColumn('name', split_comma(df.name))
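To see what the UDF body does without starting Spark, here is a plain-pandas sketch of the same logic; the bracketed sample strings are made up for illustration:

```python
import pandas as pd

# Hypothetical sample of the 'name' column: strings wrapped in brackets
v = pd.Series(["[a,b]", "[c,d,e]"])

# Same logic as the UDF body: drop the surrounding brackets, split on commas
out = v.str[1:-1].str.split(",")

print(out.tolist())  # [['a', 'b'], ['c', 'd', 'e']]
```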
Better still, don't defer this: set name to the list when you build the RDD.
from pyspark.sql import Row

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
Based on your previous question, it seems you are building rd2 incorrectly.
Try this:
from pyspark.sql import Row

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
The change is that we call str.split(",") on x[0][1], so that a string like 'a,b' is converted to a list: ['a', 'b'].
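As a quick sanity check without Spark, here is the same mapping applied to a plain list, assuming each rd record is shaped ((actor, name_string), id); the sample values are hypothetical:

```python
# Hypothetical records shaped like the rd RDD: ((actor, name_string), id)
records = [(("Tom", "a,b,c"), "1"), (("Rita", "d,e"), "2")]

# Same mapping as rd2 above: reorder fields and split the name string
rd2 = [(x[1], x[0][0], x[0][1].split(",")) for x in records]

print(rd2[0])  # ('1', 'Tom', ['a', 'b', 'c'])
```

With name now a list per row, FPGrowth's itemsCol requirement (an array column) is satisfied.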