Convert StringType to ArrayType in PySpark

一生所求 2021-01-22 11:02

I am trying to run the FPGrowth algorithm in PySpark on my dataset.

from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)

The name column is a comma-separated string, but FPGrowth expects itemsCol to be an array type. How do I convert it?

2 Answers
  •  被撕碎了的回忆  2021-01-22 11:33

    Split each row of the name column of your dataframe on commas, e.g. with a pandas UDF:

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # The return type must be a Spark SQL array of strings; 'list' is not a valid type
    @pandas_udf('array<string>', PandasUDFType.SCALAR)
    def split_comma(v):
        # Strip the surrounding brackets from each string, then split on commas
        return v.str[1:-1].str.split(',')

    df = df.withColumn('name', split_comma(df.name))
    
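    If the values really are bracket-wrapped, comma-separated strings, the same conversion can be done with Spark's built-in functions instead of a pandas UDF. A sketch, assuming that exact string format:

    from pyspark.sql.functions import split, regexp_replace

    # Drop the surrounding brackets, then split on commas to get an array<string> column
    df = df.withColumn('name', split(regexp_replace('name', r'^\[|\]$', ''), ','))
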

    Or better, don't defer the conversion: build name as a list when you construct the rows from the RDD.

    from pyspark.sql import Row

    # Split the comma-separated string while mapping the RDD, so name is already a list
    rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
    rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
    
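    Either way, once name is an array<string> column, the FPGrowth call from the question should work. A minimal sketch, assuming rd3 from above and an existing SparkSession named spark:

    from pyspark.ml.fpm import FPGrowth

    # Build a DataFrame from the Row RDD; name is a Python list, so it becomes array<string>
    df = spark.createDataFrame(rd3)

    fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)
    model.freqItemsets.show()
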
