Convert StringType to ArrayType in PySpark

一生所求 2021-01-22 11:02

I am trying to run the FPGrowth algorithm in PySpark on my dataset.

from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5,mi


        
2 Answers
  • 2021-01-22 11:33

    Split the string in the name column on commas for each row of your DataFrame, e.g.:

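    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # The return type must be a valid Spark type; 'list' is not one.
    @pandas_udf('array<string>')
    def split_comma(v: pd.Series) -> pd.Series:
        # v is a pandas Series, so use the .str accessor: drop the first and
        # last character of each string (e.g. surrounding brackets), then split.
        return v.str[1:-1].str.split(',')

    # withColumn returns a new DataFrame, so keep the result.
    df = df.withColumn('name', split_comma(df.name))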
    

    Or better, don't defer this: set name to a list when you build the rows.

    from pyspark.sql import Row

    # Split the comma-separated string into a list while mapping the RDD.
    rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
    rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
    
  • 2021-01-22 11:59

    Based on your previous question, it seems you are building rd2 incorrectly.

    Try this:

    from pyspark.sql import Row

    rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))
    rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
    

    The change is that we call str.split(",") on x[0][1] so that it will convert a string like 'a,b' to a list: ['a', 'b'].
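To see what the first lambda does without starting Spark, you can apply it to a single plain Python tuple. The record below is made up and only assumes the `((actor, names_string), id)` shape implied by the indexing in the lambda; `to_fields` is a hypothetical name for that lambda.

```python
# Hypothetical record with the shape ((actor, names_string), id).
record = (("some_actor", "a,b"), "1")

def to_fields(x):
    # Same body as the lambda passed to rd.map: reorder the fields and
    # split the comma-separated string into a list.
    return (x[1], x[0][0], x[0][1].split(","))

fields = to_fields(record)
print(fields)  # ('1', 'some_actor', ['a', 'b'])
```

The third element is now a Python list, so the resulting Row field becomes an array column rather than a string column.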
