Applying a function in each row of a big PySpark dataframe?

自闭症患者 2021-02-14 04:24

I have a big dataframe (~30M rows). I have a function f. The business of f is to run through each row, check some logic and feed the outputs into a dictionary. What is the recommended way to apply f to every row of such a large PySpark dataframe?
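
For illustration, the kind of per-row logic meant here might look like the following in plain Python (a toy sketch; the function name, the column names, and the condition checked are all hypothetical):

    # toy, hypothetical per-row logic: check a condition on the row and,
    # if it passes, record an output in a shared dictionary
    def f(row, results):
        if row["col2"] is not None:
            results[row["col1"]] = row["col2"]

    results = {}
    for row in [{"col1": "a", "col2": "b"}, {"col1": "c", "col2": None}]:
        f(row, results)
    print(results)  # {'a': 'b'}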

2 Answers

    轻奢々 2021-02-14 04:44

    Can you try something like below and let us know if it works for you?

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, struct
    from pyspark.sql.types import StringType, MapType

    spark = SparkSession.builder.getOrCreate()

    # sample data
    df = spark.createDataFrame([
        ['a', 'b'],
        ['c', 'd'],
        ['e', 'f']
    ], ('col1', 'col2'))

    # logic that builds one dictionary element from a row of the dataframe
    def add_to_dict(l):
        d = {}
        d[l[0]] = l[1]
        return d

    add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))

    # struct() packs the row's columns into a single argument for the UDF
    df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
    df.show()

    # collect the per-row dictionaries into a list on the driver
    dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
    print(dictionary_list)
    

    Output is:

    [{'a': 'b'}, {'c': 'd'}, {'e': 'f'}]
    

    Hope this helps!
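
    If the Python UDF overhead becomes noticeable on ~30M rows, the same idea can be expressed with built-in functions only; a minimal sketch (assuming, as above, just two string columns and that the driver can hold the collected list):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [('a', 'b'), ('c', 'd'), ('e', 'f')],
        ('col1', 'col2')
    )

    # create_map builds the MapType column from built-in expressions,
    # so no Python UDF (and no per-row serialization) is involved
    df = df.withColumn("dictionary_item", F.create_map(df.col1, df.col2))

    # toLocalIterator() streams partitions to the driver one at a time
    # instead of materializing everything at once like collect()
    dictionary_list = [row["dictionary_item"] for row in df.toLocalIterator()]
    print(dictionary_list)

    Either way, pulling ~30M small dictionaries onto the driver is the expensive step, so keeping the result as a DataFrame for any further distributed processing is usually preferable.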
