Applying a function in each row of a big PySpark dataframe?


自闭症患者 2021-02-14 04:24

I have a big dataframe (~30M rows) and a function f. The job of f is to go through each row, check some logic, and feed the outputs into a di

2 answers

轻奢々 (original poster)

2021-02-14 04:44

Can you try something like the following and let us know if it works for you?

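from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType, MapType

# reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# sample data
df = spark.createDataFrame([
    ('a', 'b'),
    ('c', 'd'),
    ('e', 'f')
], ['col1', 'col2'])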

# build one dictionary per row of the dataframe
def add_to_dict(row):
    d = {}
    d[row[0]] = row[1]
    return d

add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))

# struct() packs all columns into a single value so the UDF receives the whole row
df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
df.show()

# list of dictionary elements (collect() brings every row to the driver)
dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
print(dictionary_list)

Output (Python 3):

[{'a': 'b'}, {'c': 'd'}, {'e': 'f'}]
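
If you want to check the per-row logic before running it on the cluster, you can exercise the same function on plain tuples. This is only a rough sketch: `sample_rows` is a hypothetical stand-in for the dataframe rows, and indexing a tuple is assumed to behave like indexing the row passed to the UDF.

```python
# Plain-Python check of the per-row logic; no Spark session is needed.
def add_to_dict(row):
    """Map the first field of a row to its second field."""
    return {row[0]: row[1]}

# Hypothetical stand-in for the rows of the sample dataframe.
sample_rows = [("a", "b"), ("c", "d"), ("e", "f")]

# Apply the function to each row, as the UDF does column-wise in Spark.
dictionary_list = [add_to_dict(row) for row in sample_rows]
print(dictionary_list)
```

Once the function behaves as expected here, it can be wrapped with udf() as in the answer.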

Hope this helps!
