Applying a function in each row of a big PySpark dataframe?


I have a big dataframe (~30M rows). I have a function f. The job of f is to run through each row, check some logic and feed the outputs into a dictionary.

2 Answers
  • 2021-02-14 04:44

    Can you try something like below and let us know if it works for you?

    from pyspark.sql.functions import udf, struct
    from pyspark.sql.types import StringType, MapType
    
    #sample data
    df = sc.parallelize([
        ['a', 'b'],
        ['c', 'd'],
        ['e', 'f']
    ]).toDF(('col1', 'col2'))
    
    #add logic to create dictionary element using rows of the dataframe    
    def add_to_dict(l):
        d = {}
        d[l[0]] = l[1]
        return d
    add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))
    #struct is used to pass rows of dataframe
    df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
    df.show()
    
    #list of dictionary elements
    dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
    print(dictionary_list)
    

    Output is:

    [{'a': 'b'}, {'c': 'd'}, {'e': 'f'}]
    

    Hope this helps!
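
    A note on the last step: collect() brings every row back to the driver, which can be a problem at ~30M rows. One possible variation (assuming the same df as above) is to stream the results one partition at a time instead of collecting everything at once:

    #toLocalIterator() fetches one partition at a time; all rows still
    #pass through the driver, just not simultaneously
    dictionary_list = [row[0] for row in df.select('dictionary_item').toLocalIterator()]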

  • 2021-02-14 04:56

    By using collect, you pull all the data out of the Spark executors into your driver. You really should avoid this, as it makes using Spark pointless (you could just use plain Python in that case).

    What you could do instead:

    • reimplement your logic using functions already available: pyspark.sql.functions doc (see the sketch after this list)

    • if you cannot do the first because some functionality is missing, you can define a User Defined Function (UDF), as the other answer shows
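
    For the first option, the per-row dictionary built in the other answer can be expressed with built-in functions only, so the work stays on the executors instead of going through a Python UDF. A minimal sketch, assuming a SparkSession available as spark and the same two-column layout:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    #same sample data as in the other answer
    df = spark.createDataFrame([('a', 'b'), ('c', 'd'), ('e', 'f')], ['col1', 'col2'])

    #create_map builds the key/value pairs natively, no Python UDF required
    df = df.withColumn('dictionary_item', F.create_map(F.col('col1'), F.col('col2')))
    df.show()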
