PySpark - Adding a Column from a list of values using a UDF

臣服心动 2021-01-05 00:32

I have to add a column to a PySpark dataframe based on a list of values.

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat"


        
5 Answers
  •  孤街浪徒
    2021-01-05 01:16

    What you are trying to do does not work, because the ratings list is in the driver's memory, whereas the dataframe a is in the executors' memory (the UDF runs on the executors too).

    What you need to do is add the keys to the ratings list, like so:

    ratings = [('Dog', 5), ('Cat', 4), ('Mouse', 1)]
    

    Then create a ratings dataframe from the list and join the two to add the new column:

    ratings_df = spark.createDataFrame(ratings, ['Animal', 'Rating'])
    new_df = a.join(ratings_df, 'Animal')
    
