I have to add a column to a PySpark dataframe based on a list of values.
a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")], ["Animal", "Enemy"])
rating = [5, 4, 1]
What you are trying to do does not work, because the rating list is in your driver's memory, whereas the a dataframe is distributed across the executors' memory (a udf runs on the executors too), so there is no positional correspondence between the list and the rows.
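To make that concrete, here is a minimal sketch (an illustration only, not the fix proposed below, and the names ratings_map, lookup and with_rating are just placeholders): the only way a plain list can reach the executors is keyed by Animal, for example as a broadcast dict that a udf looks up per row; matching by position against rating is simply not possible:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Illustrative lookup keyed by Animal, shipped to every executor via broadcast.
ratings_map = spark.sparkContext.broadcast({'Dog': 5, 'Cat': 4, 'Mouse': 1})
lookup = F.udf(lambda animal: ratings_map.value.get(animal), IntegerType())
with_rating = a.withColumn('Rating', lookup(F.col('Animal')))

The join below achieves the same result without a udf and lets Spark optimise it.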
What you need to do is add the keys to the ratings
list, like so:
ratings = [('Dog', 5), ('Cat', 4), ('Mouse', 1)]
Then you create a ratings dataframe from the list and join both on the Animal column to get the new column added:
ratings_df = spark.createDataFrame(ratings, ['Animal', 'Rating'])
new_df = a.join(ratings_df, 'Animal')  # inner join on the shared Animal key
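As a quick sanity check, the joined result should look like this (assuming the column names from the question above; the row order after the join may vary):

new_df.show()
# +------+-----+------+
# |Animal|Enemy|Rating|
# +------+-----+------+
# |   Dog|  Cat|     5|
# |   Cat|  Dog|     4|
# | Mouse|  Cat|     1|
# +------+-----+------+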