Question
Following up on this question about dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this:
dictionary = df_2.unstack().to_dict(orient='index')
However, I need to convert this code to PySpark. Can anyone help me with this? As I understand from previous questions such as this one, I would indeed need to use pandas, but the dataframe is far too big for that. How can I solve this?
EDIT:
I have now tried the following approach:
dictionary_list = map(lambda row: row.asDict(), df_2.collect())
dictionary = {age['age']: age for age in dictionary_list}
(reference), but it does not yield what it is supposed to.
In pandas, what I was obtaining was the following:
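For illustration, here is a minimal pandas sketch of the kind of nested dictionary that call produces; the ages and sibling columns are assumptions based on the answer below, not the original output, and the frame is assumed to be already pivoted so the unstack() step is skipped:

import pandas as pd
# Hypothetical pivoted frame: rows keyed by age, one column per siblings value
df_2 = pd.DataFrame(
    {'0': [1.0, None, None], '1': [None, None, 1.0], '3': [None, 1.0, None]},
    index=pd.Index([15, 10, 14], name='age'),
)
dictionary = df_2.to_dict(orient='index')
# {15: {'0': 1.0, '1': nan, '3': nan}, 10: {'0': nan, '1': nan, '3': 1.0}, 14: {'0': nan, '1': 1.0, '3': nan}}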
Answer 1:
df2 is the dataframe from the previous post. Your collect-based attempt cannot work as written because each collected row is still a flat (age, siblings, count) record, so keying by age never produces the nested shape. You can do a pivot first, and then convert to a dictionary as described in your linked post.
import pyspark.sql.functions as F
# Pivot so each distinct siblings value becomes its own column, keyed by age
df3 = df2.groupBy('age').pivot('siblings').agg(F.first('count'))
# Collect the pivoted rows to the driver and build the nested dictionary
list_persons = [row.asDict() for row in df3.collect()]
dict_persons = {person['age']: person for person in list_persons}
{15: {'age': 15, '0': 1.0, '1': None, '3': None}, 10: {'age': 10, '0': None, '1': None, '3': 1.0}, 14: {'age': 14, '0': None, '1': 1.0, '3': None}}
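If you would rather not repeat the age key inside each inner dictionary, a small variation of the same collect works (a sketch, reusing df3 from above):

dict_persons = {
    row['age']: {k: v for k, v in row.asDict().items() if k != 'age'}
    for row in df3.collect()
}
# {15: {'0': 1.0, '1': None, '3': None}, 10: {'0': None, '1': None, '3': 1.0}, ...}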
Or another way:
# Replace nulls with NaN and pivot ages into columns of structs, leaving a single row
df4 = df3.fillna(float('nan')).groupBy().pivot('age').agg(F.first(F.struct(*df3.columns[1:])))
# Serialize that single row to JSON and evaluate it into a Python dict
result_dict = eval(df4.select(F.to_json(F.struct(*df4.columns))).head()[0])
{'10': {'0': 'NaN', '1': 'NaN', '3': 1.0}, '14': {'0': 'NaN', '1': 1.0, '3': 'NaN'}, '15': {'0': 1.0, '1': 'NaN', '3': 'NaN'}}
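Since the string produced by to_json here is valid JSON (the NaN values come through as the string 'NaN', as the output above shows), json.loads should work as a safer drop-in for eval; a sketch, assuming the same df4:

import json
# Parse the one-row JSON string on the driver instead of eval-ing it
json_str = df4.select(F.to_json(F.struct(*df4.columns))).head()[0]
result_dict = json.loads(json_str)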
Source: https://stackoverflow.com/questions/65717912/converting-dataframe-to-dictionary-in-pyspark-without-using-pandas