Question
Following up on this question about dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this:
dictionary = df_2.unstack().to_dict(orient='index')
However, I need to convert this code to PySpark. Can anyone help me with this? As I understand from previous questions such as this one, I would indeed need to use pandas, but the dataframe is far too big for that. How can I solve this?
EDIT:
I have now tried the following approach:
dictionary_list = map(lambda row: row.asDict(), df_2.collect())
dictionary = {age['age']: age for age in dictionary_list}
(reference), but it does not yield what it is supposed to.
In pandas, what I was obtaining was the following:
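For illustration, here is a minimal pandas sketch of the kind of nested dictionary that call produces; the ages and sibling columns are assumptions based on the answer below, not the original output, and the frame is assumed to be already pivoted so the unstack() step is skipped:

import pandas as pd
# Hypothetical pivoted frame: rows keyed by age, one column per siblings value
df_2 = pd.DataFrame(
    {'0': [1.0, None, None], '1': [None, None, 1.0], '3': [None, 1.0, None]},
    index=pd.Index([15, 10, 14], name='age'),
)
dictionary = df_2.to_dict(orient='index')
# {15: {'0': 1.0, '1': nan, '3': nan}, 10: {'0': nan, '1': nan, '3': 1.0}, 14: {'0': nan, '1': 1.0, '3': nan}}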
Answer 1:
df2 is the dataframe from the previous post. Your collect-based attempt cannot work as written because each collected row is still a flat (age, siblings, count) record, so keying by age never produces the nested shape. You can do a pivot first, and then convert to a dictionary as described in your linked post.
import pyspark.sql.functions as F
# Pivot so each distinct siblings value becomes its own column, keyed by age
df3 = df2.groupBy('age').pivot('siblings').agg(F.first('count'))
# Collect the pivoted rows to the driver and build the nested dictionary
list_persons = [row.asDict() for row in df3.collect()]
dict_persons = {person['age']: person for person in list_persons}
{15: {'age': 15, '0': 1.0, '1': None, '3': None}, 10: {'age': 10, '0': None, '1': None, '3': 1.0}, 14: {'age': 14, '0': None, '1': 1.0, '3': None}}
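If you would rather not repeat the age key inside each inner dictionary, a small variation of the same collect works (a sketch, reusing df3 from above):

dict_persons = {
    row['age']: {k: v for k, v in row.asDict().items() if k != 'age'}
    for row in df3.collect()
}
# {15: {'0': 1.0, '1': None, '3': None}, 10: {'0': None, '1': None, '3': 1.0}, ...}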
Or another way:
# Replace nulls with NaN and pivot ages into columns of structs, leaving a single row
df4 = df3.fillna(float('nan')).groupBy().pivot('age').agg(F.first(F.struct(*df3.columns[1:])))
# Serialize that single row to JSON and evaluate it into a Python dict
result_dict = eval(df4.select(F.to_json(F.struct(*df4.columns))).head()[0])
{'10': {'0': 'NaN', '1': 'NaN', '3': 1.0}, '14': {'0': 'NaN', '1': 1.0, '3': 'NaN'}, '15': {'0': 1.0, '1': 'NaN', '3': 'NaN'}}
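Since the string produced by to_json here is valid JSON (the NaN values come through as the string 'NaN', as the output above shows), json.loads should work as a safer drop-in for eval; a sketch, assuming the same df4:

import json
# Parse the one-row JSON string on the driver instead of eval-ing it
json_str = df4.select(F.to_json(F.struct(*df4.columns))).head()[0]
result_dict = json.loads(json_str)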
Source: https://stackoverflow.com/questions/65717912/converting-dataframe-to-dictionary-in-pyspark-without-using-pandas