Converting a DataFrame to a dictionary in PySpark without using pandas


Question


Following up on this question and dataframes, I am trying to convert a DataFrame into a dictionary. In pandas I was using this:

dictionary = df_2.unstack().to_dict(orient='index')

However, I need to convert this code to PySpark. Can anyone help me with this? As I understand from previous questions such as this one, I would indeed need to use pandas, but the DataFrame is far too big for that. How can I solve this?

EDIT:

I have now tried the following approach:

# Convert each collected Row to a dict, then key the result by the 'age' column
dictionary_list = map(lambda row: row.asDict(), df_2.collect())
dictionary = {row['age']: row for row in dictionary_list}

(reference), but it does not yield what it is supposed to.

In pandas, what I was obtaining was the following:


Answer 1:


df2 is the DataFrame from the previous post. You can do a pivot first and then convert to a dictionary, as described in your linked post.
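For context, here is a minimal sketch reconstructing df2, inferred from the output shown further down; the actual data comes from the linked post, so treat this as an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical df2: (age, siblings, count) rows inferred from the result below
df2 = spark.createDataFrame([(10, 3, 1), (14, 1, 1), (15, 0, 1)], ['age', 'siblings', 'count'])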

import pyspark.sql.functions as F

# Pivot so each distinct 'siblings' value becomes its own column, keeping the first 'count'
df3 = df2.groupBy('age').pivot('siblings').agg(F.first('count'))

# Collect the pivoted rows as dicts and key them by 'age'
list_persons = [row.asDict() for row in df3.collect()]
dict_persons = {person['age']: person for person in list_persons}

{15: {'age': 15, '0': 1.0, '1': None, '3': None}, 10: {'age': 10, '0': None, '1': None, '3': 1.0}, 14: {'age': 14, '0': None, '1': 1.0, '3': None}}
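A quick sanity check on the resulting structure, with values taken from the dict above:

# Count for age 10 with 3 siblings; a missing combination comes back as None
print(dict_persons[10]['3'])  # 1.0
print(dict_persons[15]['1'])  # None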

Or another way:

# Fill nulls with NaN so to_json keeps every key, then pivot all ages into one row of structs
df4 = df3.fillna(float('nan')).groupBy().pivot('age').agg(F.first(F.struct(*df3.columns[1:])))

# Serialize the single row to JSON and evaluate the string into a Python dict
result_dict = eval(df4.select(F.to_json(F.struct(*df4.columns))).head()[0])

{'10': {'0': 'NaN', '1': 'NaN', '3': 1.0}, '14': {'0': 'NaN', '1': 1.0, '3': 'NaN'}, '15': {'0': 1.0, '1': 'NaN', '3': 'NaN'}}
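As a side note, parsing the JSON string with json.loads is a safer alternative to eval here; a minimal sketch, assuming the same df4 as above:

import json

# Same serialization as above, parsed instead of evaluated
json_str = df4.select(F.to_json(F.struct(*df4.columns))).head()[0]
result_dict = json.loads(json_str)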


Source: https://stackoverflow.com/questions/65717912/converting-dataframe-to-dictionary-in-pyspark-without-using-pandas
