Spark: write several JSON files from a DataFrame, split by column value

Posted by 北慕城南 on 2021-02-08 06:24:35

Question


Suppose I have this DataFrame (df):

user    food        affinity
'u1'    'pizza'       5 
'u1'    'broccoli'    3
'u1'    'ice cream'   4
'u2'    'pizza'       1
'u2'    'broccoli'    3
'u2'    'ice cream'   1

Namely, each user has a certain (computed) affinity to a series of foods; the DataFrame is built from several sources. What I need to do is create a JSON file for each user with their affinities. For instance, for user 'u1', I want a file containing

[
    {"food": "pizza", "affinity": 5},
    {"food": "broccoli", "affinity": 3},
    {"food": "ice cream", "affinity": 4}
]

This would entail splitting the DataFrame by user, and I cannot think of a way to do this, since writing a JSON file for the full DataFrame is done with

df.write.json(<path_to_file>)
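For reference, the desired per-user output can be sketched outside Spark in plain Python. The snippet below is illustrative only (the `rows` sample data and the `<user>.json` filenames are assumptions, not part of the question): it groups the rows by user and writes one JSON array file per user.

```python
import json
from collections import defaultdict

# Illustrative sample data mirroring the DataFrame above
rows = [
    ("u1", "pizza", 5), ("u1", "broccoli", 3), ("u1", "ice cream", 4),
    ("u2", "pizza", 1), ("u2", "broccoli", 3), ("u2", "ice cream", 1),
]

# Group the (food, affinity) pairs by user
by_user = defaultdict(list)
for user, food, affinity in rows:
    by_user[user].append({"food": food, "affinity": affinity})

# Write one JSON array file per user, e.g. u1.json, u2.json
for user, affinities in by_user.items():
    with open(f"{user}.json", "w") as f:
        json.dump(affinities, f)
```

The question is how to get Spark itself to produce this layout at scale.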

Answer 1:


You can use partitionBy (it will give you one directory per user, possibly containing multiple files):

df.write.partitionBy("user").json(<path_to_file>)

or repartition and partitionBy (it will give you a single directory and a single file per user):

from pyspark.sql.functions import col

df.repartition(col("user")).write.partitionBy("user").json(<path_to_file>)

Unfortunately, none of the above will give you a JSON array: Spark's JSON writer emits one JSON object per line (JSON Lines), not a bracketed array.

If you use Spark 2.0 you can try collect_list first:

from pyspark.sql.functions import col, collect_list, struct

df.groupBy(col("user")).agg(
  collect_list(struct(col("food"), col("affinity"))).alias("affinities")
)

and partitionBy on write as before.
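To illustrate the shape this produces without a running Spark cluster: each aggregated row is written as one JSON Lines record containing the nested array. The sketch below hard-codes one such row in plain Python (the data is the sample from the question, not Spark output):

```python
import json

# One aggregated row per user, as produced by the groupBy/collect_list above
aggregated = {
    "user": "u1",
    "affinities": [
        {"food": "pizza", "affinity": 5},
        {"food": "broccoli", "affinity": 3},
        {"food": "ice cream", "affinity": 4},
    ],
}

# df.write.json emits one JSON object per line (JSON Lines), not a bare array
line = json.dumps(aggregated)
print(line)
```

So the per-user file holds a single JSON object whose "affinities" field is the array, rather than a top-level array as in the question.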

Prior to 2.0 you'll have to use the RDD API, and the approach is language specific.



Source: https://stackoverflow.com/questions/40725884/spark-write-json-several-files-from-dataframe-based-on-separation-by-column-val
