Question
Suppose I have this DataFrame (df):
user  food         affinity
'u1'  'pizza'      5
'u1'  'broccoli'   3
'u1'  'ice cream'  4
'u2'  'pizza'      1
'u2'  'broccoli'   3
'u2'  'ice cream'  1
Namely, each user has a certain (computed) affinity to a series of foods; the DataFrame is built from several sources. What I need to do is create a JSON file for each user, with their affinities. For instance, for user 'u1', I want a file containing:
[
  {"food": "pizza", "affinity": 5},
  {"food": "broccoli", "affinity": 3},
  {"food": "ice cream", "affinity": 4}
]
This would entail splitting the DataFrame by user, and I cannot think of a way to do it, since writing a JSON file for the full DataFrame would be achieved with
df.write.json(<path_to_file>)
Answer 1:
You can use partitionBy (it will give you a single directory, and possibly multiple files, per user):
df.write.partitionBy("user").json(<path_to_file>)
or repartition combined with partitionBy (it will give you a single directory and a single file per user):

from pyspark.sql.functions import col

df.repartition(col("user")).write.partitionBy("user").json(<path_to_file>)
Unfortunately, none of the above will give you a JSON array: Spark's JSON writer emits one JSON object per line (JSON Lines), in directories like <path_to_file>/user=u1/.
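To illustrate the difference, here is what the contents of one of Spark's output part files look like (JSON Lines, one object per line) versus the single array the question asks for, sketched in plain Python with no Spark involved:

```python
import json

records = [{"food": "pizza", "affinity": 5}, {"food": "broccoli", "affinity": 3}]

# What Spark's JSON writer emits: one JSON object per line (JSON Lines).
json_lines = "\n".join(json.dumps(r) for r in records)

# What the question asks for: a single JSON array.
json_array = json.dumps(records)
```

Each line of `json_lines` parses individually, but the file as a whole is not valid JSON; `json_array` is.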
If you use Spark 2.0, you can first collect the per-user records into a list of structs with collect_list:

from pyspark.sql.functions import col, collect_list, struct

df.groupBy(col("user")).agg(
    collect_list(struct(col("food"), col("affinity"))).alias("affinities")
)

and partitionBy on write as before.
Prior to 2.0 you'll have to use the RDD API, and the details are language-specific.
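If the data fits on one machine, the grouping itself is simple enough to sketch in plain Python (stdlib only, no Spark); the output directory and the u1.json / u2.json file names below are assumptions for illustration:

```python
import json
import os
from collections import defaultdict

# Sample rows matching the DataFrame in the question.
rows = [
    ("u1", "pizza", 5), ("u1", "broccoli", 3), ("u1", "ice cream", 4),
    ("u2", "pizza", 1), ("u2", "broccoli", 3), ("u2", "ice cream", 1),
]

def write_user_json(rows, out_dir):
    """Group rows by user and write one JSON array file per user."""
    by_user = defaultdict(list)
    for user, food, affinity in rows:
        by_user[user].append({"food": food, "affinity": affinity})
    os.makedirs(out_dir, exist_ok=True)
    for user, affinities in by_user.items():
        with open(os.path.join(out_dir, user + ".json"), "w") as f:
            json.dump(affinities, f)
    return by_user

groups = write_user_json(rows, "user_json")
```

Each resulting file (e.g. user_json/u1.json) holds exactly the JSON array the question describes.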
Source: https://stackoverflow.com/questions/40725884/spark-write-json-several-files-from-dataframe-based-on-separation-by-column-val