How to serialize PySpark GroupedData object?


Question


I am running groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can deserialize it later and resume from that point (running aggregations on top of it as needed).

df.groupBy("geo_city")
<pyspark.sql.group.GroupedData at 0x10503c5d0>

I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or to Parquet/Avro format (as the conversion operation is expensive). Is there some other efficient way to store the GroupedData object in a binary format for faster read/write? Perhaps some equivalent of pickle in Spark?


Answer 1:


There is none, because GroupedData is not really a thing. It doesn't perform any operations on the data at all. It only describes how the actual aggregation should proceed when you execute an action on the result of a subsequent agg.
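A minimal sketch of this laziness (assuming a SparkSession named spark and a DataFrame df with a "geo_city" column, as in the question):

grouped = df.groupBy("geo_city")       # just a GroupedData description, no computation happens
result = grouped.agg({"*": "count"})   # still lazy: a DataFrame with a plan, no job has run yet
result.show()                          # only this action actually executes the aggregation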

You could probably serialize the underlying JVM object and restore it later, but it is a waste of time. Since groupBy only describes what has to be done, the cost of recreating the GroupedData object from scratch should be negligible.
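In other words, the usual pattern is to persist the DataFrame itself (the expensive part) and simply call groupBy again later. A rough sketch, with an illustrative output path:

df.write.mode("overwrite").parquet("/tmp/geo_data")   # save the underlying data once

df2 = spark.read.parquet("/tmp/geo_data")             # later (even in a new session): reload it
grouped = df2.groupBy("geo_city")                     # recreating GroupedData is essentially free
grouped.count().show()                                # run whatever aggregation you need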



Source: https://stackoverflow.com/questions/38600908/how-to-serialize-pyspark-groupeddata-object
