Question
I am trying to optimize a join between two Spark DataFrames, call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it to the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".
In Spark, what is the difference between partitioning the data by column and bucketing the data by column?
For example:
partition:
df2 = df2.repartition(10, "SaleId")
bucket:
df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')
After each of these techniques I simply joined df2 with df1 (a sketch of that broadcast join is below).
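A minimal sketch of the join described above, assuming df1 and df2 are already loaded and using the broadcast hint from pyspark.sql.functions:

from pyspark.sql.functions import broadcast

# df1 is small, so ship a copy of it to every executor; the join
# itself then needs no shuffle of df2.
joined = df2.join(broadcast(df1), on="SaleId", how="inner")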
I can't figure out which of these is the right technique to use. Thank you.
Answer 1:
repartition is used within the same Spark job: it shuffles the data by the given column so that subsequent operations in that job can take advantage of the layout, but nothing is persisted.

bucketBy is for output, i.e. writing: the bucketing layout is stored with the table, which avoids the shuffle in the next Spark application, typically as part of ETL. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html which is an excellent, concise read. Note, though, that bucketed tables can currently only be read by Spark.
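To make the distinction concrete, a minimal sketch under the question's setup (the input paths are placeholders, and bucketed_table is the table name from the question's own example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()

# Placeholder paths, not from the question.
df1 = spark.read.parquet("/data/df1")   # small
df2 = spark.read.parquet("/data/df2")   # large (200M rows)

# repartition: the shuffle by "SaleId" happens here, inside this job,
# and the resulting layout is not persisted anywhere.
df2_repart = df2.repartition(10, "SaleId")
same_job_join = df2_repart.join(df1, "SaleId")

# bucketBy: written once (as in the question), the bucketing metadata
# is stored in the catalog with the table.
df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')

# A later Spark application can read the table back and join on
# "SaleId" without re-shuffling the bucketed side.
bucketed = spark.table("bucketed_table")
later_job_join = bucketed.join(df1, "SaleId")

In short: repartition helps only the job that runs it, while bucketBy pays the shuffle cost once at write time and amortizes it across later reads.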
Source: https://stackoverflow.com/questions/56857453/what-is-the-difference-between-partitioning-and-bucketing-in-spark