Which one will perform better, broadcast variable or broadcast join?

Question

I am using Spark 2.4.1 with Java 8 in my project.

I have a scenario where I need to look up another table/dataset that has two fields, country-name and country-code.

A separate stream of data has a country-code column, and I need to map each code to its country-name in the target/result dataframe.

As far as I know, there are two ways to achieve this: a join, or a broadcast variable used for the lookup.

From a performance point of view, which one is better here? What is the standard Spark way to handle this kind of use case?

Answer 1:

Quite honestly they should perform similarly, since they are effectively doing the same thing.

There may be a very slight advantage to letting Spark do the broadcast join natively, but it likely depends on the size of your fact table and on how much overhead the broadcast variable adds.
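To make the comparison concrete, here is a minimal sketch of the two options in Java. The class name CountryLookup, the column names country_code and country_name, and the UDF name countryName are assumptions made for illustration and do not come from the question. It uses plain batch datasets to keep things short.

```java
import static org.apache.spark.sql.functions.broadcast;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class CountryLookup {

    // Option 1: broadcast join. The hint asks Spark to ship the small
    // lookup table to every executor and join without shuffling the events.
    // This is an inner join, so events with an unknown code are dropped.
    static Dataset<Row> withBroadcastJoin(Dataset<Row> events, Dataset<Row> countries) {
        return events.join(broadcast(countries), "country_code");
    }

    // Option 2: broadcast variable. The lookup table is collected to the
    // driver, turned into a map, broadcast by hand, and read inside a UDF.
    // Events with an unknown code get a null country_name.
    static Dataset<Row> withBroadcastVariable(
            SparkSession spark, Dataset<Row> events, Dataset<Row> countries) {
        Map<String, String> codeToName = new HashMap<>();
        for (Row r : countries.collectAsList()) {
            String code = r.getAs("country_code");
            String name = r.getAs("country_name");
            codeToName.put(code, name);
        }

        Broadcast<Map<String, String>> lookup =
                JavaSparkContext.fromSparkContext(spark.sparkContext()).broadcast(codeToName);

        // "countryName" is a hypothetical UDF name chosen for this sketch.
        spark.udf().register(
                "countryName",
                (UDF1<String, String>) code -> lookup.value().get(code),
                DataTypes.StringType);

        return events.withColumn("country_name", callUDF("countryName", col("country_code")));
    }
}
```

Both methods add a country_name column to the events. The join version stays inside the query optimizer, while the UDF version is opaque to it, which is one plausible reason the native join could come out slightly ahead.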

One thing to note: the default automatic broadcast threshold (spark.sql.autoBroadcastJoinThreshold) is only 10 MiB, so if your dimension table is larger than that, Spark will not broadcast it on its own and you'll want to use the broadcast() hint explicitly.
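The threshold and the hint can be sketched as follows. The class and method names are hypothetical, and the 50 MiB figure is an arbitrary example rather than a recommendation. The last helper inspects the physical plan text, which is a quick way to check whether a broadcast hash join was actually selected.

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastThreshold {

    // Raise the automatic broadcast limit for this session (value in bytes).
    // 50 MiB is only an illustrative number.
    static void raiseThreshold(SparkSession spark) {
        spark.conf().set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024);
    }

    // Ask for a broadcast of the small side explicitly, instead of relying
    // on Spark's size estimate staying under the threshold.
    static Dataset<Row> hintedJoin(Dataset<Row> facts, Dataset<Row> dim, String key) {
        return facts.join(broadcast(dim), key);
    }

    // Returns true if the physical plan text mentions a broadcast hash join.
    static boolean usesBroadcastHashJoin(Dataset<Row> joined) {
        return joined.queryExecution().executedPlan().toString().contains("BroadcastHashJoin");
    }
}
```

Calling joined.explain() prints the same plan to the console if you prefer to read it directly.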

Source: https://stackoverflow.com/questions/60728487/which-one-will-perform-better-broadcast-variable-or-broadcast-join

Tags

dataframe

apache-spark

join

apache-spark-sql

broadcast
