Question
I am using Spark 2.4.1 with Java 8 in my project.
I have a scenario where I need to look up another table/dataset that has two fields: country-name and country-code.
A separate streaming dataset has a country-code column, and I need to map each code to its country-name in the target/result dataframe.
As far as I know, this can be done either with a broadcast variable or with a broadcast join.
So from a performance point of view, which one is better here? What is the standard Spark way to handle this kind of use case?
Answer 1:
Quite honestly, they should perform similarly, since they are effectively doing the same thing: shipping a full copy of the small table to every executor so the lookup needs no shuffle.
There may be a very slight advantage to letting Spark do the broadcast join itself, but that likely depends on the size of your fact table and on the overhead of the broadcast variable.
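For comparison, the manual broadcast-variable route might look roughly like the sketch below. It assumes Spark 2.4.x on the classpath and a local master. The class name BroadcastVariableLookup, the UDF name countryName, the column names, and the three sample countries are all made up for illustration, and a small static dataset stands in for the stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class BroadcastVariableLookup {

    // Returns "code=name" strings, sorted by code, for a few sample events.
    static List<String> run(SparkSession spark) {
        // Hypothetical lookup data; in practice this would be collected from the country table.
        Map<String, String> countries = new HashMap<>();
        countries.put("FR", "France");
        countries.put("IN", "India");
        countries.put("US", "United States");

        // Ship one read-only copy of the map to every executor.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Broadcast<Map<String, String>> lookup = jsc.broadcast(countries);

        // The UDF reads the broadcast map; unknown codes yield null.
        spark.udf().register("countryName",
            (UDF1<String, String>) code -> lookup.value().get(code),
            DataTypes.StringType);

        // Static stand-in for the streaming dataset (assumed column name: country_code).
        Dataset<Row> events = spark
            .createDataset(Arrays.asList("US", "IN", "FR"), Encoders.STRING())
            .toDF("country_code");

        Dataset<Row> result = events
            .withColumn("country_name", callUDF("countryName", col("country_code")))
            .orderBy("country_code");

        return result.collectAsList().stream()
            .map(r -> r.getString(0) + "=" + r.getString(1))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("broadcast-variable-lookup")
            .getOrCreate();
        run(spark).forEach(System.out::println);
        spark.stop();
    }
}
```

Note that the UDF is opaque to the query optimizer, which is one plausible reason the built-in join can come out slightly ahead.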
One thing to note: the default broadcast threshold (spark.sql.autoBroadcastJoinThreshold) is only 10 MiB, so if your dimension table is larger than that, you'll want to explicitly use the broadcast() hint.
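A minimal sketch of the join route with the explicit hint, assuming Spark 2.4.x on the classpath and a local master. The class name BroadcastJoinHint, the column names, and the sample rows are hypothetical, and a static dataset again stands in for the stream. The threshold is set to its default value only to show where the setting lives.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.broadcast;

public class BroadcastJoinHint {

    // Returns "code=name" strings, sorted by code, for a few sample events.
    static List<String> run(SparkSession spark) {
        // Hypothetical dimension table with the two fields from the question.
        StructType schema = new StructType()
            .add("country_code", DataTypes.StringType)
            .add("country_name", DataTypes.StringType);
        Dataset<Row> countries = spark.createDataFrame(Arrays.asList(
            RowFactory.create("FR", "France"),
            RowFactory.create("IN", "India"),
            RowFactory.create("US", "United States")), schema);

        // Static stand-in for the streaming dataset.
        Dataset<Row> events = spark
            .createDataset(Arrays.asList("US", "IN", "FR"), Encoders.STRING())
            .toDF("country_code");

        // broadcast() marks the dimension table for a broadcast join
        // regardless of the size-based threshold. This is an inner join
        // on the shared country_code column.
        Dataset<Row> result = events
            .join(broadcast(countries), "country_code")
            .orderBy("country_code");

        return result.collectAsList().stream()
            .map(r -> r.getString(0) + "=" + r.getString(1))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("broadcast-join-hint")
            // 10 MiB in bytes, i.e. the default; raise it to let Spark
            // auto-broadcast larger tables without a hint.
            .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
            .getOrCreate();
        run(spark).forEach(System.out::println);
        spark.stop();
    }
}
```

Calling result.explain() should show a BroadcastHashJoin node in the physical plan when the hint is honored. A two-column country table is normally far below the threshold, so Spark would likely broadcast it even without the hint.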
Source: https://stackoverflow.com/questions/60728487/which-one-will-perform-better-broadcast-variable-or-broadcast-join