Join vs COGROUP in PIG

前端 未结 1 863
你的背包
你的背包 2021-02-19 19:45

Are there any advantages (wrt performance / no of map reduces ) when i use COGROUP instead of JOIN in pig ?

http://developer.yahoo.com/hadoop/tutorial/module6.html tal

相关标签:
1条回答
  • 2021-02-19 20:33

    There are no major performance differences. The reason I say this is they both end up being a single MapReduce job that send the same data forward to the reducers. Both need to send all of the records forward with the key being the foreign key. If at all, the COGROUP might be a bit faster because it does not do the cartesian product across the hits and keeps them in separate bags.

    If one of your data sets is small, you can use a join option called "replicated join". This will distribute the second data set across all map tasks and load it into main memory. This way, it can do the entire join in the mapper and not need a reducer. In my experience, this is very worth it because the bottleneck in joins and cogroups is the shuffling of the entire data set to the reducer. You can't do this with COGROUP, to my knowledge.

    0 讨论(0)
提交回复
热议问题