Join of two datasets in MapReduce/Hadoop

温柔的废话 2021-02-06 00:20

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's exactly what I need to do:

I am having t

2 answers
  •  臣服心动
    2021-02-06 00:49

    So basically you have two options here: a reduce-side join or a map-side join.

    With a reduce-side join, your group key is "tile". A single reducer call receives all the output from the point pairs and the line pairs for that tile, but you will have to cache one side (either the point pairs or the line pairs) in an array. If both sides are so large that neither fits in memory for a single group key (each unique tile), this method will not work for you. Remember that you don't have to hold both sides for a single group key ("tile") in memory; one is sufficient.
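
The reduce-side idea above can be sketched in plain Python, without Hadoop, to show the buffering logic. This is only an illustration under assumptions: the function names, the sample record shapes, and the trick of tagging point records with 0 so they sort ahead of line records within a tile (a secondary sort) are not from the question. It simply demonstrates that only one side needs to be held in memory per tile.

```python
from itertools import groupby
from operator import itemgetter


def map_points(records):
    # Emit (tile, tag) as the key; tag 0 makes points sort before lines.
    for tile, point in records:
        yield (tile, 0), point


def map_lines(records):
    # Tag 1 makes lines arrive after all points of the same tile.
    for tile, line in records:
        yield (tile, 1), line


def reduce_join(tile, tagged_values):
    # Buffer only the point side; stream the line side past the buffer.
    buffered_points = []
    for tag, value in tagged_values:
        if tag == 0:
            buffered_points.append(value)
        else:
            for point in buffered_points:
                yield tile, point, value


def run_reduce_side_join(points, lines):
    # Stand-in for the shuffle: sort by (tile, tag), then group by tile.
    shuffled = sorted(
        list(map_points(points)) + list(map_lines(lines)),
        key=itemgetter(0),
    )
    joined = []
    for tile, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        tagged = ((key[1], value) for key, value in group)
        joined.extend(reduce_join(tile, tagged))
    return joined


if __name__ == "__main__":
    # Hypothetical sample data: (tile, payload) pairs.
    sample_points = [("t1", "p1"), ("t1", "p2"), ("t2", "p3")]
    sample_lines = [("t1", "l1"), ("t3", "l9")]
    for row in run_reduce_side_join(sample_points, sample_lines):
        print(row)
```

In a real job the buffered side should be whichever dataset has fewer records per tile, since that is the side that must fit in reducer memory.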

    If both sides for a single group key are large, then you will have to try a map-side join. It has some particular requirements, though: both inputs need to be sorted by the join key and partitioned the same way, into the same number of partitions. You can fulfill those requirements by pre-processing your data with map/reduce jobs that run an equal number of reducers for both datasets.
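
As a rough illustration of that pre-processing plus the join itself, here is a plain-Python sketch (again without Hadoop). It assumes both datasets are partitioned with the same deterministic hash into the same number of partitions and sorted by tile inside each partition, so that partition i of one dataset only ever needs to meet partition i of the other. The names `partition_and_sort`, `merge_join`, and `map_side_join` are hypothetical, and in-memory lists stand in for the sorted files a real job would read.

```python
import zlib


def partition_and_sort(records, num_partitions):
    # Stand-in for a pre-processing job with num_partitions reducers:
    # the same hash and the same count must be used for both datasets.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # sort by join key only
    return partitions


def merge_join(left, right):
    # Classic sort-merge join over two lists sorted by key.
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    yield left_key, left_value, right_value
            i, j = i_end, j_end


def map_side_join(points, lines, num_partitions=4):
    point_parts = partition_and_sort(points, num_partitions)
    line_parts = partition_and_sort(lines, num_partitions)
    joined = []
    # Each pair of matching partitions could be handled by one mapper.
    for point_part, line_part in zip(point_parts, line_parts):
        joined.extend(merge_join(point_part, line_part))
    return joined
```

Because matching keys always land in the same partition index on both sides, no shuffle is needed at join time; the cost is paid up front in the pre-processing jobs.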
