Join of two datasets in MapReduce/Hadoop

温柔的废话 2021-02-06 00:20

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's exactly what I need to do:

I am having t

2 Answers
  •  你的背包
    2021-02-06 00:33

    Use a mapper that outputs tiles as keys and points/lines as values. The point values and the line values must be distinguishable from each other. For instance, you can prefix each value with a special character (although a binary approach would be much better).

    So the map output will be something like:

     tile0, _point0
     tile1, _point0
     tile2, _point1 
     ...
     tileX, *lineL
     tileY, *lineK
     ...
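
    The tagging step above could be sketched roughly as follows. This is plain Java without the Hadoop API, so it runs on its own. The class name `TaggedMapSketch`, the `map` helper, and the assumption that the list of tiles for each record is already known are all hypothetical and not taken from the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class TaggedMapSketch {
    // Tag characters mirroring the '_' and '*' markers in the answer.
    public static final char POINT_TAG = '_';
    public static final char LINE_TAG = '*';

    // Emits one (tile, taggedValue) pair for every tile the record belongs to.
    // How the tiles of a record are computed is assumed to be known already.
    public static List<Map.Entry<String, String>> map(char tag, String recordId, List<String> tiles) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String tile : tiles) {
            // char + String is string concatenation, e.g. '_' + "point0" -> "_point0"
            out.add(new SimpleEntry<>(tile, tag + recordId));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        emitted.addAll(map(POINT_TAG, "point0", Arrays.asList("tile0", "tile1")));
        emitted.addAll(map(LINE_TAG, "lineL", Arrays.asList("tileX")));
        for (Map.Entry<String, String> pair : emitted) {
            System.out.println(pair.getKey() + ", " + pair.getValue());
        }
    }
}
```

    In a real job the same logic would live in a Hadoop `Mapper`, with one mapper (or one branch) per input dataset so that the tag can be chosen from the record's origin.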
    

    Then, at the reducer, your input will have this structure:

     tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
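
    To see where that structure comes from, here is a small local simulation of the grouping that happens between map and reduce. It is only an illustration in plain Java; `ShuffleSketch` and `groupByKey` are hypothetical names, and the framework does this step for you in a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Groups the tagged values by tile key, imitating what the shuffle phase
    // hands to the reducer. Each input element is a {tile, taggedValue} pair.
    public static Map<String, List<String>> groupByKey(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>(); // keys sorted, as in Hadoop
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
            new String[] {"tileX", "*lineK"},
            new String[] {"tileY", "_pointQ"},
            new String[] {"tileX", "_pointP"},
            new String[] {"tileX", "_pointR"});
        groupByKey(mapOutput).forEach(
            (tile, values) -> System.out.println(tile + ", " + values));
    }
}
```

    Note that Hadoop does not guarantee any particular order of the values within one key unless you set up a secondary sort, which is why points and lines arrive interleaved and must be separated in the reducer.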
    

    You then have to take the values, separate the points from the lines, compute their cross product, and output each pair of the cross product, like this:

    tileX (lineK, pointP)
    tileX (lineK, pointR)
    ...
    

    If you can already easily differentiate between the point values and the line values (depending on your application's specifications), you don't need the special characters (*, _).

    Regarding the cross product you have to do in the reducer: first iterate through the entire list of values and separate them into two lists:

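     List<String> points = new ArrayList<>();  // values tagged with '_' (String element type is an assumption)
     List<String> lines = new ArrayList<>();   // values tagged with '*'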
    

    Then compute the cross product using two nested for loops. Finally, iterate through the resulting list and, for each element, output:

    tile(current key), element_of_the_resulting_cross_product_list
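
    The reducer logic described above might look like the following sketch. It is plain Java rather than a Hadoop `Reducer`, and it assumes the '_' and '*' prefixes from the example; `CrossProductReduceSketch` and `reduce` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrossProductReduceSketch {
    // Splits the tagged values of one tile into points and lines, then pairs
    // every line with every point and formats one output row per pair.
    public static List<String> reduce(String tile, Iterable<String> values) {
        List<String> points = new ArrayList<>();
        List<String> lines = new ArrayList<>();
        for (String value : values) {
            if (value.isEmpty()) {
                continue; // nothing to classify
            }
            char tag = value.charAt(0);
            String payload = value.substring(1); // strip the tag character
            if (tag == '_') {
                points.add(payload);
            } else if (tag == '*') {
                lines.add(payload);
            }
        }
        List<String> out = new ArrayList<>();
        for (String line : lines) {        // outer loop: lines
            for (String point : points) {  // inner loop: points
                out.add(tile + " (" + line + ", " + point + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("*lineK", "_pointP", "*lineM", "_pointR");
        for (String row : reduce("tileX", values)) {
            System.out.println(row);
        }
    }
}
```

    Both lists are buffered in memory here, so this only works when the values of a single tile fit in the reducer's heap. In an actual reducer you could also write each pair to the output directly inside the inner loop instead of collecting the result list first.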
    
