Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here's exactly what I need to do:
I am having t
Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values. For instance, you can prefix each value with a special character (although a binary tag would be much better).
So the map output will be something like:
tile0, _point0
tile1, _point0
tile2, _point1
...
tileX, *lineL
tileY, *lineK
...
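As a rough illustration of the tagging step, here is a minimal sketch in plain Java with no Hadoop dependency. The class name `TaggedMapSketch` and the method `mapRecord` are made up for this example, and the list of tiles a record belongs to is assumed to come from your own application logic. In a real job, each returned string would instead be one `context.write(key, value)` call inside your Mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the map side of the join (not a real Hadoop Mapper).
class TaggedMapSketch {
    static final char POINT_TAG = '_';
    static final char LINE_TAG = '*';

    // Returns the "key, value" pairs that a mapper would emit for one record.
    // 'tiles' is assumed to be computed elsewhere (application-specific).
    static List<String> mapRecord(String record, boolean isPoint, List<String> tiles) {
        char tag = isPoint ? POINT_TAG : LINE_TAG;
        List<String> emitted = new ArrayList<>();
        for (String tile : tiles) {
            // One output pair per tile the record falls into.
            emitted.add(tile + ", " + tag + record);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // point0 falls into two tiles, so it is emitted once per tile.
        for (String pair : mapRecord("point0", true, Arrays.asList("tile0", "tile1"))) {
            System.out.println(pair);
        }
        for (String pair : mapRecord("lineL", false, Arrays.asList("tileX"))) {
            System.out.println(pair);
        }
    }
}
```

If points and lines come from separate input files, another option is to use a different mapper per input, so each mapper always knows which tag to apply.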
Then, at the reducer, your input will have this structure:
tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
and you will have to separate the values into points and lines, compute their cross product, and output each pair of the cross product, like this:
tileX (lineK, pointP)
tileX (lineK, pointR)
...
If you can already easily tell the point values from the line values (depending on your application's specifications), you don't need the special characters (*, _).
Regarding the cross product you have to do in the reducer: first iterate through the entire list of values and separate them into two lists:
List points;
List lines;
Then compute the cross product with two nested for loops. Finally, iterate through the resulting list and, for each element, output:
tile(current key), element_of_the_resulting_cross_product_list
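The reducer logic described above could look roughly like this. It is a minimal sketch in plain Java, not an actual Hadoop Reducer: the class name `TileJoinReduceSketch` and its `reduce` method are invented for the example, and the values are assumed to be strings tagged with `_` (point) or `*` (line) as in the map output shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reduce side of the join (not a real Hadoop Reducer).
class TileJoinReduceSketch {

    // Receives all tagged values for one tile, in no particular order,
    // and returns one output line per (line, point) pair.
    static List<String> reduce(String tile, Iterable<String> values) {
        List<String> points = new ArrayList<>();
        List<String> lines = new ArrayList<>();

        // Pass 1: split the values into two lists based on the tag character.
        for (String value : values) {
            if (value.isEmpty()) {
                continue;
            }
            String payload = value.substring(1); // strip the tag
            if (value.charAt(0) == '_') {
                points.add(payload);
            } else if (value.charAt(0) == '*') {
                lines.add(payload);
            }
            // Values with an unknown tag are silently ignored in this sketch.
        }

        // Pass 2: cross product with two nested loops.
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            for (String point : points) {
                output.add(tile + " (" + line + ", " + point + ")");
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("*lineK", "_pointP", "*lineM", "_pointR");
        for (String row : reduce("tileX", values)) {
            System.out.println(row);
        }
    }
}
```

Note that both lists are held in memory, so this only works if the number of points and lines per tile is reasonably small. Also, in an actual Hadoop reducer the framework reuses the value object while iterating, so you would need to copy each value (for example with `toString()`) before adding it to a list.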