Join of two datasets in Mapreduce/Hadoop


温柔的废话 2021-02-06 00:20

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's exactly what I need to do:

I am having t

2 Answers

你的背包 (original poster)

2021-02-06 00:33
Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values. For instance, you can prefix each value with a special character (although a binary encoding would be much better).

So the map output will be something like:
```
 tile0, _point0
 tile1, _point0
 tile2, _point1 
 ...
 tileX, *lineL
 tileY, *lineK
 ...
```
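The tagging step can be sketched in plain Java without any Hadoop dependency. This is only an illustration under assumptions: the class name `TileTagMapper`, the `map` helper, and the idea that the list of tiles for a record is already known are all hypothetical, since the question does not say how a point or line is assigned to tiles. In a real job the same logic would sit inside a Hadoop mapper and each pair would be written to the job context instead of being collected in a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a mapper; it collects output pairs in a list.
class TileTagMapper {
    static final char POINT_TAG = '_'; // marks values that come from the point dataset
    static final char LINE_TAG = '*';  // marks values that come from the line dataset

    // Emits one (tile, taggedValue) pair for every tile the record belongs to.
    // How the tiles are computed is application specific and assumed given here.
    static List<Map.Entry<String, String>> map(boolean isPoint, String recordId, List<String> tiles) {
        char tag = isPoint ? POINT_TAG : LINE_TAG;
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String tile : tiles) {
            out.add(new SimpleEntry<>(tile, tag + recordId));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints "tile0, _point0" and "tile1, _point0", then "tileX, *lineL".
        for (Map.Entry<String, String> e : map(true, "point0", Arrays.asList("tile0", "tile1"))) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
        for (Map.Entry<String, String> e : map(false, "lineL", Arrays.asList("tileX"))) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
    }
}
```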
Then, at the reducer, your input will have this structure:
```
 tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
```
and you will have to take the values, separate the points from the lines, compute their cross product, and output each pair of the cross product, like this:
```
tileX (lineK, pointP)
tileX (lineK, pointR)
...
```
If you can already easily differentiate between the point values and the line values (depending on your application's specifications), you don't need the special characters (`*`, `_`).

Regarding the cross product that you have to compute in the reducer: first iterate through the entire list of values and separate them into two lists:
```
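 List<String> points = new ArrayList<>();  // values tagged with '_'
 List<String> lines = new ArrayList<>();   // values tagged with '*'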
```
Then compute the cross product using two nested for loops. Finally, iterate through the resulting list and for each element output:
```
tile(current key), element_of_the_resulting_cross_product_list
```
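The reducer logic described above can be sketched in plain Java as follows. It is a minimal illustration rather than a drop-in Hadoop reducer: the class name `TileJoinReducer` and the `reduce` helper are hypothetical, values are plain strings, and the `_` and `*` prefixes are assumed to be the tags added by the mapper. Note that both lists are held in memory for each tile, which is what the cross product requires when the values arrive in arbitrary order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a reducer; it returns the output records as strings.
class TileJoinReducer {
    static List<String> reduce(String tile, Iterable<String> values) {
        List<String> points = new ArrayList<>();
        List<String> lines = new ArrayList<>();

        // Single pass over the values: split them by tag and strip the tag.
        for (String value : values) {
            if (value.startsWith("_")) {
                points.add(value.substring(1));
            } else if (value.startsWith("*")) {
                lines.add(value.substring(1));
            }
        }

        // Cross product: every line is paired with every point of the same tile.
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String point : points) {
                out.add(tile + " (" + line + ", " + point + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("*lineK", "_pointP", "*lineM", "_pointR");
        for (String record : reduce("tileX", values)) {
            System.out.println(record);
        }
    }
}
```

With the sample values in `main`, this prints four records: `lineK` and `lineM` each paired with `pointP` and `pointR`. A tile that receives only points or only lines produces no output, which matches the semantics of an inner join.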