How to implement self-join/cross-product with hadoop?

Question

It is a common task to evaluate pairs of items, for example in de-duplication, collaborative filtering, or finding similar items. This is basically a self-join, or a cross product of a data source with itself.

Answer 1:

To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.

So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:

don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2

The mapper would emit the key->value pairs:

(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)

In the reducer, we'll get this:

(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
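
The map and grouping steps above can be sketched in plain Python, with no Hadoop involved. The names `map_record` and `shuffle` are hypothetical, and the in-memory dictionary only stands in for the grouping that the framework performs between the map and reduce phases.

```python
from collections import defaultdict

RECORDS = [
    "don,baltimore,12",
    "jerry,boston,19",
    "bob,baltimore,99",
    "cameron,baltimore,13",
    "james,seattle,1",
    "peter,seattle,2",
]


def map_record(line):
    """Emit (join key, rest of the record) for one input line."""
    name, city, number = line.split(",")
    return city, (name, int(number))


def shuffle(pairs):
    """Group values by key; an assumption standing in for Hadoop's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


grouped = shuffle(map_record(line) for line in RECORDS)
for city, values in sorted(grouped.items()):
    print(city, "->", values)
```

In a real job the mapper would only emit the pairs; the grouping by key is done by the framework, not by your code.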

From here, you can do the inner join logic, if you so choose. To do this, match every item against every other item: load the data into an array list, then run an N x N loop over the items to compare each one with every other.
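
A minimal sketch of that reducer logic, again in plain Python rather than against the Hadoop API. The function name `reduce_self_join` is hypothetical, and the list here plays the role of the array list.

```python
def reduce_self_join(key, values):
    """Pair every record for this key with every other record for the same key."""
    items = list(values)  # buffers the whole group in memory
    pairs = []
    for i in range(len(items)):
        for j in range(len(items)):
            if i != j:  # skip joining a record with itself
                pairs.append((key, items[i], items[j]))
    return pairs


baltimore_pairs = reduce_self_join(
    "baltimore", [("don", 12), ("bob", 99), ("cameron", 13)]
)
for pair in baltimore_pairs:
    print(pair)
```

This emits both (don, bob) and (bob, don). If the comparison is symmetric, as it often is for de-duplication, starting the inner loop at `i + 1` halves the work.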

Keep in mind that reduce-side joins are expensive. Unless you filter something out in the mapper, they send pretty much all of the data to the reducers. Also, be careful about loading the data into memory in the reducers: on a hot join key, loading all of the values into an array list may blow your heap.

The above is a bit different from the typical reduce-side join, but the idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both.
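
As a rough illustration of the two-data-set case, the sketch below uses one mapper function per input and tags each value with its origin so the reducer can tell the sides apart. It is plain Python and does not use MultipleInputs itself. The `CITIES` data set, the tags, and all function names are assumptions made up for this example.

```python
from collections import defaultdict

PEOPLE = ["don,baltimore,12", "jerry,boston,19", "james,seattle,1"]
CITIES = ["baltimore,MD", "boston,MA"]  # hypothetical second data set


def map_person(line):
    """Mapper for the first input: key on city, tag the value as 'person'."""
    name, city, number = line.split(",")
    return city, ("person", (name, number))


def map_city(line):
    """Mapper for the second input: key on city, tag the value as 'city'."""
    city, state = line.split(",")
    return city, ("city", state)


def reduce_join(key, tagged_values):
    """Inner join: combine every 'person' value with every 'city' value."""
    people = [v for tag, v in tagged_values if tag == "person"]
    states = [v for tag, v in tagged_values if tag == "city"]
    return [(key, name, number, state)
            for name, number in people for state in states]


# Stand-in for the shuffle: values from both mappers meet under the same key.
grouped = defaultdict(list)
for mapper, lines in ((map_person, PEOPLE), (map_city, CITIES)):
    for line in lines:
        key, value = mapper(line)
        grouped[key].append(value)

joined = []
for key in sorted(grouped):
    joined.extend(reduce_join(key, grouped[key]))
print(joined)
```

Because this is an inner join, the seattle record is dropped: nothing on the `CITIES` side shares its key.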

A cross product with no constraints at all is a nightmare. For example:

select * from tablea, tableb;

There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.

If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
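
To make the point about a similarity key concrete, here is a small plain-Python sketch for a de-duplication case. The sample names and the choice of key (the lower-cased last word of a name) are assumptions for illustration only; a real job would need a key suited to its own data.

```python
from collections import defaultdict
from itertools import combinations

NAMES = ["Jon Smith", "John Smith", "Mary Jones", "Marie Jones", "Al Brown"]


def blocking_key(name):
    """Hypothetical join key: the lower-cased last word of the name."""
    return name.split()[-1].lower()


# Group by the join key, as the shuffle would do for mapper output.
blocks = defaultdict(list)
for name in NAMES:
    blocks[blocking_key(name)].append(name)

# Only records that share a key are compared with each other.
candidate_pairs = [
    pair for members in blocks.values() for pair in combinations(members, 2)
]
all_pairs = list(combinations(NAMES, 2))
print(len(candidate_pairs), "candidate pairs instead of", len(all_pairs))
```

The trade-off is that duplicates which do not share the key are never compared, so the key has to capture whatever you consider fundamental to similarity.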

Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.

Answer 2:

One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.
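
One way to read this, sketched in plain Python rather than with the Hadoop API: the same records are mapped twice, once with a marker `"L"` and once with `"R"`, and the reducer then joins the two sides as if they came from different inputs. The marker values and function names are assumptions for illustration.

```python
from collections import defaultdict

RECORDS = ["james,seattle,1", "peter,seattle,2", "jerry,boston,19"]


def map_with_marker(line, marker):
    """Emit (city, (marker, name)); the marker records which pass this is."""
    name, city, _number = line.split(",")
    return city, (marker, name)


# The same input is mapped twice, once per marker.
grouped = defaultdict(list)
for marker in ("L", "R"):
    for line in RECORDS:
        key, value = map_with_marker(line, marker)
        grouped[key].append(value)


def reduce_join(key, tagged_values):
    """Join the 'L' side with the 'R' side, skipping a name paired with itself."""
    left = [name for marker, name in tagged_values if marker == "L"]
    right = [name for marker, name in tagged_values if marker == "R"]
    return [(key, a, b) for a in left for b in right if a != b]


seattle = reduce_join("seattle", grouped["seattle"])
print(seattle)
```

Note that this doubles the data sent to the reducers compared with emitting each record once, which is the cost of treating the self-join like an ordinary two-sided join.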

Source: https://stackoverflow.com/questions/11066434/how-to-implement-self-join-cross-product-with-hadoop

Tags

Hadoop

MapReduce

self-join