How to implement self-join/cross-product with Hadoop?
Question:

It is a common task to evaluate pairs of items, for example in de-duplication, collaborative filtering, or finding similar items. This is basically a self-join, or a cross-product of a data source with itself.

Answer 1:

To do a self-join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as the key and the record as the value. So, let's say we wanted to do a self-join on "city" (the middle column) of the following data:

don,baltimore,12
jerry,boston,19
bob,baltimore,99
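To make the pattern concrete, here is a minimal sketch in plain Python that simulates the three phases in-process. It does not use the Hadoop API: `map_record`, `shuffle`, `reduce_group`, and `self_join` are hypothetical names chosen for illustration, and `shuffle` only stands in for the grouping by key that the framework performs between the map and reduce phases. The pairing rule in the reducer (every unordered pair of records sharing a city) is an assumption about what the join should produce.

```python
from collections import defaultdict
from itertools import combinations

# Sample data from the answer: name,city,value
RECORDS = [
    "don,baltimore,12",
    "jerry,boston,19",
    "bob,baltimore,99",
]


def map_record(line):
    """Mapper: emit (join key, record). The join key is the city (middle column)."""
    _name, city, _value = line.split(",")
    return city, line


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group mapper output by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, records):
    """Reducer: pair up every two records that share the key (assumed rule)."""
    for left, right in combinations(records, 2):
        yield key, left, right


def self_join(lines):
    """Run map, shuffle, and reduce in sequence and collect the joined rows."""
    groups = shuffle(map_record(line) for line in lines)
    joined = []
    for key in sorted(groups):  # process keys in sorted order
        joined.extend(reduce_group(key, groups[key]))
    return joined


if __name__ == "__main__":
    for row in self_join(RECORDS):
        print(row)
```

With the sample data, only "baltimore" has more than one record, so the sketch yields a single joined row pairing don with bob; "boston" produces nothing because jerry has no partner. Note that the reducer here holds all records for a key in memory, which is fine for a sketch but can be a problem for very large groups.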