I have a question about configuring Map/Side inner join for multiple mappers in Hadoop. Suppose I have two very large data sets A and B, I use the same partition and sort a
There are map- and reduce side joins.
You proposed to use a map side join, which is executed inside a mapper and not before it.
Both sides must have the same key and value types. So you can't join a LongWritable
and a Text
, although they might have the same value.
There are subtle more things to note:
The whole procedure basically works like this: You have dataset A and dataset B, both share the same key, let's say LongWritable
.
If the number of the files to be joined is not equal it will result in an exception during job setup.
Setting up a join is kind of painful, mainly because you have to use the old API for mapper and reducer if your version is less than 0.21.x.
This document describes very well how it works. Scroll all the way to the bottom, sadly this documentation is somehow missing in the latest Hadoop docs.
Another good reference is "Hadoop the Definitive Guide", which explains all of this in more detail and with examples.
I think you're missing the point. You don't control the number of mappers. It's the number of reducers that you have control over. Simply emit the correct keys from your mapper. Then run 10 reducers.