Question
I was trying to analyse the default MapReduce job, the one that doesn't define a mapper or a reducer, i.e. one that uses IdentityMapper & IdentityReducer. To make things concrete I wrote my own identity reducer:
// (Uses java.util.Iterator, org.apache.hadoop.io.Text and the old org.apache.hadoop.mapred API.)
public static class MyIdentityReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Pass every value for this key straight through to the output.
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);
        }
    }
}
My input file was:
$ hadoop fs -cat NameAddress.txt
Dravid Banglore
Sachin Mumbai
Dhoni Ranchi
Dravid Jaipur
Dhoni Chennai
Sehwag Delhi
Gambhir Delhi
Gambhir Calcutta
I was expecting:
Dravid Jaipur
Dhoni Chennai
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi
Instead I got:
$ hadoop fs -cat NameAddress/part-00000
Dhoni Ranchi
Dhoni Chennai
Dravid Banglore
Dravid Jaipur
Gambhir Delhi
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi
I was under the impression that the aggregation is done by the programmer in the while loop of the reducer and then written to the OutputCollector, and that the keys passed to the OutputCollector are always unique, so that if I don't aggregate, the last value written for a key would override the previous ones. Clearly that is not the case. Could someone give me better insight into the OutputCollector: how it works and how it handles the keys? I see many implementations of OutputCollector in the Hadoop source code. Can I write my own OutputCollector that does what I am expecting?
Answer 1:
The keys passed to the reducer are unique: each call to reduce() receives one unique key together with an iterator over all of the values associated with that key. OutputCollector.collect() does not deduplicate or overwrite anything; it simply writes out whatever (key, value) pair you hand it. What your code does is iterate over all of the values for each key and write out every one of them.
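Concretely, for your sample input the framework sorts and groups the intermediate pairs by key, so the reducer is invoked once per name, roughly like this (the order of values inside the iterator is not guaranteed unless you set up a secondary sort; in your run they happened to arrive in input order):

    reduce("Dhoni",   ["Ranchi", "Chennai"],   output, reporter)
    reduce("Dravid",  ["Banglore", "Jaipur"],  output, reporter)
    reduce("Gambhir", ["Delhi", "Calcutta"],   output, reporter)
    reduce("Sachin",  ["Mumbai"],              output, reporter)
    reduce("Sehwag",  ["Delhi"],               output, reporter)

Every output.collect(key, value) inside those calls appends one line to part-00000, which is why all eight input records show up in the output.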
So even though there are fewer reduce() calls than input records (one per unique key rather than one per line), you still end up writing all of the values out.
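If you want one line per key (the output you expected), you don't need a custom OutputCollector; you just call collect() once per key. Here is a minimal sketch, using the same old org.apache.hadoop.mapred API as your code, of a reducer that keeps only the last value delivered for each key. The class name LastValueReducer is just for illustration, and again, without a secondary sort "last" simply means whichever value the framework happens to deliver last.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits exactly one (key, value) pair per unique key.
public class LastValueReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Text last = null;
        while (values.hasNext()) {
            // Copy the value: Hadoop reuses the same Text instance across next() calls.
            last = new Text(values.next());
        }
        if (last != null) {
            output.collect(key, last); // exactly one output line per key
        }
    }
}

With a reducer like this, each of the five distinct names produces exactly one output line, giving the five-line result you were expecting.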
Source: https://stackoverflow.com/questions/12763478/how-outputcollector-works