Question
I am working on a big Hadoop project and there is a small KPI where I have to write only the top 10 values to the reducer output. To meet this requirement, I use a counter and break out of the loop when the counter reaches 11, but the reducer still writes all of the values to HDFS.
This is pretty simple Java code, but I am stuck :(
For testing, I created a standalone class (a plain Java application) that does the same thing, and it works there; I'm wondering why it does not work in the reducer code.
Could someone please help me out and tell me if I am missing something?
MAP-REDUCE CODE
package comparableTest;

import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.IntWritable.Comparator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ValueSortExp2 {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration(true);
        String arguments[] = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = new Job(conf, "Test commond");
        job.setJarByClass(ValueSortExp2.class);

        // Setup MapReduce
        job.setMapperClass(MapTask2.class);
        job.setReducerClass(ReduceTask2.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setSortComparatorClass(IntComparator2.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(arguments[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(arguments[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }

    public static class IntComparator2 extends WritableComparator {

        public IntComparator2() {
            super(IntWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            Integer v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
            Integer v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
            return v1.compareTo(v2) * (-1);
        }
    }

    public static class MapTask2 extends Mapper<LongWritable, Text, IntWritable, Text> {

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String tokens[] = value.toString().split("\\t");
            // int empId = Integer.parseInt(tokens[0]);
            int count = Integer.parseInt(tokens[2]);
            context.write(new IntWritable(count), new Text(value));
        }
    }

    public static class ReduceTask2 extends Reducer<IntWritable, Text, IntWritable, Text> {

        int cnt = 0;

        public void reduce(IntWritable key, Iterable<Text> list, Context context)
                throws java.io.IOException, InterruptedException {

            for (Text value : list) {
                cnt++;
                if (cnt == 11) {
                    break;
                }
                context.write(new IntWritable(cnt), value);
            }
        }
    }
}
SIMPLE JAVA CODE WORKING FINE
package comparableTest;

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer.Context;

public class TestData {

    //static int cnt=0;

    public static void main(String args[]) throws IOException, InterruptedException {

        ArrayList<String> list = new ArrayList<String>() {{
            add("A");
            add("B");
            add("C");
            add("D");
        }};

        reduce(list);
    }

    public static void reduce(Iterable<String> list)
            throws java.io.IOException, InterruptedException {

        int cnt = 0;
        for (String value : list) {
            cnt++;
            if (cnt == 3) {
                break;
            }
            System.out.println(value);
        }
    }
}
Sample data (the header line is informational only; the actual data starts from the second line):
ID NAME COUNT (need to display top 10 desc)
1 Toy Story (1995) 2077
10 GoldenEye (1995) 888
100 City Hall (1996) 128
1000 Curdled (1996) 20
1001 Associate, The (L'Associe)(1982) 0
1002 Ed's Next Move (1996) 8
1003 Extreme Measures (1996) 121
1004 Glimmer Man, The (1996) 101
1005 D3: The Mighty Ducks (1996) 142
1006 Chamber, The (1996) 78
1007 Apple Dumpling Gang, The (1975) 232
1008 Davy Crockett, King of the Wild Frontier (1955) 97
1009 Escape to Witch Mountain (1975) 291
101 Bottle Rocket (1996) 253
1010 Love Bug, The (1969) 242
1011 Herbie Rides Again (1974) 135
1012 Old Yeller (1957) 301
1013 Parent Trap, The (1961) 258
1014 Pollyanna (1960) 136
1015 Homeward Bound: The Incredible Journey (1993) 234
1016 Shaggy Dog, The (1959) 156
1017 Swiss Family Robinson (1960) 276
1018 That Darn Cat! (1965) 123
1019 20,000 Leagues Under the Sea (1954) 575
102 Mr. Wrong (1996) 60
1020 Cool Runnings (1993) 392
1021 Angels in the Outfield (1994) 247
1022 Cinderella (1950) 577
1023 Winnie the Pooh and the Blustery Day (1968) 221
1024 Three Caballeros, The (1945) 126
1025 Sword in the Stone, The (1963) 293
1026 So Dear to My Heart (1949) 8
1027 Robin Hood: Prince of Thieves (1991) 344
1028 Mary Poppins (1964) 1011
1029 Dumbo (1941) 568
103 Unforgettable (1996) 33
1030 Pete's Dragon (1977) 323
1031 Bedknobs and Broomsticks (1971) 319
Answer 1:
If you move int cnt = 0; inside the reduce method (as the first statement of that method), you will get the first 10 values for each key (I guess this is what you want).
Otherwise, as it is now, cnt is an instance field, so it keeps increasing across all reduce() calls: you skip only the 11th value overall (regardless of key) and continue writing from the 12th onwards.
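A minimal sketch of that per-key fix, reusing the reducer from the question unchanged except for where cnt is declared:

public static class ReduceTask2 extends Reducer<IntWritable, Text, IntWritable, Text> {

    public void reduce(IntWritable key, Iterable<Text> list, Context context)
            throws java.io.IOException, InterruptedException {

        int cnt = 0; // local variable: starts from 0 again for every key group

        for (Text value : list) {
            cnt++;
            if (cnt == 11) {
                break; // the first 10 values of this key have been written
            }
            context.write(new IntWritable(cnt), value);
        }
    }
}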
If you want to write only 10 values in total (regardless of key), leave the cnt initialization where it is (as an instance field) and change the if condition to if (cnt > 10).
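A sketch of that variant; note it only makes sense with a single reducer, which the job already configures via job.setNumReduceTasks(1):

public static class ReduceTask2 extends Reducer<IntWritable, Text, IntWritable, Text> {

    int cnt = 0; // instance field: keeps its value across reduce() calls

    public void reduce(IntWritable key, Iterable<Text> list, Context context)
            throws java.io.IOException, InterruptedException {

        for (Text value : list) {
            cnt++;
            if (cnt > 10) {
                break; // 10 values written in total; skip everything that follows
            }
            context.write(new IntWritable(cnt), value);
        }
    }
}

In this particular job the approach happens to produce the global top 10, because there is a single reducer and the sort comparator orders the counts descending; it breaks as soon as more reducers are used.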
However, this is not good practice, so you may want to reconsider your algorithm. (Assuming you don't want 10 arbitrary values: how do you know which key will be processed first in a distributed environment, when you have more than one reducer and a hash partitioner?)
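One more defensive pattern (not part of the original answer; TopTenReducer is a hypothetical name) is to collect candidates into a bounded sorted structure and emit it in cleanup(), so the result does not depend on the order in which keys reach the reducer:

// Sketch only. Requires java.util.Map and java.util.TreeMap in the imports.
// Caveat: ties on count overwrite each other here; a real implementation
// would need a composite key or a multimap to keep all tied records.
public static class TopTenReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    // count -> record, kept sorted by count in ascending order
    private final TreeMap<Integer, Text> topTen = new TreeMap<Integer, Text>();

    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context) {
        for (Text value : values) {
            topTen.put(key.get(), new Text(value));
            if (topTen.size() > 10) {
                topTen.remove(topTen.firstKey()); // evict the smallest count
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the surviving 10 records, largest count first
        for (Map.Entry<Integer, Text> entry : topTen.descendingMap().entrySet()) {
            context.write(new IntWritable(entry.getKey()), entry.getValue());
        }
    }
}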
Source: https://stackoverflow.com/questions/46087100/counter-is-not-working-in-reducer-code