I'm new to Hadoop, and i'm trying to do a MapReduce program, to count the max first two occurrencise of lecters by date (grouped by month). So my input is of this kind :
2017-06-01 , A, B, A, C, B, E, F
2017-06-02 , Q, B, Q, F, K, E, F
2017-06-03 , A, B, A, R, T, E, E
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
so, i'm expeting as result of this MapReducer program, something like :
2017-06, A:4, E:4
2017-07, A:4, B:4
public class ArrayGiulioTest {
public static Logger logger = Logger.getLogger(ArrayGiulioTest.class);
public static class CustomMap extends Mapper<LongWritable, Text, Text, TextWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
TextWritable array = new TextWritable();
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line, ",");
String dataAttuale = tokenizer.nextToken().substring(0,
line.lastIndexOf("-"));
Text tmp = null;
Text[] tmpArray = new Text[tokenizer.countTokens()];
int i = 0;
while (tokenizer.hasMoreTokens()) {
String prod = tokenizer.nextToken(",");
word.set(dataAttuale);
tmp = new Text(prod);
tmpArray[i] = tmp;
i++;
}
array.set(tmpArray);
context.write(word, array);
}
}
public static class CustomReduce extends Reducer<Text, TextWritable, Text, Text> {
public void reduce(Text key, Iterator<TextWritable> values,
Context context) throws IOException, InterruptedException {
MapWritable map = new MapWritable();
Text txt = new Text();
while (values.hasNext()) {
TextWritable array = values.next();
Text[] tmpArray = (Text[]) array.toArray();
for(Text t : tmpArray) {
if(map.get(t)!= null) {
IntWritable val = (IntWritable) map.get(t);
map.put(t, new IntWritable(val.get()+1));
} else {
map.put(t, new IntWritable(1));
}
}
}
Set<Writable> set = map.keySet();
StringBuffer str = new StringBuffer();
for(Writable k : set) {
str.append("key: " + k.toString() + " value: " + map.get(k) + "**");
}
txt.set(str.toString());
context.write(key, txt);
}
}
public static void main(String[] args) throws Exception {
long inizio = System.currentTimeMillis();
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "countProduct");
job.setJarByClass(ArrayGiulioTest.class);
job.setMapperClass(CustomMap.class);
//job.setCombinerClass(CustomReduce.class);
job.setReducerClass(CustomReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TextWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
long fine = System.currentTimeMillis();
logger.info("**************************************End" + (End-Start));
System.exit(1);
}
}
and i've implemented my custom TextWritable in this way :
public class TextWritable extends ArrayWritable {
public TextWritable() {
super(Text.class);
}
}
..so when i run my MapReduce program i obtain a result of this kind
2017-6 wordcount.TextWritable@3e960865
2017-6 wordcount.TextWritable@3e960865
it's obvious that my reducer it doesn't works. It seems the output from my Mapper
Any idea? And someone can says if is the right path to the solution?
Here Console Log (Just for information, my input file has 6 rows instead of 5) *I obtain the same result starting MapReduce problem under eclipse(mono JVM) or using Hadoop with Hdfs
File System Counters
FILE: Number of bytes read=1216
FILE: Number of bytes written=431465
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=6
Map output records=6
Map output bytes=214
Map output materialized bytes=232
Input split bytes=97
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=232
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=394264576
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=208
File Output Format Counters
Bytes Written=1813
I think you're trying to do too much work in the Mapper. You only need to group the dates (which it seems you aren't formatting them correctly anyway based on your expected output).
The following approach is going to turn these lines, for example
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
Into this pair for the reducer
2017-07 , ("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")
In other words, you gain no real benefit by using an ArrayWritable
, just keep it as text.
So, the Mapper would look like this
class CustomMap extends Mapper<LongWritable, Text, Text, Text> {
private final Text key = new Text();
private final Text output = new Text();
@Override
protected void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
int separatorIndex = value.find(",");
final String valueStr = value.toString();
if (separatorIndex < 0) {
System.err.printf("mapper: not enough records for %s", valueStr);
return;
}
String dateKey = valueStr.substring(0, separatorIndex).trim();
String tokens = valueStr.substring(1 + separatorIndex).trim().replaceAll("\\p{Space}", "");
SimpleDateFormat fmtFrom = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat fmtTo = new SimpleDateFormat("yyyy-MM");
try {
dateKey = fmtTo.format(fmtFrom.parse(dateKey));
key.set(dateKey);
} catch (ParseException ex) {
System.err.printf("mapper: invalid key format %s", dateKey);
return;
}
output.set(tokens);
context.write(key, output);
}
}
And then the reducer can build a Map that collects and counts the values from the value strings. Again, writing out only Text.
class CustomReduce extends Reducer<Text, Text, Text, Text> {
private final Text output = new Text();
@Override
protected void reduce(Text date, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, Integer> keyMap = new TreeMap<>();
for (Text v : values) {
String[] keys = v.toString().trim().split(",");
for (String key : keys) {
if (!keyMap.containsKey(key)) {
keyMap.put(key, 0);
}
keyMap.put(key, 1 + keyMap.get(key));
}
}
output.set(mapToString(keyMap));
context.write(date, output);
}
private String mapToString(Map<String, Integer> map) {
StringBuilder sb = new StringBuilder();
String delimiter = ", ";
for (Map.Entry<String, Integer> entry : map.entrySet()) {
sb.append(
String.format("%s:%d", entry.getKey(), entry.getValue())
).append(delimiter);
}
sb.setLength(sb.length()-delimiter.length());
return sb.toString();
}
}
Given your input, I got this
2017-06 A:4, B:4, C:1, E:4, F:3, K:1, Q:2, R:1, T:1
2017-07 A:4, B:4, C:1, E:1, F:1, G:3
The main problem is about the sign of the reduce method :
I was writing : public void reduce(Text key, Iterator<TextWritable> values,
Context context)
instead of
public void reduce(Text key, Iterable<ArrayTextWritable> values,
This is the reason why i obtain my Map output instead of my Reduce otuput
来源:https://stackoverflow.com/questions/44399163/java-mapreduce-counting-by-date