Java MapReduce counting by date

萌比男神i 2021-01-28 09:16

I'm new to Hadoop, and I'm trying to write a MapReduce program that counts the two most frequent letters by date (grouped by month). My input looks like this:

2 Answers
  • 2021-01-28 10:02

    The main problem was the signature of the reduce method.

    I was writing: public void reduce(Text key, Iterator<TextWritable> values, Context context)

    instead of

        public void reduce(Text key, Iterable<ArrayTextWritable> values,
    

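    This is why I got my Map output instead of my Reduce output: a method that takes an Iterator does not override Reducer.reduce, so Hadoop runs the default implementation, which writes every input pair through unchanged.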
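    A minimal sketch of why the wrong parameter type goes unnoticed, in plain Java with no Hadoop dependency. BaseReducer, WrongReducer, RightReducer and OverrideDemo are hypothetical stand-ins invented for illustration, not Hadoop classes; only the overload-versus-override behaviour is the point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a framework base class; this is NOT Hadoop's Reducer.
class BaseReducer {
    // Default behaviour: pass every value through unchanged (identity).
    void reduce(String key, Iterable<String> values, List<String> out) {
        for (String v : values) {
            out.add(key + "\t" + v);
        }
    }
}

// Iterator instead of Iterable: this is a new overload, not an override,
// so code that only knows the base-class signature never calls it.
class WrongReducer extends BaseReducer {
    void reduce(String key, Iterator<String> values, List<String> out) {
        out.add(key + "\tcustom");
    }
}

// Matching parameter types plus @Override: the compiler confirms the override.
class RightReducer extends BaseReducer {
    @Override
    void reduce(String key, Iterable<String> values, List<String> out) {
        out.add(key + "\tcustom");
    }
}

class OverrideDemo {
    // Mimics a framework that calls reduce through the base-class type.
    static List<String> run(BaseReducer reducer) {
        List<String> out = new ArrayList<>();
        Iterable<String> values = Arrays.asList("a", "b");
        reducer.reduce("2017-07", values, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new WrongReducer())); // falls back to the identity output
        System.out.println(run(new RightReducer())); // uses the custom logic
    }
}
```

    Putting @Override on the method in WrongReducer turns the silent fallback into a compile error, which is the cheapest way to catch this kind of mismatch.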

  • 2021-01-28 10:15

    I think you're trying to do too much work in the Mapper. You only need to group by date (and, judging by your expected output, you aren't formatting the dates correctly anyway).

    The following approach turns these lines, for example,

    2017-07-01 , A, B, A, C, B, E, F
    2017-07-05 , A, B, A, G, B, G, G
    

    into this pair for the reducer:

    2017-07 , ("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")
    

    In other words, you gain no real benefit from using an ArrayWritable; just keep the values as Text.


    So, the Mapper would look like this:

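    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class CustomMap extends Mapper<LongWritable, Text, Text, Text> {
    
        // Reused for every record instead of allocating new Writables
        private final Text key = new Text();
        private final Text output = new Text();
    
        @Override
        protected void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
    
            final String valueStr = value.toString();
            // Search the String rather than the Text: Text.find returns a byte
            // offset, which differs from the char index for non-ASCII input
            int separatorIndex = valueStr.indexOf(',');
            if (separatorIndex < 0) {
                System.err.printf("mapper: not enough records for %s%n", valueStr);
                return;
            }
            // Everything before the first comma is the date, the rest are tokens
            String dateKey = valueStr.substring(0, separatorIndex).trim();
            String tokens = valueStr.substring(1 + separatorIndex).trim().replaceAll("\\p{Space}", "");
    
            SimpleDateFormat fmtFrom = new SimpleDateFormat("yyyy-MM-dd");
            SimpleDateFormat fmtTo = new SimpleDateFormat("yyyy-MM");
    
            try {
                // Truncate the day so all records of one month share a key
                dateKey = fmtTo.format(fmtFrom.parse(dateKey));
                key.set(dateKey);
            } catch (ParseException ex) {
                System.err.printf("mapper: invalid key format %s%n", dateKey);
                return;
            }
    
            output.set(tokens);
            context.write(key, output);
        }
    }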
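    The key/value transform in the mapper can be checked without a cluster. Here is a rough sketch of the same parsing steps as a plain static method; MonthKeyDemo and toKeyValue are names made up for this illustration, and returning null for a bad line is just a convenience of the sketch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

class MonthKeyDemo {
    // Returns {monthKey, tokens} for one input line, or null if the line is malformed.
    static String[] toKeyValue(String line) {
        int separatorIndex = line.indexOf(',');
        if (separatorIndex < 0) {
            return null; // no comma, so there is nothing after the date
        }
        String date = line.substring(0, separatorIndex).trim();
        // Strip all whitespace so the tokens become "A,B,A,..."
        String tokens = line.substring(separatorIndex + 1).replaceAll("\\p{Space}", "");

        SimpleDateFormat from = new SimpleDateFormat("yyyy-MM-dd");
        SimpleDateFormat to = new SimpleDateFormat("yyyy-MM");
        try {
            // Drop the day so every record of a month maps to the same key
            return new String[] { to.format(from.parse(date)), tokens };
        } catch (ParseException ex) {
            return null; // date is not in yyyy-MM-dd form
        }
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("2017-07-01 , A, B, A, C, B, E, F");
        System.out.println(kv[0] + " -> " + kv[1]); // 2017-07 -> A,B,A,C,B,E,F
    }
}
```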
    

    The reducer can then build a Map that counts the tokens found in the value strings, again writing out only Text:

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    class CustomReduce extends Reducer<Text, Text, Text, Text> {
    
        private final Text output = new Text();
    
        @Override
        protected void reduce(Text date, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    
            // TreeMap keeps the tokens in alphabetical order for the output
            Map<String, Integer> keyMap = new TreeMap<>();
            for (Text v : values) {
                String[] keys = v.toString().trim().split(",");
    
                for (String key : keys) {
                    if (key.isEmpty()) {
                        continue; // skip blanks left by stray commas or empty lines
                    }
                    // Insert 1 for a new token, otherwise add 1 to its count
                    keyMap.merge(key, 1, Integer::sum);
                }
            }
    
            output.set(mapToString(keyMap));
            context.write(date, output);
        }
    
        // Formats the counts as "A:4, B:4, ..."; an empty map gives an empty string
        private String mapToString(Map<String, Integer> map) {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                if (sb.length() > 0) {
                    sb.append(", ");
                }
                sb.append(entry.getKey()).append(':').append(entry.getValue());
            }
            return sb.toString();
        }
    }
    

    Given your input, I got this:

    2017-06 A:4, B:4, C:1, E:4, F:3, K:1, Q:2, R:1, T:1
    2017-07 A:4, B:4, C:1, E:1, F:1, G:3
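    The output above lists every count, while the question asks for only the two most frequent letters per month. One possible way to narrow it down, sketched in plain Java so it runs without Hadoop: TopTwoDemo, count and topN are hypothetical names, and breaking ties alphabetically is an assumption, since the question does not say how ties should be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopTwoDemo {
    // Counts comma-separated tokens across all value strings of one month.
    static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            for (String token : v.split(",")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keeps the n highest counts; ties are broken alphabetically (an assumption).
    static String topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            parts.add(e.getKey() + ":" + e.getValue());
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        // The two value strings the reducer receives for 2017-07 in the example
        List<String> july = Arrays.asList("A,B,A,C,B,E,F", "A,B,A,G,B,G,G");
        System.out.println("2017-07 " + topN(count(july), 2)); // 2017-07 A:4, B:4
    }
}
```

    In the reducer, the same selection could replace the call that formats the whole map, so only the top two entries are written for each month.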
    