hadoop inverted-index without recurrence of file names


You will only be able to remove duplicates in the Reducer. To do so, you can use a Set, which does not allow duplicates.

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Text implements equals() and hashCode(), so a HashSet dedupes it correctly
    Set<Text> outputValues = new HashSet<Text>();

    while (values.hasNext()) {
      // copy the value: Hadoop reuses the same Text instance across calls to next()
      Text value = new Text(values.next());

      // takes care of removing duplicates
      outputValues.add(value);
    }

    boolean first = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(outputIter.next().toString());
    }

    output.collect(key, new Text(toReturn.toString()));
}

Edit: now adds a copy of each value to the Set, as per Chris' comment.

You can improve performance by doing local map aggregation and introducing a combiner - basically, you want to reduce the amount of data transmitted between your mappers and reducers.

Local map aggregation is a concept whereby you maintain an LRU-like map (or set) of output pairs - in your case, a set of words for the current mapper's document (assuming you have a single document per map). For each word, look it up in the set: if the set doesn't already contain it (indicating you haven't yet output an entry for it), output the (word, docid) pair and add the word to the set.

If the set gets too big (say 5,000 or 10,000 entries), clear it out and start over. This way you'll see the number of values output from the mapper drop dramatically (this works best when the value domain, or set of distinct values, is small - words are a good example).
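A minimal sketch of such a mapper, using the same old (org.apache.hadoop.mapred) API as the reducer above. The class name, the flush threshold, the whitespace tokenization, and deriving the doc id from the input file name are all assumptions for illustration:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private static final int MAX_SET_SIZE = 5000; // flush threshold (assumed)

    // words this mapper has already output in the current window
    private final Set<String> seenWords = new HashSet<String>();

    public void map(LongWritable offset, Text line,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        // assumption: the doc id is the name of the file being processed
        String docId = ((FileSplit) reporter.getInputSplit())
                .getPath().getName();

        for (String word : line.toString().split("\\s+")) {
            // add() returns false if the word is already in the set,
            // so each word is output at most once per flush window
            if (!word.isEmpty() && seenWords.add(word)) {
                output.collect(new Text(word), new Text(docId));
            }
        }

        // keep the set bounded: clear it and start over when it grows too big
        if (seenWords.size() > MAX_SET_SIZE) {
            seenWords.clear();
        }
    }
}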

You can also introduce your reducer's dedup logic in the combiner phase.
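Note that the reducer above can't be registered as the combiner as-is: it emits a single comma-joined string, which would then arrive at the real reducer as one value. A sketch of a combiner that applies just the dedup part, emitting one (word, docid) pair per unique doc id so the types still line up (the class name is hypothetical):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DedupCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        Set<Text> unique = new HashSet<Text>();
        while (values.hasNext()) {
            // copy the value - Hadoop reuses the same Text instance
            unique.add(new Text(values.next()));
        }
        for (Text docId : unique) {
            output.collect(key, docId);
        }
    }
}

Register it on the job with conf.setCombinerClass(DedupCombiner.class).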

One final word of warning - be very careful about adding the key/value objects into sets (like in Matt D's answer): Hadoop reuses objects under the hood, so don't be surprised if you get unexpected results when you add in the references - always create a copy of the object.

There's an article on local map aggregation (for the word count example) that you may find useful.
