Iterate twice on values (MapReduce)

轮回少年 2020-11-29 07:22

I receive an iterator as an argument and I would like to iterate over the values twice.

public void reduce(Pair key, Iterator          


        
11 answers
  • 2020-11-29 08:01

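    You can do this with Hadoop's MarkableIterator:

    MarkableIterator<Text> mitr = new MarkableIterator<Text>(values.iterator());
    mitr.mark();
    while (mitr.hasNext()) 
    {
        Text value = mitr.next(); // advance the iterator, otherwise the loop never ends
        // do your work
    }
    mitr.reset(); // go back to the marked position
    while (mitr.hasNext()) 
    {
        Text value = mitr.next();
        // again do your work
    }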
    

  • 2020-11-29 08:06

    Note: if you cache the items in a list, clone each item before adding it to the cache. Otherwise you will find that every item in the cache is the same.

    This is caused by a memory optimization in MapReduce: in the reduce method, the Iterable reuses a single item instance for every value it returns.
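
    The reuse effect can be reproduced without Hadoop. The following is a minimal sketch in plain Java: ReusingIterator and CacheDemo are hypothetical names invented for this illustration, and a StringBuilder stands in for a Hadoop Writable such as Text. It is a simplified model of the behaviour, not Hadoop's actual code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the reducer's value iterator: it returns the SAME
// mutable object from every call to next() and only overwrites its contents.
class ReusingIterator implements Iterator<StringBuilder> {
    private final String[] data;
    private final StringBuilder shared = new StringBuilder();
    private int pos = 0;

    ReusingIterator(String... data) {
        this.data = data;
    }

    @Override
    public boolean hasNext() {
        return pos < data.length;
    }

    @Override
    public StringBuilder next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        shared.setLength(0);        // wipe the previous value
        shared.append(data[pos++]); // load the next value into the same object
        return shared;
    }
}

class CacheDemo {
    // Pitfall: stores the shared reference, so every cache entry is one object.
    static List<String> cacheByReference(Iterator<StringBuilder> it) {
        List<StringBuilder> cache = new ArrayList<>();
        while (it.hasNext()) {
            cache.add(it.next());
        }
        return render(cache);
    }

    // Fix: copy each value before caching it.
    static List<String> cacheByCopy(Iterator<StringBuilder> it) {
        List<StringBuilder> cache = new ArrayList<>();
        while (it.hasNext()) {
            cache.add(new StringBuilder(it.next()));
        }
        return render(cache);
    }

    private static List<String> render(List<StringBuilder> cache) {
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : cache) {
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cacheByReference(new ReusingIterator("a", "b", "c"))); // [c, c, c]
        System.out.println(cacheByCopy(new ReusingIterator("a", "b", "c")));      // [a, b, c]
    }
}
```

    With Hadoop's Text, the copy step would typically be the copy constructor, new Text(value).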

  • 2020-11-29 08:14

    Unfortunately this is not possible without caching the values as in Andreas_D's answer.

    Even using the new API, where the Reducer receives an Iterable rather than an Iterator, you cannot iterate twice. It's very tempting to try something like:

    for (IntWritable value : values) {
        // first loop
    }
    
    for (IntWritable value : values) {
        // second loop
    }
    

    But this won't actually work. The Iterator you receive from that Iterable's iterator() method is special. The values may not all be in memory; Hadoop may be streaming them from disk. They aren't really backed by a Collection, so it's nontrivial to allow multiple iterations.

    You can see this for yourself in the Reducer and ReduceContext code.
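
    To illustrate why the second loop sees nothing, here is a minimal sketch in plain Java. OneShotIterable and TwoLoopsDemo are hypothetical names made up for this example; the sketch assumes, as a simplification, that iterator() hands back the same one-shot iterator every time, which models a stream of values that is consumed as it is read.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for the reducer's values: iterator() always returns
// the same underlying one-shot iterator instead of starting over.
class OneShotIterable<T> implements Iterable<T> {
    private final Iterator<T> source;

    OneShotIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        return source; // no rewind: a second caller gets the exhausted iterator
    }
}

class TwoLoopsDemo {
    // Runs two for-each loops over the same Iterable and counts what each one sees.
    static int[] countBothPasses(Iterable<Integer> values) {
        int first = 0;
        for (Integer value : values) {
            first++;
        }
        int second = 0;
        for (Integer value : values) {
            second++;
        }
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        Iterable<Integer> values = new OneShotIterable<>(Arrays.asList(1, 2, 3).iterator());
        int[] counts = countBothPasses(values);
        System.out.println(counts[0] + " " + counts[1]); // 3 0
    }
}
```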

    Caching the values in a Collection of some sort may be the easiest answer, but you can easily blow the heap if you are operating on large datasets. If you can give us more specifics on your problem, we may be able to help you find a solution that doesn't involve multiple iterations.

  • 2020-11-29 08:15

    If the method signature cannot be changed, I would suggest using Apache Commons IteratorUtils to convert the Iterator to a ListIterator. Consider this example method for iterating twice over the values:

    void iterateTwice(Iterator<String> it) {
        ListIterator<?> lit = IteratorUtils.toListIterator(it);
        System.out.println("Using ListIterator 1st pass");
        while(lit.hasNext())
            System.out.println(lit.next());
    
        // move the list iterator back to start
        while(lit.hasPrevious())
            lit.previous();
    
        System.out.println("Using ListIterator 2nd pass");
        while(lit.hasNext())
            System.out.println(lit.next());
    }
    

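    Using code like the above, I was able to iterate over the values twice without keeping my own copy of the list elements. Note that toListIterator still reads every element into a list internally, so the memory cost is comparable to caching the values yourself.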

  • 2020-11-29 08:17

    After a lot of searching and trial and error, I found a solution.

    1. Declare a new collection (say cache), such as a LinkedList or an ArrayList.

    2. Inside the first iteration, add a copy of the current value to it, as in this example:

      cache.add(new Text(current));
      
    3. Iterate through cache:

      for (Text count : counts) {
          // counts is an Iterable of type Text
          // copy each value: getBytes() can return more bytes than the valid length
          cache.add(new Text(count));
      }
      for (Text value : cache) {
          // your logic..
      }
      
  • 2020-11-29 08:20

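    Try this:

        ListIterator<?> it = list.listIterator();
    
        // first pass
        while (it.hasNext()) {
            System.out.println("first " + it.next());
        }
        // rewind to the start
        while (it.hasPrevious()) {
            it.previous();
        }
        // second pass
        while (it.hasNext()) {
            System.out.println("second " + it.next());
        }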
    