How to copy an iterator in Scala?

星月不相逢 2021-01-01 17:03

About duplicate

This is NOT a duplicate of "How to clone an iterator?"

Please do not blindly close this question, all the answers given in so-called

4 answers
  • 2021-01-01 17:19

    You can't duplicate an iterator without destroying it. By contract, an Iterator can only be traversed once.

    The question you linked to shows how to get two copies in exchange for the one you've destroyed. You cannot keep using the original, but you can now run the two new copies forward independently.
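
    To make the trade described above concrete, here is a minimal sketch built on the standard Iterator.duplicate method. The object name DuplicateDemo and the sample values are made up for illustration; the point is only that the two new iterators advance independently while the original is given up.

    ```scala
    object DuplicateDemo {
      def run(): (List[Int], List[Int]) = {
        val original = Iterator(1, 2, 3, 4, 5)
        original.next()                          // 1 is consumed for good

        // duplicate hands back two new iterators; `original` must not be used afterwards
        val (left, right) = original.duplicate
        left.next()                              // 2
        left.next()                              // 3

        // right is unaffected by the calls on left and still starts at 2
        (right.toList, left.toList)
      }
    }
    ```

    Calling DuplicateDemo.run() should return (List(2, 3, 4, 5), List(4, 5)): the copy that ran ahead only has its own remaining elements left, and the other copy still sees everything from the point of duplication.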

  • 2021-01-01 17:27

    As Rex said, it is impossible to make a copy of an Iterator without destroying it. That said, what is the problem with duplicate?

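    val list = List(1, 2, 3, 4, 5)
    var it1 = list.iterator
    it1.next()                          // consume 1

    // duplicate invalidates the original, so rebind it1 to one of the copies
    val (it1a, it1b) = it1.duplicate
    it1 = it1a
    val it2 = it1b
    it2.next()                          // advances only it2 (returns 2)

    println(it1.next())                 // prints 2: it1 was not affected by it2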
    
  • 2021-01-01 17:31

    I think this is a very good question; it's a pity that many people don't appreciate its value. In the age of Big Data there are many situations where we have a stream rather than an allocated list: the data cannot be collected or does not fit into memory, and replaying it from the very beginning is costly too. What can we do if we need two (or more) separate calculations over that data? For example, we may need to calculate min, max, sum, md5, etc. with already written functions, in different threads, using only one pass over the source.

    The general solution is to use Akka-Stream, which handles this. But is it possible with Iterator, the simplest way to represent such a streaming data source in Java/Scala? The answer is yes, although we "could NOT proceed with original and copy completely independently", in the sense that we have to synchronize the consumption speed of the consumer threads. (Akka-Stream does this using back-pressure and some intermediate buffers.)

    So here is my simple solution: use a Phaser. With it we can build an Iterator wrapper over a one-pass source. Each consumer thread uses this object as a plain Iterator. You have to know the number of consuming threads in advance. Also, each consumer thread MUST drain the source to the end, otherwise all the others hang (the flush() method can be used for that, for example).

    import java.util.concurrent.Phaser
    import java.util.concurrent.atomic.AtomicBoolean

    // it0 - input source iterator
    // num - exact number of consuming threads; it has to be known in advance
    case class ForkableIterator[+A]( it0: Iterator[A], num: Int ) extends Phaser(num) with Iterator[A] {

      // serial replicator: emits every source element num times, once per consumer
      val it = it0.flatMap( a => Iterator.fill(num)(a) )

      private val hasNext0 = new AtomicBoolean( it0.hasNext )
      override def hasNext: Boolean = hasNext0.get()

      override def next(): A = {
        arriveAndAwaitAdvance() // wait until every consumer has asked for the next element
        val next = it.synchronized {
          val next = it.next()
          if (hasNext0.get) hasNext0.set(it.hasNext)
          next
        }
        arriveAndAwaitAdvance() // otherwise the tasks lock up at the last data element
        next
      }

      // If a consumer stops reading before the end of the source data stream,
      // it HAS to drain the rest to avoid blocking the others.
      // Calling flush() can be skipped if all consumers read exactly the same amount of data,
      // e.g. until the very end.
      def flush(): Unit = while (hasNext) next()
    }
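
    As a rough usage sketch for the wrapper above, the following runs two aggregations over one single-pass source. It is not the author's code: SharedIterator is a condensed restatement that keeps the Phaser in a private field instead of extending it, and the names SharedIterator, ForkDemo, source and consumers are assumptions made for this example. A dedicated two-thread pool is used because both consumers block on the barrier and therefore need a thread each.

    ```scala
    import java.util.concurrent.{Executors, Phaser}
    import java.util.concurrent.atomic.AtomicBoolean
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._

    // Condensed restatement of the wrapper; holding the Phaser in a field
    // (rather than extending it) is a simplification made for this sketch.
    class SharedIterator[A](source: Iterator[A], consumers: Int) extends Iterator[A] {
      private val barrier = new Phaser(consumers)
      // every source element is emitted once per consumer
      private val replicated = source.flatMap(a => Iterator.fill(consumers)(a))
      private val more = new AtomicBoolean(source.hasNext)

      override def hasNext: Boolean = more.get()

      override def next(): A = {
        barrier.arriveAndAwaitAdvance()        // all consumers line up before reading
        val value = replicated.synchronized {
          val v = replicated.next()
          if (more.get()) more.set(replicated.hasNext)
          v
        }
        barrier.arriveAndAwaitAdvance()        // all consumers have read their copy
        value
      }
    }

    object ForkDemo {
      def run(): (Int, Int) = {
        // one thread per consumer, otherwise the barrier can never be passed
        val pool = Executors.newFixedThreadPool(2)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
        try {
          val shared = new SharedIterator(Iterator.range(1, 6), consumers = 2)
          val sum = Future(shared.sum)         // consumer 1 drains the source
          val max = Future(shared.max)         // consumer 2 drains the source
          (Await.result(sum, 10.seconds), Await.result(max, 10.seconds))
        } finally pool.shutdown()
      }
    }
    ```

    ForkDemo.run() should return (15, 5) for the sample range 1 to 5. Both consumers read to the end here, so no flush() is needed; a consumer that stopped early would stall the other one at the barrier.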
    

    PS I have successfully used this "ForkableIterator" with Spark to perform several independent aggregations over a long stream of source data. In that case I did not have to bother with creating threads manually. You may also use Scala Futures / Monix Tasks etc.

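    PPS I rechecked the JDK Phaser specification and found that it actually has an "unregister" method, called arriveAndDeregister(). So a consumer that finishes early can call it instead of flush(). Note that the replicator above still emits num copies of every element, so after a consumer deregisters the remaining consumers would no longer stay aligned unless the replication count is adjusted as well.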

  • 2021-01-01 17:38

    It's pretty easy to create a List iterator that you can duplicate without destroying it: this is basically the definition of the iterator method copied from the List source with a fork method added:

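    class ForkableIterator[A](list: List[A]) extends Iterator[A] {
        // remaining elements; List is immutable, so a fork can share it safely
        private var these = list
        def hasNext: Boolean = these.nonEmpty
        def next(): A =
          if (hasNext) {
            val result = these.head; these = these.tail; result
          } else Iterator.empty.next()   // throws NoSuchElementException
        // new iterator starting at the current position; the original stays usable
        def fork: ForkableIterator[A] = new ForkableIterator(these)
    }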
    

    Use:

    scala> val it = new ForkableIterator(List(1,2,3,4,5,6))
    it: ForkableIterator[Int] = non-empty iterator
    
    scala> it.next
    res72: Int = 1
    
    scala> val it2 = it.fork
    it2: ForkableIterator[Int] = non-empty iterator
    
    scala> it2.next
    res73: Int = 2
    
    scala> it2.next
    res74: Int = 3
    
    scala> it.next
    res75: Int = 2
    

    I had a look at doing this for HashMap, but it seems more complicated (partly because there are different map implementations depending on collection size). So it is probably best to use the above implementation on yourMap.toList.
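
    The toList suggestion could look roughly like the sketch below. The class is restated so the example is self-contained; MapForkDemo and the sample map are made up for illustration. Keep in mind that toList takes a snapshot, so the whole map is copied into a list up front.

    ```scala
    // Same list-backed iterator as above, restated so this sketch is self-contained.
    class ForkableIterator[A](list: List[A]) extends Iterator[A] {
      private var these = list
      def hasNext: Boolean = these.nonEmpty
      def next(): A =
        if (hasNext) { val result = these.head; these = these.tail; result }
        else Iterator.empty.next()
      def fork: ForkableIterator[A] = new ForkableIterator(these)
    }

    object MapForkDemo {
      // made-up sample data
      val sample: Map[String, Int] = Map("a" -> 1, "b" -> 2, "c" -> 3)

      def run(): (Int, Int) = {
        val it = new ForkableIterator(sample.toList) // snapshot of the map entries
        it.next()                                    // advance the original by one entry
        val forked = it.fork                         // independent copy at the same position
        val seenByFork = forked.size                 // drains only the fork
        val leftInOriginal = it.size                 // the original still has its entries
        (seenByFork, leftInOriginal)
      }
    }
    ```

    MapForkDemo.run() should return (2, 2): draining the fork does not move the original, which still yields the two entries that were left after the first next().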
