Filtering Scala's Parallel Collections with early abort when desired number of results found

终归单人心 2021-02-08 13:02

Given a very large instance of collection.parallel.mutable.ParHashMap (or any other parallel collection), how can one abort a filtering parallel scan once a given number of matching results has been found?

3 Answers
  •  无人及你
    2021-02-08 13:29

    You could get an iterator and then create a lazy list (a Stream) on which you filter with your predicate and take the number of elements you want. Because a Stream is non-strict, this 'taking' of elements is not evaluated immediately. Afterwards you can force the execution, and you can add ".par" to the whole thing to parallelize it.

    Example code:

    A parallelized map with random values (simulating your parallel hash map):

    scala> myMap
    res14: scala.collection.parallel.immutable.ParMap[Int,Int] = ParMap(66978401 -> -1331298976, 256964068 -> 126442706, 1698061835 -> 1622679396, -1556333580 -> -1737927220, 791194343 -> -591951714, -1907806173 -> 365922424, 1970481797 -> 162004380, -475841243 -> -445098544, -33856724 -> -1418863050, 1851826878 -> 64176692, 1797820893 -> 405915272, -1838192182 -> 1152824098, 1028423518 -> -2124589278, -670924872 -> 1056679706, 1530917115 -> 1265988738, -808655189 -> -1742792788, 873935965 -> 733748120, -1026980400 -> -163182914, 576661388 -> 900607992, -1950678599 -> -731236098)
    

    Get an iterator, create a Stream from it, and filter it. In this case my predicate only accepts pairs whose value is even. I want to get 10 such elements, so I take 10, which only get evaluated when I force the Stream:

    scala> val mapIterator = myMap.toIterator
    mapIterator: Iterator[(Int, Int)] = HashTrieIterator(20)
    
    
    scala> val r = Stream.continually(mapIterator.next()).filter(_._2 % 2 == 0).take(10)
    r: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), ?)
    

    Finally, I force the evaluation, which retrieves only the 10 elements, as planned:

    scala> r.force
    res16: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), (256964068,126442706), (1698061835,1622679396), (-1556333580,-1737927220), (791194343,-591951714), (-1907806173,365922424), (1970481797,162004380), (-475841243,-445098544), (-33856724,-1418863050), (1851826878,64176692))
    

    This way you only fetch the number of elements you want (without processing the remaining elements), and you can parallelize the process without locks, atomics, or breaks.

    Please compare this to your solutions to see if it is any good.
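
    For reference, on Scala 2.13+ `Stream` is deprecated in favour of `LazyList`, so the same early-abort idea can be sketched as follows. The map contents here are made up for illustration, and `.to(LazyList)` on the iterator replaces `Stream.continually(it.next())` (which throws once the iterator is exhausted):

    ```scala
    object EarlyAbortFilter {
      def main(args: Array[String]): Unit = {
        // Stand-in for the large map; contents are illustrative only.
        val myMap: Map[Int, Int] = (1 to 1000).map(i => i -> i * 3).toMap

        // LazyList evaluates elements on demand, so filtering stops as soon
        // as the first 10 matches have been produced; the rest of the map
        // is never examined.
        val firstTenEven: List[(Int, Int)] =
          myMap.iterator
            .to(LazyList)
            .filter { case (_, v) => v % 2 == 0 }
            .take(10)
            .toList

        println(firstTenEven.size)
      }
    }
    ```

    Unlike `Stream.continually(it.next())`, this version terminates cleanly even when fewer than 10 elements satisfy the predicate.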
