How to optimize this short factorial function in Scala? (Creating 50000 BigInts)

名媛妹妹 2021-02-13 06:41

I've compared the Scala version

(BigInt(1) to BigInt(50000)).reduce(_ * _)

to the Python version

reduce(lambda x,y: x*y, rang         


        
4 answers
  • 2021-02-13 07:28

    Another trick here is to try both reduceLeft and reduceRight to see which is faster. On your example, reduceRight runs much faster for me:

    scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
    Took: 4605 ms
    
    scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
    Took: 2004 ms
    

    The same difference shows up between foldLeft and foldRight. I guess it matters which side of the tree you start reducing from :)
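
    To make "which side of the tree" concrete, here is a minimal sketch that prints how each direction groups the multiplications. The `ReduceGrouping` object and its method names are hypothetical, introduced only for illustration. It shows that the two directions build differently shaped expressions with the same value; it does not by itself establish why one is faster.

```scala
object ReduceGrouping {
  // Render the grouping that reduceLeft produces, e.g. (((1*2)*3)*4)
  def left(n: Int): String =
    (1 to n).map(_.toString).reduceLeft((a, b) => s"($a*$b)")

  // Render the grouping that reduceRight produces, e.g. (1*(2*(3*4)))
  def right(n: Int): String =
    (1 to n).map(_.toString).reduceRight((a, b) => s"($a*$b)")

  // Both directions yield the same factorial, because multiplication is associative
  def factorialLeft(n: Int): BigInt = (BigInt(1) to BigInt(n)).reduceLeft(_ * _)
  def factorialRight(n: Int): BigInt = (BigInt(1) to BigInt(n)).reduceRight(_ * _)
}
```

    With reduceRight the largest factors are combined first, so the intermediate operands differ in size from those of reduceLeft even though the final result is identical.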

  • 2021-02-13 07:31

    The most efficient way to calculate a factorial in Scala is to use a divide-and-conquer strategy:

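    // n! as the product of the range 1..n; 0! and 1! are both 1
    def fact(n: Int): BigInt = {
      require(n >= 0, "n must be non-negative")
      if (n < 2) BigInt(1) else rangeProduct(1, n)
    }
    
    // Product of all integers in n1..n2 (requires n1 <= n2), splitting the range in half
    private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
      case 0 => BigInt(n1)
      case 1 => BigInt(n1 * n2)
      case 2 => BigInt(n1 * (n1 + 1)) * n2
      case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
      case _ => 
        val nm = (n1 + n2) >> 1  // midpoint of the range
        rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
    }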
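
    The same divide-and-conquer idea can be sketched over an arbitrary indexed sequence and cross-checked against the built-in product. This is only an illustration under my own naming: `ProductTree`, `product`, and `factorial` are hypothetical and are not part of the answer's code. The point of the balanced split is that both operands of each multiplication have similar size, instead of one huge accumulator being multiplied by one small number.

```scala
object ProductTree {
  // Multiplies xs(from) up to xs(until - 1) by halving the index range recursively
  def product(xs: IndexedSeq[BigInt], from: Int, until: Int): BigInt =
    until - from match {
      case 0 => BigInt(1)        // empty range: multiplicative identity
      case 1 => xs(from)         // single element
      case _ =>
        val mid = from + (until - from) / 2
        product(xs, from, mid) * product(xs, mid, until)
    }

  // n! computed with the balanced product above
  def factorial(n: Int): BigInt = {
    require(n >= 0, "n must be non-negative")
    product((1 to n).map(BigInt(_)), 0, n)
  }
}
```

    For example, `ProductTree.factorial(200)` should equal `(BigInt(1) to BigInt(200)).product`.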
    

    Also, to get more speed, use the latest version of the JDK and the following JVM options:

    -server -XX:+TieredCompilation
    

    Below are the results on an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz) with 12 GB of DDR3-1333 RAM, Windows 7 SP1, and 64-bit Oracle JDK 1.8.0_25-b18:

    (BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
    (BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
    (BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
    (BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
    (BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
    (BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
    (BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
    (BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
    (BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
    (BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
    fact(100000) took: 203 ms with 28.3 % of CPU usage
    

    BTW, to improve the efficiency of factorial calculation for numbers greater than 20000, use the following implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9 and Scala is able to use it.

  • 2021-02-13 07:40

    Python on my machine:

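    import time

    def func():
      # Python 2 code: reduce is a builtin and time.clock() is still available
      start = time.clock()
      # range(1, 50000) stops at 49999, so this multiplies 1..49999
      reduce(lambda x, y: x * y, range(1, 50000))
      end = time.clock()
      t = (end - start) * 1000  # elapsed time in milliseconds
      print t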
    

    gives 1219 ms

    Scala:

    def timed[T](f: => T) = {
      val t0 = System.currentTimeMillis
      val r = f
      val t1 = System.currentTimeMillis
      println("Took: "+(t1 - t0)+" ms")
      r
    }
    
    timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
    4251 ms
    
    timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
    4224 ms
    
    timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
    2083 ms
    
    timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
    689 ms
    
    // using org.jscience.mathematics.number.LargeInteger from Travis's answer
    timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
    3327 ms
    
    timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
                                              LargeInteger.ONE)(_ times _) }
    361 ms
    

    The 689 ms and 361 ms figures were measured after a few warm-up runs. Both started at about 1000 ms, but they warm up by different amounts. The parallel collections seem to benefit from warm-up significantly more than the non-parallel ones: the non-parallel operations did not get significantly faster than their first runs.
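
    One way to make the warm-up effect visible is a helper that runs the block a few times before recording any timings. The sketch below is only an assumption of how that could look; `WarmTiming`, `measure`, `warmups`, and `runs` are hypothetical names and were not used to produce the numbers above.

```scala
object WarmTiming {
  // Runs `body` `warmups` times untimed, then `runs` times timed.
  // Returns the last result together with the elapsed milliseconds of each timed run.
  def measure[T](warmups: Int, runs: Int)(body: => T): (T, Seq[Long]) = {
    require(warmups >= 0 && runs > 0, "need at least one timed run")
    (1 to warmups).foreach(_ => body) // give the JIT a chance to compile the hot path
    val samples = (1 to runs).map { _ =>
      val t0 = System.nanoTime()
      val r = body
      (r, (System.nanoTime() - t0) / 1000000L)
    }
    (samples.last._1, samples.map(_._2))
  }
}
```

    A call such as `WarmTiming.measure(5, 3) { (BigInt(1) to BigInt(50000)).reduce(_ * _) }` returns the factorial and three timings taken after five untimed runs.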

    The .par (meaning, use parallel collections) seemed to speed up fold more than reduce. I only have 2 cores, but a greater number of cores should see a bigger performance gain.

    So, experimentally, the way to optimize this function is

    a) Use fold rather than reduce

    b) Use parallel collections
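
    Why fold and reduce can be parallelized while foldLeft cannot comes down to associativity: the range can be cut into chunks, each chunk multiplied independently, and the partial products combined afterwards. The sketch below illustrates that idea with standard-library Futures rather than with `.par`, which lives in a separate module from Scala 2.13 onward. It is not how parallel collections are implemented internally, and `ChunkedFactorial`, `fact`, and `chunks` are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ChunkedFactorial {
  // Splits 1..n into roughly `chunks` pieces, multiplies each piece in its own Future,
  // then multiplies the partial products together.
  def fact(n: Int, chunks: Int = 4): BigInt = {
    require(n >= 0 && chunks > 0, "n must be non-negative and chunks positive")
    if (n < 2) BigInt(1)
    else {
      val size = math.max(1, (n + chunks - 1) / chunks) // elements per chunk, rounded up
      val partials = (1 to n).grouped(size).map { group =>
        Future(group.foldLeft(BigInt(1))(_ * _))
      }.toList // forcing the iterator starts all Futures
      Await.result(Future.sequence(partials), 1.minute).foldLeft(BigInt(1))(_ * _)
    }
  }
}
```

    The speed-up, if any, depends on the number of cores and on the chunk count, so it should be measured rather than assumed.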

    Update: Inspired by the observation that breaking the calculation down into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, which is a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, but somehow uses 90% CPU (unlike for BigInt).

      import org.jscience.mathematics.number.LargeInteger
    
      def fact(n: Int) = {
        def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
          case 0 => throw new IllegalArgumentException
          case 1 => seq.head
          case _ => loop {
            val (a, b) = seq.splitAt(seq.length / 2)
            a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
          } 
        }
        loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
      }
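
    For readers without JScience on the classpath, the same pairing scheme can be sketched with the standard BigInt. This is an adaptation under my own naming (`PairwiseFactorial` is hypothetical), not the exact code behind the 615 ms BigInt figure mentioned above.

```scala
object PairwiseFactorial {
  // Repeatedly splits the sequence in half and multiplies the halves element-wise,
  // so the sequence shrinks to about half its length on every pass.
  @scala.annotation.tailrec
  private def loop(seq: IndexedSeq[BigInt]): BigInt = seq.length match {
    case 0 => throw new IllegalArgumentException("empty sequence")
    case 1 => seq.head
    case _ =>
      val (a, b) = seq.splitAt(seq.length / 2)
      // zipAll pads the shorter half with 1, which leaves the product unchanged
      loop(a.zipAll(b, BigInt(1), BigInt(1)).map { case (x, y) => x * y })
  }

  // Like the LargeInteger version, this rejects n = 0 because the sequence is empty
  def fact(n: Int): BigInt = loop((1 to n).map(BigInt(_)))
}
```

    Each pass multiplies numbers of similar magnitude, which is the same property the divide-and-conquer approach relies on.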
    
  • 2021-02-13 07:42

    The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm: Python's long uses Karatsuba multiplication, and Java's BigInteger (which BigInt just wraps) doesn't, at least before JDK 8, which added Karatsuba and Toom-Cook multiplication for large operands.
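
    For reference, Karatsuba multiplication replaces the four half-size multiplications of the schoolbook method with three. The sketch below shows the recurrence on top of BigInt; it is a teaching example with hypothetical names (`KaratsubaSketch`, `Threshold`), not the code used by Python or by any JDK, and real implementations switch over at much larger operand sizes.

```scala
object KaratsubaSketch {
  // Below this bit length, fall back to the built-in multiplication (assumed cut-off)
  private val Threshold = 64

  // Karatsuba on non-negative operands:
  // x = x1 * 2^half + x0, y = y1 * 2^half + y0
  // x * y = z2 * 2^(2*half) + z1 * 2^half + z0, using three recursive products
  private def multiplyNonNegative(x: BigInt, y: BigInt): BigInt = {
    val n = math.max(x.bitLength, y.bitLength)
    if (n <= Threshold) x * y
    else {
      val half = n / 2
      val mask = (BigInt(1) << half) - 1
      val (x1, x0) = (x >> half, x & mask)
      val (y1, y0) = (y >> half, y & mask)
      val z2 = multiplyNonNegative(x1, y1)
      val z0 = multiplyNonNegative(x0, y0)
      val z1 = multiplyNonNegative(x1 + x0, y1 + y0) - z2 - z0
      (z2 << (2 * half)) + (z1 << half) + z0
    }
  }

  // Handles signs separately so the recursion only ever sees non-negative values
  def multiply(x: BigInt, y: BigInt): BigInt = {
    val product = multiplyNonNegative(x.abs, y.abs)
    if (x.signum * y.signum < 0) -product else product
  }
}
```

    The saving only pays off once operands are large, which is why it matters for a product like 50000! and not for everyday arithmetic.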

    The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:

    import org.jscience.mathematics.number.LargeInteger
    
    (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)
    

    This is faster than the Python solution on my machine.


    Update: I've written some quick benchmarking code using Caliper in response to Luigi Plingi's answer, which gives the following results on my (quad core) machine:

                  benchmark   ms linear runtime
             BigIntFoldLeft 4774 ==============================
                 BigIntFold 4739 =============================
               BigIntReduce 4769 =============================
          BigIntFoldLeftPar 4642 =============================
              BigIntFoldPar  500 ===
            BigIntReducePar  499 ===
       LargeIntegerFoldLeft 3042 ===================
           LargeIntegerFold 3003 ==================
         LargeIntegerReduce 3018 ==================
    LargeIntegerFoldLeftPar 3038 ===================
        LargeIntegerFoldPar  246 =
      LargeIntegerReducePar  260 =
    

    I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
