Akka with Frege running slower than Scala counterpart

As an exercise, I took these Scala and Java examples of Akka to port to Frege. While it works fine, it runs slower(11s) than Scala(540ms) counterpart.

module mmhelloworld.akkatutorialfregecore.Pi where
import mmhelloworld.akkatutorialfregecore.Akka

data PiMessage = Calculate | 
                Work {start :: Int, nrOfElements :: Int} |
                Result {value :: Double} | 
                PiApproximation {pi :: Double, duration :: Duration}

data Worker = private Worker where
    calculatePiFor :: Int -> Int -> Double
    calculatePiFor !start !nrOfElements = loop start nrOfElements 0.0 f where
        loop !curr !n !acc f = if n == 0 then acc
                               else loop (curr + 1) (n - 1) (f acc curr) f
        f !acc !i = acc + (4.0 * fromInt (1 - (i `mod` 2) * 2) / fromInt (2 * i + 1))

    onReceive :: Mutable s UntypedActor -> PiMessage -> ST s ()
    onReceive actor Work{start=start, nrOfElements=nrOfElements} = do
        sender <- actor.sender
        self <- actor.getSelf
        sender.tellSender (Result $ calculatePiFor start nrOfElements) self 

data Master = private Master {
    nrOfWorkers :: Int,
    nrOfMessages :: Int,
    nrOfElements :: Int,
    listener :: MutableIO ActorRef,
    pi :: Double,
    nrOfResults :: Int,
    workerRouter :: MutableIO ActorRef,
    start :: Long } where

    initMaster :: Int -> Int -> Int -> MutableIO ActorRef -> MutableIO UntypedActor -> IO Master
    initMaster nrOfWorkers nrOfMessages nrOfElements listener actor = do
        props <- Props.forUntypedActor Worker.onReceive
        router <- RoundRobinRouter.new nrOfWorkers
        context <- actor.getContext
        workerRouter <- props.withRouter router >>= (\p -> context.actorOf p "workerRouter")
        now <- currentTimeMillis ()
        return $ Master nrOfWorkers nrOfMessages nrOfElements listener 0.0 0 workerRouter now

    onReceive :: MutableIO UntypedActor -> Master -> PiMessage -> IO Master
    onReceive actor master Calculate = do
        self <- actor.getSelf
        let tellWorker start = master.workerRouter.tellSender (work start) self
            work start = Work (start * master.nrOfElements) master.nrOfElements
        forM_ [0 .. master.nrOfMessages - 1] tellWorker
        return master
    onReceive actor master (Result newPi) = do
        let (!newNrOfResults, !pi) = (master.nrOfResults + 1, master.pi + newPi)
        when (newNrOfResults == master.nrOfMessages) $ do
            self <- actor.getSelf
            now <- currentTimeMillis ()
            duration <- Duration.create (now - master.start) TimeUnit.milliseconds
            master.listener.tellSender (PiApproximation pi duration) self
            actor.getContext >>= (\context -> context.stop self)
        return master.{pi=pi, nrOfResults=newNrOfResults}

data Listener = private Listener where
    onReceive :: MutableIO UntypedActor -> PiMessage -> IO ()
    onReceive actor (PiApproximation pi duration) = do
        println $ "Pi approximation: " ++ show pi
        println $ "Calculation time: " ++ duration.toString
        actor.getContext >>= ActorContext.system >>= ActorSystem.shutdown

calculate nrOfWorkers nrOfElements nrOfMessages = do
    system <- ActorSystem.create "PiSystem"
    listener <- Props.forUntypedActor Listener.onReceive >>= flip system.actorOf "listener"
    let constructor = Master.initMaster nrOfWorkers nrOfMessages nrOfElements listener
        newMaster = StatefulUntypedActor.new constructor Master.onReceive
    factory <- UntypedActorFactory.new newMaster
    masterActor <- Props.fromUntypedFactory factory >>= flip system.actorOf "master"
    masterActor.tell Calculate
    getLine >> return () --Not to exit until done

main _ = calculate 4 10000 10000

Am I doing something wrong with Akka or is it something to do with laziness in Frege for being slow? For example, when I initially had fold(strict fold) in place of loop in Worker.calculatePiFor, it took 27s.

Dependencies:

Akka native definitions for Frege: Akka.fr
Java helper to extend Akka classes since we cannot extend a class in Frege: Actors.java

I am not exactly familiar with Actors, but assuming that the tightest loop is indeed loop you could avoid passing function f as argument.

For one, applications of passed functions cannot take advantage of the strictness of the actual passed function. Rather, code generation must assume conservatively that the passed function takes its arguments lazily and returns a lazy result.

Second, in our case you use f really just once here, so one can inline it. (This is how it is done in the scala code in the article you linked.)

Look at the code generated for the tail recursion in the following sample code that mimics yours:

test b c = loop 100 0 f 
   where 
      loop 0 !acc f = acc
      loop n !acc f = loop (n-1) (acc + f (acc-1) (acc+1)) f   -- tail recursion
      f x y = 2*x + 7*y

We get there:

// arg2$f is the accumulator
arg$2 = arg$2f + (int)frege.runtime.Delayed.<java.lang.Integer>forced(
      f_3237.apply(PreludeBase.INum_Int._minusƒ.apply(arg$2f, 1)).apply(
            PreludeBase.INum_Int._plusƒ.apply(arg$2f, 1)
          ).result()
    );

You see here that f is called lazily which causes all the argument expressios to also be computed lazily. Note the number of method calls this requires! In your case the code should still be something like:

(double)Delayed.<Double>forced(f.apply(acc).apply(curr).result())

This means, two closures are build with the boxed values acc and curr and then the result is computed, i.e. the function f gets called with the unboxed arguments, and the result gets again boxed, just to get unboxed again (forced) for the next loop.

Now compare the following, where we just do not pass f but call it directly:

test b c = loop 100 0 
    where 
      loop 0 !acc = acc
      loop n !acc = loop (n-1) (acc + f (acc-1) (acc+1)) 
      f x y = 2*x + 7*y

We get:

arg$2 = arg$2f + f(arg$2f - 1, arg$2f + 1);

Much better! Finally, in the case above we can do without a function call at all:

      loop n !acc = loop (n-1) (acc + f) where
        f = 2*x + 7*y
        x = acc-1
        y = acc+1

And this gets:

final int y_3236 = arg$2f + 1;
final int x_3235 = arg$2f - 1;
...
arg$2 = arg$2f + ((2 * x_3235) + (7 * y_3236));

Please try this out and let us know what happens. The main boost in performance should come from not passing f, whereas the inlining will probably be done in the JIT anyway.

The additional cost with fold is probably because you also had to create some list before applying it.

来源：https://stackoverflow.com/questions/17256979/akka-with-frege-running-slower-than-scala-counterpart

标签

scala

akka

frege