Haskell Thrift library 300x slower than C++ in performance test

我在风中等你 2021-01-30 05:04

I'm building an application which contains two components: a server written in Haskell, and a client written in Qt (C++). I'm using Thrift for communication between them, and I wonder why it is so slow: in my performance test, the Haskell server came out roughly 300x slower than C++.

5 Answers
  • 2021-01-30 05:48

    This is fairly consistent with what user13251 says: the Haskell implementation of Thrift performs a large number of small reads.

    E.g., in Thrift.Protocol.Binary:

    readI32 p = do
        bs <- tReadAll (getTransport p) 4
        return $ Data.Binary.decode bs
    

    Let's ignore the other odd bits and just focus on that for now. This says: "to read a 32-bit int, read 4 bytes from the transport, then decode this lazy ByteString."

    The transport method reads exactly 4 bytes using the lazy ByteString hGet. hGet allocates a 4-byte buffer and then uses hGetBuf to fill it. hGetBuf might use an internal buffer, depending on how the Handle was initialized.

    So there might be some buffering. Even so, this means the Haskell Thrift library performs the read/decode cycle for each integer individually, allocating a small memory buffer each time. Ouch!

    I don't really see a way to fix this without modifying the Thrift library to perform larger ByteString reads.
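To make the cost concrete, here is a minimal sketch of what such a batched reader could look like (readI32s and demo are illustrative names, not part of the Thrift API): pull all n * 4 bytes from the transport in a single read, then decode every Int32 out of that one buffer using Data.Binary's Get monad, instead of one read/allocate/decode cycle per integer.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getInt32be)
import Data.Binary.Put (runPut, putInt32be)
import Control.Monad (replicateM)
import Data.Int (Int32)

-- Hypothetical batched reader: decode n big-endian Int32s (the Thrift
-- binary protocol is big-endian) out of one buffer fetched in one read.
readI32s :: Int -> BL.ByteString -> [Int32]
readI32s n = runGet (replicateM n getInt32be)

-- Round-trip check: encode three Int32s, then decode them back.
demo :: [Int32]
demo = readI32s 3 (runPut (mapM_ putInt32be [1, 2, 3]))  -- demo == [1, 2, 3]
```

The point is that the transport is hit once for the whole batch; the per-integer work is then pure decoding over an already-resident buffer.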

    Then there are other oddities in the Thrift implementation, such as using a type class for a structure of methods. While type classes look similar to a structure of methods, can act like one, and are sometimes even implemented as one, they should not be treated as such. See the "Existential Typeclass" antipattern:

    • http://lukepalmer.wordpress.com/2010/01/24/haskell-antipattern-existential-typeclass/

    One odd part of the test implementation:

    • generating a list of Ints only to immediately convert them to Int32s and then immediately pack them into a Vector of Int32s. Generating the vector directly would be sufficient and faster.

    Though, I suspect, this is not the primary source of performance issues.

  • 2021-01-30 05:53

    The Haskell implementation of the basic Thrift server you're using is threaded internally, but you didn't compile it to use multiple cores.

    To redo the test using multiple cores, add -threaded and -rtsopts to the command line you use to compile the Haskell program, then run the final binary like ./Main +RTS -N4 -RTS & (note that runtime-system flags such as -N4 must be passed between +RTS and -RTS), where 4 is the number of cores to use.
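Assuming a plain ghc build with a Main.hs entry point (adjust accordingly for cabal or stack), the build-and-run sequence sketched above looks like this:

```shell
# Build with the threaded runtime and allow RTS flags at run time
ghc -O2 -threaded -rtsopts Main.hs

# Run the server on 4 capabilities; RTS flags go between +RTS and -RTS
./Main +RTS -N4 -RTS &
```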

  • 2021-01-30 06:00

    You should take a look at Haskell profiling methods to find what resources your program uses/allocates and where.

    The chapter on profiling in Real World Haskell is a good starting point.
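As a sketch of how to get started (again assuming a plain ghc build of Main.hs; cabal and stack have their own profiling flags), a time-and-allocation profile can be produced like this:

```shell
# Compile with profiling enabled and automatic cost centres on all bindings
ghc -O2 -prof -fprof-auto -rtsopts Main.hs

# Run with time/allocation profiling; the report is written to Main.prof
./Main +RTS -p -RTS
```

The resulting .prof file shows which functions dominate runtime and allocation, which is exactly what you need to confirm or rule out the small-reads theory from the other answers.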

  • 2021-01-30 06:01

    Everyone is pointing out that the culprit is the Thrift library, but I'll focus on your code (and where I can help you gain some speed).

    Using a simplified version of your code, where you calculate itemsv:

    testfunc mtsize =  itemsv
      where size = i32toi $ fromJust mtsize
            item i = Item (Just $ Vector.fromList $ map itoi32 [i..100])
            items = map item [0..(size-1)]
            itemsv = Vector.fromList items 
    

    First, a lot of intermediate data is created in item i. Due to laziness, those small, fast-to-compute vectors become delayed thunks when we could have them right away.

    Two carefully placed $! operators, which force strict evaluation:

     item i = Item (Just $! Vector.fromList $! map itoi32 [i..100])
    

    will give you a 25% decrease in runtime (for sizes 1e5 and 1e6).

    But there is a more problematic pattern here: you generate a list only to convert it into a vector, instead of building the vector directly.

    Look at those last two lines: you create a list -> map a function over it -> transform it into a vector.

    Well, vectors are very similar to lists, so you can do the same thing directly! Generate a vector -> Vector.map over it, and you're done. There's no need to convert a list into a vector, and mapping over a vector is usually faster than over a list!

    So you can get rid of items and rewrite itemsv as follows:

      -- enumFromN takes a start and a count, so use size (not size-1) to get [0..size-1]
      itemsv = Vector.map item $ Vector.enumFromN 0 size
    

    Reapplying the same logic to item i, we eliminate all lists.

    testfunc3 mtsize = itemsv
       where
          size = i32toi $! fromJust mtsize
          -- enumFromN takes a start and a count; [i..100] has 101-i elements
          item i = Item (Just $! Vector.enumFromN (i :: Int32) (101 - fromIntegral i))
          itemsv = Vector.map item $ Vector.enumFromN 0 size
    

    This has a 50% decrease over the initial runtime.

  • 2021-01-30 06:03

    I don't see any reference to buffering in the Haskell server. In C++, if you don't buffer, you incur one system call for every vector/list element. I suspect the same thing is happening in the Haskell server.

    I don't see a buffered transport in Haskell directly. As an experiment, you may want to change both the client and server to use a framed transport. Haskell does have a framed transport, and it is buffered. Note that this will change the wire layout.

    As a separate experiment, you may want to turn off buffering for C++ and see if the performance numbers are comparable.
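One more low-effort experiment on the Haskell side (a sketch, not Thrift's actual API, and assuming you can reach the connection's underlying Handle before the transport uses it): switch the Handle to block buffering so small reads and writes get coalesced rather than each hitting the OS.

```haskell
import System.IO

-- Illustrative helper (not part of the Thrift library): enable block
-- buffering on a Handle so each small read/write no longer turns into
-- its own system call. 'Nothing' lets the runtime pick the buffer size.
enableBlockBuffering :: Handle -> IO ()
enableBlockBuffering h = hSetBuffering h (BlockBuffering Nothing)
```

This only helps if the per-element system calls are really the bottleneck; combined with the framed-transport experiment above it should make that clear either way.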
