Will multi threading provide any performance boost?

说谎 2021-02-05 13:49

I am new to programming in general so please keep that in mind when you answer my question.

I have a program that takes a large 3D array (1 billion elements) and sums up

19 answers
  •  挽巷 (original poster)
    2021-02-05 14:18

    How does your code work? Does it go like this?

    for each row: add up the values
    for each column: add up the values
    for each stack: add up the values
    

    If so, you might want to read up on "locality of reference". Depending on how your data is stored, you might find that while you're doing the stacks, a whole cache line has to be pulled in for each value, because the values are nowhere near each other in memory. In fact, with a billion values, you could be pulling things all the way from disk. Sequential access with a long stride (the distance between successive values) is the worst possible use of cache. Try profiling: if adding up the stacks takes longer than adding up the rows, this is almost certainly why.
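To make the stride concrete, here is a minimal sketch in C. The question does not say how the array is laid out, so the layout below (a flat array of 4-byte `int` with x varying fastest) and the names `cube_index`, `sum_along_x`, and `sum_along_z` are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout (not stated in the question): a flat n*n*n array in
   which x varies fastest, so element (x, y, z) is at (z*n + y)*n + x. */
static size_t cube_index(size_t n, size_t x, size_t y, size_t z)
{
    return (z * n + y) * n + x;
}

/* Sum one "row": n consecutive elements, stride 1. With 4-byte values,
   a 64-byte cache line serves 16 of these reads. */
long long sum_along_x(const int *cube, size_t n, size_t y, size_t z)
{
    assert(cube != NULL);
    long long total = 0;
    for (size_t x = 0; x < n; x++)
        total += cube[cube_index(n, x, y, z)];
    return total;
}

/* Sum one "stack": stride n*n elements. For n = 1000 and 4-byte values
   that is 4,000,000 bytes between successive reads, so every read
   touches a different cache line. */
long long sum_along_z(const int *cube, size_t n, size_t x, size_t y)
{
    assert(cube != NULL);
    long long total = 0;
    for (size_t z = 0; z < n; z++)
        total += cube[cube_index(n, x, y, z)];
    return total;
}
```

Both functions do the same number of additions; only the distance between successive reads differs. Timing the two over the full cube is one way to check whether the stride is what hurts.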

    I think you could be saturating the memory bus(*), in which case multithreading would only help if the Core 2 Quad uses different buses for different cores. But if you're not saturating the bus bandwidth, you still can't get the best performance this way even once you multi-thread. You'll have 4 cores spending all their time stalled on cache misses instead of one.

    If you are memory cache bound, then your goal should be to visit each page/line of memory as few times as possible. So I'd try things like running over the data once, adding each value to three different totals as you go. If that runs faster on a single core, then we're in business. The next step is that with a 1000x1000x1000 cube, you have 3 million totals on the go. That doesn't fit in cache either, so you have to worry about the same cache miss problems writing as you do reading.

    You want to make sure that as you run along a row of 1000 adjacent values in RAM, adding to the row total that they all share, you're also adding to adjacent totals for the columns and stacks (which they don't share). So the "square" of column totals should be stored in the appropriate way, as should the "square" of stack totals. That way you deal with 1000 of your billion values just by pulling about 12k of memory into cache (4k for 1000 values, plus 4k for 1000 column totals, plus 4k for 1000 stack totals). As against that, you're doing more stores than you would by concentrating on 1 total at a time (which could therefore be kept in a register).
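A minimal sketch of that single pass, under assumptions: the cube is a flat array of 4-byte `int` with x varying fastest, and the function name `sum_all_axes` and the index conventions for the three totals arrays are hypothetical, chosen so that neighbouring values update neighbouring totals. The totals are `long long` to avoid overflow, which makes each run of 1000 totals 8k rather than the 4k estimated above.

```c
#include <assert.h>
#include <stddef.h>

/* One pass over an n*n*n cube stored with x varying fastest (assumed
   layout). Each output array holds n*n totals:
     row_totals[z*n + y]   sum over x
     col_totals[z*n + x]   sum over y
     stack_totals[y*n + x] sum over z */
void sum_all_axes(const int *cube, size_t n,
                  long long *row_totals,
                  long long *col_totals,
                  long long *stack_totals)
{
    assert(cube && row_totals && col_totals && stack_totals);
    for (size_t i = 0; i < n * n; i++)
        row_totals[i] = col_totals[i] = stack_totals[i] = 0;

    const int *p = cube;  /* walks the cube in memory order */
    for (size_t z = 0; z < n; z++) {
        for (size_t y = 0; y < n; y++) {
            long long row = 0;                        /* shared by the whole row */
            long long *col = col_totals + z * n;      /* n adjacent totals */
            long long *stack = stack_totals + y * n;  /* n adjacent totals */
            for (size_t x = 0; x < n; x++) {
                int v = *p++;
                row += v;
                col[x] += v;
                stack[x] += v;
            }
            row_totals[z * n + y] = row;
        }
    }
}
```

The inner loop reads one contiguous run of values and updates two contiguous runs of totals, while the row total stays in a local variable. Whether this beats three separate passes on a given machine is something to measure, not assume.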

    So I don't promise anything, but I think it's worth looking at order of memory access, whether you multi-thread or not. If you can do more CPU work while accessing only a relatively small amount of memory, then you'll speed up the single-threaded version but also put yourself in much better shape for multi-threading, since the cores share a limited cache, memory bus, and main RAM.

    (*) Back of envelope calculation: in random reviews off the internet, the highest estimated FSB bandwidth for Core2 processors I've found so far is an Extreme at 12GB/s (with 2 channels at 4x199MHz each). Cache line size is 64 bytes, which is less than your stride. So summing a column or stack the bad way, grabbing 64 bytes per value, would only saturate the bus if it was doing 200 million values per second. I'm guessing it's nothing like this fast (10-15 seconds for the whole thing), or you wouldn't be asking how to speed it up.
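Spelled out, the estimate is just the quoted bandwidth divided by the cache line size, followed by the time one strided pass over a billion values would take at that rate. The 12GB/s figure is the rough one from the reviews mentioned above, not a measured value.

```latex
\[
\frac{12 \times 10^{9}\ \text{bytes/s}}{64\ \text{bytes per value}}
  \approx 1.9 \times 10^{8}\ \text{values/s}
  \approx 200\ \text{million values/s}
\]
\[
\frac{10^{9}\ \text{values}}{1.9 \times 10^{8}\ \text{values/s}}
  \approx 5.3\ \text{s per strided pass}
\]
```

Two or three such passes land in the 10-15 second range mentioned above.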

    So my first guess was probably way off. Unless your compiler or CPU has inserted some very clever pre-fetching, a single core cannot be using 2 channels and 4 simultaneous transfers per cycle. For that matter, 4 cores couldn't use 2 channels and 4 simultaneous transfers. The effective bus bandwidth for a series of requests might be much lower than the physical limit, in which case you would hope to see good improvements from multi-threading simply because you have 4 cores asking for 4 different cache lines, all of which can be loaded simultaneously without troubling the FSB or the cache controller. But the latency is still the killer, and so if you can load less than one cache line per value summed, you'll do much better.
