Why is there a huge performance hit in 2048x2048 versus 2047x2047 array multiplication?

前端未结

I am doing some matrix multiplication benchmarking, as previously mentioned in "Why is MATLAB so fast in matrix multiplication?"

Now I've got another issue, when mu

10 answers

情深已故

2020-11-29 17:59

Given that the time drops again at larger sizes, wouldn't cache conflicts be the more likely cause, especially since the problematic matrix sizes are powers of 2? I am no expert on caching issues, but there is excellent info on cache-related performance issues here.
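To make the power-of-two conflict idea concrete, here is a rough sketch of which cache sets are hit when walking down one column of a row-major matrix. The cache geometry (64-byte lines, 64 sets) and the helper name `sets_touched` are assumptions chosen for illustration, not the parameters of any particular CPU.

```python
def sets_touched(row_elems, elem_bytes=4, line_bytes=64, num_sets=64, rows=64):
    """Return the cache-set indices hit when reading column 0 of `rows` consecutive rows.

    Assumes a row-major array starting at address 0 and a simple
    set-associative cache where set = (address // line_bytes) % num_sets.
    """
    stride = row_elems * elem_bytes  # bytes between vertically adjacent elements
    return {(r * stride // line_bytes) % num_sets for r in range(rows)}


if __name__ == "__main__":
    for n in (2047, 2048):
        print(n, "->", len(sets_touched(n)), "distinct sets")
```

Under these assumptions, a 2048-element row (8192 bytes) sends every element of the column to the same set, so those lines compete for that set's few ways. A 2047-element row makes each step 4 bytes shorter, so the column drifts across more than one set.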

感情败类

2020-11-29 18:00

Louis Brandy wrote two blog posts analyzing exactly this issue:

"More Cache Craziness" and "Computational Performance - A beginners case study", with some interesting statistics and attempts to explain the behavior in more detail. It does indeed come down to cache size limitations.

青春惊慌失措

2020-11-29 18:02

As you are accessing the matice2 array vertically, it will be swapped in and out of the cache a lot more. If you mirror the array diagonally, so that you can access it using [k,m] instead of [m,k], the code will run a lot faster.

I tested this for 1024x1024 matrices, and it is about twice as fast. For 2048x2048 matrices it's about ten times faster.
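A minimal sketch of the transposition idea, written in Python only to show the index order. `matice2` is the name used above; `matice1`, the function names, and the tiny test size are assumptions. Python lists do not have the contiguous layout of a native 2D array, so this shows the access pattern rather than reproducing the timings.

```python
def multiply_column_walk(matice1, matice2, n):
    """Naive product: the inner loop reads matice2[m][k], stepping down a column."""
    result = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            total = 0
            for m in range(n):
                total += matice1[j][m] * matice2[m][k]
            result[j][k] = total
    return result


def multiply_row_walk(matice1, matice2, n):
    """Same product, but matice2 is transposed first so the inner loop reads [k][m]."""
    # Mirror matice2 along its diagonal once, up front (O(n^2) extra work).
    mirrored = [[matice2[m][k] for m in range(n)] for k in range(n)]
    result = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            total = 0
            for m in range(n):
                total += matice1[j][m] * mirrored[k][m]
            result[j][k] = total
    return result


if __name__ == "__main__":
    n = 3
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    # Both orderings compute the same product.
    print(multiply_column_walk(a, b, n) == multiply_row_walk(a, b, n))
```

The one-off transpose costs O(n^2), which is small next to the O(n^3) multiplication, so in a language with contiguous arrays it is usually worth paying for the row-wise inner loop.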

梦毁少年i

2020-11-29 18:04

This may have to do with the size of your CPU cache. If 2 rows of the matrix do not fit, then you will lose time swapping in elements from RAM. The extra 4095 elements may be just enough to prevent rows from fitting.

In your case, 2 rows of a 2047x2047 matrix fit within 16 KB of memory (assuming 32-bit types). For example, if you have an L1 cache (closest to the CPU on the bus) of 64 KB, then you can fit at least 4 rows (of 2047 * 32 bits) into the cache at once. With the longer rows, if any padding is required that pushes a pair of rows beyond 16 KB, then things start to get messy. Also, each time you 'miss' the cache, swapping in data from another cache or main memory delays things.
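The figures above can be checked with a few lines of arithmetic. This assumes 4-byte (32-bit) elements as stated, and the 64 KB L1 size is this answer's own hypothetical; the variable names are made up for illustration.

```python
ELEM_BYTES = 4            # assumption from the text: 32-bit element type
L1_BYTES = 64 * 1024      # hypothetical 64 KB L1 cache from the example


def row_bytes(n):
    """Bytes occupied by one row of an n x n matrix."""
    return n * ELEM_BYTES


extra_elements = 2048 * 2048 - 2047 * 2047   # elements gained going from 2047 to 2048
two_rows_2047 = 2 * row_bytes(2047)          # 16376 bytes, just under 16 KB
two_rows_2048 = 2 * row_bytes(2048)          # 16384 bytes, exactly 16 KB
rows_in_l1_2047 = L1_BYTES // row_bytes(2047)  # whole rows that fit in the assumed L1

if __name__ == "__main__":
    print(extra_elements, two_rows_2047, two_rows_2048, rows_in_l1_2047)
```

So two 2047-element rows sit 8 bytes under 16 KB, while two 2048-element rows fill 16 KB exactly, leaving no room for any padding.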

My guess is that the variance in run times you're seeing with the different sized matrices is affected by how effectively the operating system can make use of the available cache (and some combinations are just problematic). Of course this is all a gross simplification on my part.
