The answer depends a little bit on exactly how the matrix
is defined. In a fully dynamically allocated array, you'd have:
T **matrix;
matrix = new T*[n];
for(int i = 0; i < n; i++)
{
    matrix[i] = new T[m];
}
So, every access to matrix[j]
requires an extra memory lookup to fetch the row pointer. If you put the j
loop on the outside, the inner loop can re-use the pointer for matrix[j]
for the whole inner loop.
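A minimal sketch of the pointer-hoisting idea, using made-up sizes and values (a 4x4 matrix of doubles) just to show the access pattern:

```cpp
#include <cstddef>

// Sum a dynamically allocated n x m matrix with the row pointer
// hoisted out of the inner loop: matrix[i] is dereferenced once per
// row instead of once per element.
double sum_rows_first()
{
    const std::size_t n = 4, m = 4;       // illustrative sizes only
    double **matrix = new double *[n];
    for (std::size_t i = 0; i < n; i++)
    {
        matrix[i] = new double[m];
        for (std::size_t j = 0; j < m; j++)
            matrix[i][j] = double(i * m + j);   // fill with 0..15
    }

    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++)
    {
        double *row = matrix[i];          // one pointer load per row
        for (std::size_t j = 0; j < m; j++)
            sum += row[j];                // contiguous reads within the row
    }

    for (std::size_t i = 0; i < n; i++)
        delete[] matrix[i];
    delete[] matrix;
    return sum;                           // 0 + 1 + ... + 15 = 120
}
```

With the other loop order, the pointer load cannot be hoisted, because a different row pointer is needed on every iteration of the inner loop.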
If the matrix is a simple 2D array:
T matrix[n][m];
then matrix[j]
is simply an address calculation - a multiplication by 1024 * sizeof(T)
- which the optimised code can turn into adding 1024 * sizeof(T)
to a running offset each iteration, so it should be relatively fast either way.
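To illustrate that a plain 2D array is one contiguous block where indexing is pure arithmetic, here is a sketch (sizes are hypothetical, and smaller than 1024 just to keep it cheap):

```cpp
#include <cstddef>

// For a plain 2D array, matrix[i][j] involves no pointer load from
// memory: the element address is base + (i * m + j) * sizeof(int).
bool static_array_is_contiguous()
{
    const std::size_t n = 8, m = 8;       // illustrative sizes only
    int matrix[n][m];

    char *base = reinterpret_cast<char *>(&matrix[0][0]);
    char *elem = reinterpret_cast<char *>(&matrix[3][5]);
    // Stepping the first index by one advances m * sizeof(int) bytes.
    return static_cast<std::size_t>(elem - base) == (3 * m + 5) * sizeof(int);
}
```

The compiler computes that offset with a shift/add rather than chasing a pointer, which is why both loop orders pay the same indexing cost here - the difference that remains is purely the cache behaviour discussed next.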
On top of that, we have cache locality factors. Caches have "lines" of data, typically 32 to 128 bytes per line. So if your code reads address X
, the cache will load the 32 to 128 bytes surrounding X
. So if the NEXT thing you need is only sizeof(T)
bytes forward from the current location, it's highly likely already in the cache [and modern processors also detect that you are reading memory in a regular pattern in a loop, and pre-load the data].
In the case of the j
inner loop, each iteration reads from a location sizeof(T)*1024
bytes away from the previous one [or possibly an even greater distance if the matrix is dynamically allocated]. This means the data loaded into the cache will not be useful on the next iteration, because the next read is not within the next 32 to 128 bytes.
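The two access patterns can be written side by side. This sketch assumes a hypothetical 1024 x 1024 matrix of doubles stored as one flat block; both functions compute the same sum, and the only difference is the stride between consecutive reads:

```cpp
#include <cstddef>
#include <vector>

const std::size_t N = 1024;   // assumed matrix dimension

double sum_row_major(const std::vector<double> &a)
{
    double s = 0.0;
    for (std::size_t i = 0; i < N; i++)
        for (std::size_t j = 0; j < N; j++)
            s += a[i * N + j];   // stride sizeof(double): mostly same cache line
    return s;
}

double sum_col_major(const std::vector<double> &a)
{
    double s = 0.0;
    for (std::size_t j = 0; j < N; j++)
        for (std::size_t i = 0; i < N; i++)
            s += a[i * N + j];   // stride N * sizeof(double): new cache line every read
    return s;
}
```

Timing these two on a real machine is the easiest way to see the effect; the results are identical, but the row-major version is typically several times faster at this size because almost every read in the column-major version misses the cache.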
And finally, it's entirely possible that the first loop is better optimised, thanks to SSE instructions or similar, which allow the calculation to run even faster. But this is probably marginal for such a large matrix, as the performance is highly memory-bound at this size.