Padding array manually | 易学教程

问题

I am trying to understand 9 point stencil's algorithm from this book , the logic is clear to me , but the calculation of WIDTHP macro is what i am unable to understand, here is the breif code (original code is more than 300 lines length!!):

#define PAD64 0
#define WIDTH 5900
#if PAD64
#define WIDTHP ((((WIDTH*sizeof(REAL))+63)/64)*(64/sizeof(REAL)))
#else
#define WIDTHP WIDTH
#endif
#define HEIGHT 10000

REAL *fa = (REAL *)malloc(sizeof(REAL)*WIDTHP*HEIGHT);
REAL *fb = (REAL *)malloc(sizeof(REAL)*WIDTHP*HEIGHT);

original array is 5900 X 10000, but if i define PAD64 , the array becomes 5915.75 X 10000

Though so far i can guess that the author is trying to align & pad array to 64 byte boundary. But array returned by malloc is usually aligned(& padded) , also, the posix_memalign gives you a chunk of memory that is guaranteed to have the requested alignment , we can also use

__attribute__((align(64)))

what impact does this WIDTHP can make on my code's performance?

回答1:

I was going to put this in as a comment to unwind's answer because he's right. But perhaps I can explain more clearly, albeit in more characters than will fit in a comment.

When I do the math, I get 5904 reals, which is 23616 bytes, which is 396 cache lines for 64 byte cache lines. It is the bytes, rather than the number of elements which must be a multiple of 64.

As to why you want to pad the value of width, lets look at a smaller example. Let's pretend we had a "cache line" that holds 10 letter and that we have an "array" with a width of 8 letters and height of 4. Now since our hypothetical array is in C and C is row major, the array will look something like this: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD

but what does it look like when it is arranged in cache lines, since those are 10 letters long: AAAAAAAABB BBBBBBCCCC CCCCDDDDDD DD

Not good. Only the first row of the array is aligned. But if we pad width by two spaces, we get this in cache: AAAAAAAA__ BBBBBBBB__ CCCCCCCC__ DDDDDDDD__

which is what we want. Now we can have a nested loop like

for i = 1 to height
   for j = 1 to width

and know that every time we start to work on the j loop, the data we need will be aligned.

Oh, and yes, they really should do something to make sure that the first element of the array is aligned. 'attribute((align(64)))' won't work because the arrays are being allocated dynamically but they could have used posix_memalign instead of malloc.

回答2:

The idea is that each row of the matrix (or column, if it's treated as a column-major matrix) can be aligned to the start of a new cache line, by adding padding to the end of the line. Exactly what impact this has depends of course a lot on the access pattern, but in general cache-friendliness can be quite important for intensely number-crunching code.

Also, the computation is integer, so the result is certainly not 5915.75, that doesn't make sense.