Time performance when permuting and casting double to float

I have some big arrays passed from MATLAB to C++ (so I have to take them as they are) that need casting and permuting (row-major vs. column-major issues).

The arr

3 Answers

滥情空心

2021-01-05 07:40
The problem in this example is cache locality. Looking at the assignment, the fastest-changing index is j but this has the largest effect on the address of the write in the target array:
```
img[i + k*size_proj[1] + j*size_proj[0] * size_proj[1]] = 
```
Notice that j is multiplied by two constants, so every increment of j moves the write by size_proj[0] * size_proj[1] elements and is likely to land on a new cache line.
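To make the stride argument concrete, here is a small sketch that computes how far the write position moves when each loop index goes up by one. The names `WriteStrides`, `write_strides` and `kFloatsPerLine` are made up for this illustration, and the 64-byte cache line is an assumption (common, but not universal).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: distance, in elements, between consecutive writes
// to img[i + k*size1 + j*size0*size1] when one index is incremented by 1.
struct WriteStrides {
    std::size_t i;  // stride when i changes
    std::size_t k;  // stride when k changes
    std::size_t j;  // stride when j changes
};

WriteStrides write_strides(std::size_t size0, std::size_t size1) {
    assert(size0 > 0 && size1 > 0);
    return WriteStrides{1, size1, size0 * size1};
}

// Number of floats per cache line, ASSUMING a 64-byte line.
constexpr std::size_t kFloatsPerLine = 64 / sizeof(float);
```

For a 512 x 512 slice, for example, `write_strides(512, 512).j` is 262144 elements, far more than fit in one cache line, whereas the stride for i is a single element.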

The solution in this case is to invert the order of the loops:
```
    const auto K = size_proj[0];
    const auto I = size_proj[1];
    const auto J = size_proj[2];
    for (int j = 0; j < J; j++)
        for (int i = 0; i < I; i++)
            for (int k = 0; k < K; k++)
                img[i + k * I  + j * K * I] = (float)imgaux[k + i * K + j * K * I];
```
Which (on my laptop) brings us down from:
```
Time permuting and casting the input 4.416232
```
to:
```
Time permuting and casting the input 0.844341
```
Which I think you'll agree is something of an improvement.
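If you want to reproduce this kind of measurement yourself, a minimal timing sketch could look like the following. The helper `time_seconds` and the wrapper `permute_cast` are hypothetical names of mine, and I use `std::size_t` for the dimensions rather than whatever type the original code uses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `work` once and returns elapsed wall-clock seconds.
template <typename Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// The reordered permute-and-cast loop, wrapped in a function so it can be timed.
// K, I, J correspond to size_proj[0], size_proj[1], size_proj[2].
void permute_cast(const double* imgaux, float* img,
                  std::size_t K, std::size_t I, std::size_t J) {
    assert(imgaux != nullptr && img != nullptr);
    for (std::size_t j = 0; j < J; j++)
        for (std::size_t i = 0; i < I; i++)
            for (std::size_t k = 0; k < K; k++)
                img[i + k * I + j * K * I] =
                    static_cast<float>(imgaux[k + i * K + j * K * I]);
}
```

Usage would be something like `double t = time_seconds([&] { permute_cast(in, out, K, I, J); });`. The numbers you get will depend on compiler flags, array size and hardware, so compare both loop orders on the same machine and build.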
感动是毒

2021-01-05 07:41
In terms of the algorithm you're using, I think you're always going to end up with three nested loops.

Two things to think about:
- In your innermost loop, there are some values that you recalculate on every iteration even though they don't change. The compiler may hoist them for you, but maybe not, so try moving them to the outermost scope where they are valid.
  - k * size_proj[1]
  - i * size_proj[0]
  - size_proj[0] * size_proj[1] (and j * size_proj[0] * size_proj[1] is used twice)
- Think about how your arrays are laid out in memory. Is there a way you can reorder your loops so that you're reading and writing contiguous regions of memory as much as possible? You'll get fewer cache misses and better performance that way.
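Both points can be combined in a sketch like the one below: the invariant products are computed once per outer iteration, and the innermost loop writes to consecutive elements. The function name `permute_cast_hoisted` and the pointer-per-slice style are my own assumptions for illustration, not something taken from the question.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: same permutation and cast, with loop-invariant products hoisted
// and j as the outermost loop. size0..size2 stand in for size_proj[0..2].
void permute_cast_hoisted(const double* imgaux, float* img,
                          std::size_t size0, std::size_t size1, std::size_t size2) {
    assert(imgaux != nullptr && img != nullptr);
    const std::size_t slab = size0 * size1;       // computed once
    for (std::size_t j = 0; j < size2; j++) {
        const double* src = imgaux + j * slab;    // computed once per j
        float* dst = img + j * slab;
        for (std::size_t k = 0; k < size0; k++) {
            float* dst_row = dst + k * size1;     // computed once per k
            for (std::size_t i = 0; i < size1; i++)
                dst_row[i] = static_cast<float>(src[k + i * size0]);  // contiguous writes
        }
    }
}
```

Here the writes are contiguous and the reads are strided; putting k innermost instead would make the reads contiguous and the writes strided. Which one is faster is worth measuring rather than assuming.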

刺人心

2021-01-05 07:51

Ok, let's unravel your loop a little bit by precalculating things ASAP:

```
int max0 = size_proj[0];
int max1 = size_proj[1];
int max2 = size_proj[2];

for (int k = 0; k < max0; k++)
{
    int kOffset1 = k*max1;
    int kOffset2 = k;

    for (int i = 0; i < max1; i++)
    {
        int iOffset1 = i;
        int iOffset2 = i*max0;

        for (int j = 0; j < max2; j++)
        {
            int jOffset1 = j*max0*max1;
            int jOffset2 = j*max0*max1;

            int idx1 = iOffset1 + jOffset1 + kOffset1;
            int idx2 = iOffset2 + jOffset2 + kOffset2;
            img[idx1] = (float)imgaux[idx2];
        }
    }
}
```

Computing jOffset1/2 at the innermost level of the nested loop is suboptimal: it makes idx1/2 jump by max0*max1 on every iteration. So let's move it to the outermost level:

```
int max0 = size_proj[0];
int max1 = size_proj[1];
int max2 = size_proj[2];
for (int j = 0; j < max2; j++)
{
    int jOffset1 = j*max0*max1;
    int jOffset2 = j*max0*max1;

    for (int k = 0; k < max0; k++)
    {
        int kOffset1 = k*max1;
        int kOffset2 = k;

        for (int i = 0; i < max1; i++)
        {
            int iOffset1 = i;
            int iOffset2 = i*max0;

            int idx1 = iOffset1 + jOffset1 + kOffset1;
            int idx2 = iOffset2 + jOffset2 + kOffset2;
            img[idx1] = (float)imgaux[idx2];
        }
    }
}
```

That already looks better. kOffset1/2 and iOffset1/2 can't be optimized any further, but we still have unnecessary values and declarations. Let's consolidate them:

```
for (int j = 0; j < size_proj[2]; j++)
{
    int jOffset = j*size_proj[0]*size_proj[1];
    for (int k = 0; k < size_proj[0]; k++)
    {
        int kOffset1 = k*size_proj[1];
        for (int i = 0; i < size_proj[1]; i++)
        {
            int iOffset2 = i*size_proj[0];
            img[i + jOffset + kOffset1] = (float)imgaux[iOffset2 + jOffset + k];
        }
    }
}
```

I tried your updated MCVE with your loop and with mine (same system, using MSVC14):

Yours:

```
Time permuting and casting the input 4.180000
```

Mine:

```
Time permuting and casting the input 0.704000
```

Hopefully I didn't mess anything up ;-)
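One way to check that a reordering like this has not changed the result is to run both loop orders on a small array and compare the outputs. The sketch below does that; the function names and the use of std::vector are my own choices for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop order from the question: k outermost, j innermost.
std::vector<float> permute_question_order(const std::vector<double>& in,
                                          std::size_t K, std::size_t I, std::size_t J) {
    assert(in.size() == K * I * J);
    std::vector<float> out(in.size());
    for (std::size_t k = 0; k < K; k++)
        for (std::size_t i = 0; i < I; i++)
            for (std::size_t j = 0; j < J; j++)
                out[i + k * I + j * K * I] = static_cast<float>(in[k + i * K + j * K * I]);
    return out;
}

// Reordered version: j outermost, i innermost, offsets hoisted.
std::vector<float> permute_reordered(const std::vector<double>& in,
                                     std::size_t K, std::size_t I, std::size_t J) {
    assert(in.size() == K * I * J);
    std::vector<float> out(in.size());
    for (std::size_t j = 0; j < J; j++) {
        const std::size_t jOffset = j * K * I;
        for (std::size_t k = 0; k < K; k++) {
            const std::size_t kOffset1 = k * I;
            for (std::size_t i = 0; i < I; i++)
                out[i + jOffset + kOffset1] = static_cast<float>(in[i * K + jOffset + k]);
        }
    }
    return out;
}
```

Filling a 2 x 3 x 4 input with distinct values and asserting that both functions return equal vectors is a cheap test to run before timing anything.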

As @BarryTheHatchet pointed out, and as it is easily overlooked in the comment section: instead of using an array of 3 int values for size_proj, you'd better use three const int values.

Not using an array removes complexity from your code (given descriptive names, of course). Using const prevents you from accidentally changing the values in a complex calculation and may allow the compiler to optimize better.

As @paddy pointed out: you can replace the multiplications at the different levels of your nested loop with additions by precalculating the step sizes.

I tried this, but there wasn't any real difference between the multiplication version and the step version:

```
const int jStep     = size_proj[0] * size_proj[1];
const int jStepMax  = size_proj[0] * size_proj[1] * size_proj[2];
const int kStep1    = size_proj[1];
const int kStep1Max = size_proj[0] * size_proj[1];
const int kStep2    = 1;
const int kStep2Max = size_proj[0];
const int iStep1    = 1;
const int iStep1Max = size_proj[1];
const int iStep2    = size_proj[0];
const int iStep2Max = size_proj[0] * size_proj[1];

for (int jOffset = 0; jOffset < jStepMax; jOffset += jStep)
{
    for (int kOffset1 = 0, kOffset2=0; kOffset1 < kStep1Max && kOffset2 < kStep2Max; kOffset1+=kStep1, kOffset2+=kStep2)
    {
        for (int iOffset1 = 0, iOffset2 = 0; iOffset1 < iStep1Max && iOffset2 < iStep2Max; iOffset1 += iStep1, iOffset2 += iStep2)
        {
            img[iOffset1 + jOffset + kOffset1] = (float)imgaux[iOffset2 + jOffset + kOffset2];
        }
    }
}
```
