Question
I have written a code for a particular type of 3D CFD simulation, the Lattice-Boltzmann Method (quite similar to the code supplied with the book "The Lattice Boltzmann Method" by Timm Krüger et al.). When multithreading the program with OpenMP I ran into behaviour I can't quite understand: the parallel performance turns out to depend strongly on the overall domain size.
The basic principle is that each cell of a 3D domain is assigned values for 19 distribution functions (0-18) in discrete directions. They are stored in two linear arrays allocated on the heap (population 0 is kept in a separate array): the 18 populations of a given cell are contiguous in memory, the values of consecutive x positions lie next to each other, and so on (so a sort of row-major ordering: population -> x -> y -> z). The distribution functions are redistributed according to certain values within the cell and then streamed to the neighbouring cells. For this reason there are two population arrays, f1 and f2: the algorithm takes the values from f1, redistributes them and writes them into f2. Then the pointers are swapped and the algorithm starts again.

The code works perfectly fine on a single core, but when I try to parallelise it on multiple cores the performance depends on the overall size of the domain: for very small domains (10^3 cells) the algorithm is comparably slow at 15 million cells per second, for moderately small domains (30^3 cells) it is quite fast at over 60 million cells per second, and for anything larger than that the performance drops again to about 30 million cells per second. Executing the code on a single core only yields the same performance of about 15 million cells per second. The exact numbers vary between processors, but qualitatively the same issue remains.
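In simplified form, indexing helpers that follow this layout look like the following sketch (shown here only to make the layout concrete; the actual implementation is not important for the question):

#include <cstddef>

inline std::size_t D3Q19_ScalarIndex(unsigned int x, unsigned int y, unsigned int z)
{
    return ((std::size_t)z*NY + y)*NX + x;                   // one value per cell (population 0)
}

inline std::size_t D3Q19_FieldIndex(unsigned int x, unsigned int y, unsigned int z, unsigned int d)
{
    return (((std::size_t)z*NY + y)*NX + x)*18 + (d - 1);    // populations 1-18, contiguous per cell
}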
The core of the code boils down to this parallelised loop, which is executed over and over again while the pointers to f1 and f2 are swapped in between:
#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static)
for(unsigned int z = 0; z < NZ; ++z)
{
for(unsigned int y = 0; y < NY; ++y)
{
for(unsigned int x = 0; x < NX; ++x)
{
/// temporary populations
double ft0 = f0[D3Q19_ScalarIndex(x,y,z)];
double ft1 = f1[D3Q19_FieldIndex(x,y,z,1)];
double ft2 = f1[D3Q19_FieldIndex(x,y,z,2)];
double ft3 = f1[D3Q19_FieldIndex(x,y,z,3)];
double ft4 = f1[D3Q19_FieldIndex(x,y,z,4)];
double ft5 = f1[D3Q19_FieldIndex(x,y,z,5)];
double ft6 = f1[D3Q19_FieldIndex(x,y,z,6)];
double ft7 = f1[D3Q19_FieldIndex(x,y,z,7)];
double ft8 = f1[D3Q19_FieldIndex(x,y,z,8)];
double ft9 = f1[D3Q19_FieldIndex(x,y,z,9)];
double ft10 = f1[D3Q19_FieldIndex(x,y,z,10)];
double ft11 = f1[D3Q19_FieldIndex(x,y,z,11)];
double ft12 = f1[D3Q19_FieldIndex(x,y,z,12)];
double ft13 = f1[D3Q19_FieldIndex(x,y,z,13)];
double ft14 = f1[D3Q19_FieldIndex(x,y,z,14)];
double ft15 = f1[D3Q19_FieldIndex(x,y,z,15)];
double ft16 = f1[D3Q19_FieldIndex(x,y,z,16)];
double ft17 = f1[D3Q19_FieldIndex(x,y,z,17)];
double ft18 = f1[D3Q19_FieldIndex(x,y,z,18)];
/// microscopic to macroscopic
double r = ft0 + ft1 + ft2 + ft3 + ft4 + ft5 + ft6 + ft7 + ft8 + ft9 + ft10 + ft11 + ft12 + ft13 + ft14 + ft15 + ft16 + ft17 + ft18;
double rinv = 1.0/r;
double u = rinv*(ft1 - ft2 + ft7 + ft8 + ft9 + ft10 - ft11 - ft12 - ft13 - ft14);
double v = rinv*(ft3 - ft4 + ft7 - ft8 + ft11 - ft12 + ft15 + ft16 - ft17 - ft18);
double w = rinv*(ft5 - ft6 + ft9 - ft10 + ft13 - ft14 + ft15 - ft16 + ft17 - ft18);
/// collision & streaming
double trw0 = omega*r*w0; //temporary variables
double trwc = omega*r*wc;
double trwd = omega*r*wd;
double uu = 1.0 - 1.5*(u*u+v*v+w*w);
double bu = 3.0*u;
double bv = 3.0*v;
double bw = 3.0*w;
unsigned int xp = (x + 1) % NX; //calculate x,y,z coordinates of neighbouring cells
unsigned int yp = (y + 1) % NY;
unsigned int zp = (z + 1) % NZ;
unsigned int xm = (NX + x - 1) % NX;
unsigned int ym = (NY + y - 1) % NY;
unsigned int zm = (NZ + z - 1) % NZ;
f0[D3Q19_ScalarIndex(x,y,z)] = bomega*ft0 + trw0*(uu); //redistribute distribution functions and stream to neighbouring cells
double cu = bu;
f2[D3Q19_FieldIndex(xp,y, z, 1)] = bomega*ft1 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = -bu;
f2[D3Q19_FieldIndex(xm,y, z, 2)] = bomega*ft2 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = bv;
f2[D3Q19_FieldIndex(x, yp,z, 3)] = bomega*ft3 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = -bv;
f2[D3Q19_FieldIndex(x, ym,z, 4)] = bomega*ft4 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = bw;
f2[D3Q19_FieldIndex(x, y, zp, 5)] = bomega*ft5 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = -bw;
f2[D3Q19_FieldIndex(x, y, zm, 6)] = bomega*ft6 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = bu+bv;
f2[D3Q19_FieldIndex(xp,yp,z, 7)] = bomega*ft7 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bu-bv;
f2[D3Q19_FieldIndex(xp,ym,z, 8)] = bomega*ft8 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bu+bw;
f2[D3Q19_FieldIndex(xp,y, zp, 9)] = bomega*ft9 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bu-bw;
f2[D3Q19_FieldIndex(xp,y, zm,10)] = bomega*ft10 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu+bv;
f2[D3Q19_FieldIndex(xm,yp,z, 11)] = bomega*ft11 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu-bv;
f2[D3Q19_FieldIndex(xm,ym,z, 12)] = bomega*ft12 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu+bw;
f2[D3Q19_FieldIndex(xm,y, zp,13)] = bomega*ft13 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu-bw;
f2[D3Q19_FieldIndex(xm,y, zm,14)] = bomega*ft14 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bv+bw;
f2[D3Q19_FieldIndex(x, yp,zp,15)] = bomega*ft15 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bv-bw;
f2[D3Q19_FieldIndex(x, yp,zm,16)] = bomega*ft16 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bv+bw;
f2[D3Q19_FieldIndex(x, ym,zp,17)] = bomega*ft17 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bv-bw;
f2[D3Q19_FieldIndex(x, ym,zm,18)] = bomega*ft18 + trwd*(uu + cu*(1.0 + 0.5*cu));
}
}
}
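Schematically, the surrounding time loop does nothing more than call this kernel and swap the pointers (a sketch with the placeholder names CollideAndStream and NSTEPS):

for(unsigned int step = 0; step < NSTEPS; ++step)
{
    CollideAndStream(f0, f1, f2);   // the parallel loop shown above: read f1, write f2
    std::swap(f1, f2);              // from <utility>; the next step reads what was just written
}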
It would be awesome if someone could give me tips on how to track down the reason for this particular behaviour, or even had an idea of what could be causing it. If needed I can supply a full version of the simplified code! Thanks a lot in advance!
Answer 1:
Achieving scaling on shared-memory systems (threaded code on a single machine) is quite tricky and often requires large amounts of tuning. What's likely happening in your code is that the part of the domain assigned to each thread fits into cache for the "quite small" problem size, but as the problem size increases in NX and NY, the data per thread stops fitting into cache.
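To put rough numbers on that (assuming 8-byte doubles and the layout from the question, i.e. the scalar array f0 plus the two 18-population arrays f1 and f2; cache sizes are of course machine dependent):

const unsigned int bytesPerCell = sizeof(double) * (1 + 18 + 18);   // 296 bytes per cell
// e.g. 30^3 cells  :   27 000 * 296 B ~   8 MB -> large parts can stay resident in a shared L3 cache
// e.g. 100^3 cells : 1 000 000 * 296 B ~ 296 MB -> every sweep has to stream from main memory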
To avoid issues like this, it is better to decompose the domain into fixed-size blocks, so that as the domain grows it is the number of blocks that changes rather than their size. For example:
#include <algorithm>   // std::min
#include <cmath>       // std::ceil

// non-const so they can be listed explicitly in the shared clause under default(none)
unsigned int numBlocksZ = std::ceil(static_cast<double>(NZ) / BLOCK_SIZE);
unsigned int numBlocksY = std::ceil(static_cast<double>(NY) / BLOCK_SIZE);
unsigned int numBlocksX = std::ceil(static_cast<double>(NX) / BLOCK_SIZE);
unsigned int numBlocks  = numBlocksX * numBlocksY * numBlocksZ;
#pragma omp parallel for default(none) shared(f0,f1,f2,numBlocks,numBlocksX,numBlocksY) schedule(static,1)
for(unsigned int block = 0; block < numBlocks; ++block)
{
    // recover the block coordinates from the linear block index (x fastest, then y, then z)
    unsigned int startZ = BLOCK_SIZE * (block / (numBlocksX*numBlocksY));
    unsigned int endZ   = std::min(startZ + BLOCK_SIZE, NZ);
    unsigned int startY = BLOCK_SIZE * ((block % (numBlocksX*numBlocksY)) / numBlocksX);
    unsigned int endY   = std::min(startY + BLOCK_SIZE, NY);
    unsigned int startX = BLOCK_SIZE * (block % numBlocksX);
    unsigned int endX   = std::min(startX + BLOCK_SIZE, NX);
    for(unsigned int z = startZ; z < endZ; ++z)
    {
        for(unsigned int y = startY; y < endY; ++y)
        {
            for(unsigned int x = startX; x < endX; ++x)
            {
                ...
            }
        }
    }
}
An approach like the above should also improve cache locality through 3D blocking (assuming this is a 3D stencil operation) and therefore further improve your performance. You'll need to tune BLOCK_SIZE to find what gives the best performance on a given system (I'd start small and increase in powers of two, e.g. 4, 8, 16, ...).
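One simple way to do that tuning is to time a fixed number of lattice updates for each candidate BLOCK_SIZE and compare the update rate in million cells per second, matching your own measurements. A minimal sketch, assuming a hypothetical run_steps() wrapper around the blocked loop above (which swaps f1/f2 after each step) and OpenMP's omp_get_wtime():

#include <cstdio>
#include <omp.h>

void benchmark(double* f0, double* f1, double* f2, unsigned int nsteps)
{
    double t0 = omp_get_wtime();
    run_steps(f0, f1, f2, nsteps);                 // assumed wrapper around the blocked kernel
    double seconds = omp_get_wtime() - t0;
    double mcps = (double)NX * NY * NZ * nsteps / (seconds * 1.0e6);
    std::printf("BLOCK_SIZE = %u: %.1f million cells per second\n",
                (unsigned)BLOCK_SIZE, mcps);
}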
Source: https://stackoverflow.com/questions/50847746/scaling-issues-with-openmp