Question
Hello, I'm trying to compare the speed of serial and parallel versions of the 'Game of Life'. I used the SFML library to visualize the game like this: [screenshot: SFML window]. The serial logic is simple, as shown below.
for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
        int neighbor = 0;
        // check 8 cells around.
        // 1 2 3   -1
        // 4   5    0
        // 6 7 8   +1
        // (1)
        if (gamefieldSerial.isAvailableCell(UP(i), LEFT(j))) {
            if (gamefieldSerial[UP(i)][LEFT(j)] == LIVE) neighbor++;
        }
        // (2)
        if (gamefieldSerial.isAvailableCell(UP(i), j)) {
            if (gamefieldSerial[UP(i)][j] == LIVE) neighbor++;
        }
        // (3)
        if (gamefieldSerial.isAvailableCell(UP(i), RIGHT(j))) {
            if (gamefieldSerial[UP(i)][RIGHT(j)] == LIVE) neighbor++;
        }
        // (4)
        if (gamefieldSerial.isAvailableCell(i, LEFT(j))) {
            if (gamefieldSerial[i][LEFT(j)] == LIVE) neighbor++;
        }
        // (5)
        if (gamefieldSerial.isAvailableCell(i, RIGHT(j))) {
            if (gamefieldSerial[i][RIGHT(j)] == LIVE) neighbor++;
        }
        // (6)
        if (gamefieldSerial.isAvailableCell(DOWN(i), LEFT(j))) {
            if (gamefieldSerial[DOWN(i)][LEFT(j)] == LIVE) neighbor++;
        }
        // (7)
        if (gamefieldSerial.isAvailableCell(DOWN(i), j)) {
            if (gamefieldSerial[DOWN(i)][j] == LIVE) neighbor++;
        }
        // (8)
        if (gamefieldSerial.isAvailableCell(DOWN(i), RIGHT(j))) {
            if (gamefieldSerial[DOWN(i)][RIGHT(j)] == LIVE) neighbor++;
        }
        // -- Rules of the Game of Life
        // A cell is born when exactly 3 neighbors are LIVE
        // A cell remains alive when 2 or 3 neighbors are LIVE
        // A cell with more than 3 neighbors dies of overpopulation
        // A cell with fewer than 2 neighbors dies of underpopulation
        if (gamefieldSerial[i][j] == DEAD) {
            if (neighbor == 3) {
                gamefieldSerial[i][j] = LIVE;
            }
        }
        else if (gamefieldSerial[i][j] == LIVE) {
            if (neighbor < 2 || neighbor > 3) {
                gamefieldSerial[i][j] = DEAD;
            }
        }
    }
}
It took 3940 ms for 768*256 cells over 100 generations. But in the parallel version I implemented it like below,
#pragma omp parallel for num_threads(4)
for (int t = 0; t < width * height; t++) {
    int i = t / width;
    int j = t % width;
    int neighbor = 0;
    // check 8 cells around.
    // 1 2 3   -1
    // 4   5    0
    // 6 7 8   +1
    // (1)
    if (gamefieldParallel.isAvailableCell(UP(i), LEFT(j))) {
        if (gamefieldParallel[UP(i)][LEFT(j)] == LIVE) neighbor++;
    }
    // (2)
    if (gamefieldParallel.isAvailableCell(UP(i), j)) {
        if (gamefieldParallel[UP(i)][j] == LIVE) neighbor++;
    }
    // (3)
    if (gamefieldParallel.isAvailableCell(UP(i), RIGHT(j))) {
        if (gamefieldParallel[UP(i)][RIGHT(j)] == LIVE) neighbor++;
    }
    // (4)
    if (gamefieldParallel.isAvailableCell(i, LEFT(j))) {
        if (gamefieldParallel[i][LEFT(j)] == LIVE) neighbor++;
    }
    // (5)
    if (gamefieldParallel.isAvailableCell(i, RIGHT(j))) {
        if (gamefieldParallel[i][RIGHT(j)] == LIVE) neighbor++;
    }
    // (6)
    if (gamefieldParallel.isAvailableCell(DOWN(i), LEFT(j))) {
        if (gamefieldParallel[DOWN(i)][LEFT(j)] == LIVE) neighbor++;
    }
    // (7)
    if (gamefieldParallel.isAvailableCell(DOWN(i), j)) {
        if (gamefieldParallel[DOWN(i)][j] == LIVE) neighbor++;
    }
    // (8)
    if (gamefieldParallel.isAvailableCell(DOWN(i), RIGHT(j))) {
        if (gamefieldParallel[DOWN(i)][RIGHT(j)] == LIVE) neighbor++;
    }
    // -- Rules of the Game of Life
    // A cell is born when exactly 3 neighbors are LIVE
    // A cell remains alive when 2 or 3 neighbors are LIVE
    // A cell with more than 3 neighbors dies of overpopulation
    // A cell with fewer than 2 neighbors dies of underpopulation
    if (gamefieldParallel[i][j] == DEAD) {
        if (neighbor == 3) {
            gamefieldParallel[i][j] = LIVE;
        }
    }
    else if (gamefieldParallel[i][j] == LIVE) {
        if (neighbor < 2 || neighbor > 3) {
            gamefieldParallel[i][j] = DEAD;
        }
    }
}
It took 5746 ms in the same environment. I thought applying OpenMP's 'for' directive to the loop would enhance performance, but it doesn't. Should I approach this in another way?
============= Both gamefieldParallel and gamefieldSerial are instances of a GameField class, which has a dynamically allocated int** field variable for the cells. I'm using operator overloading to access it like a two-dimensional array. (Sorry for my bad English!)
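For reference, a hypothetical sketch of what such a class could look like (an illustration of the layout described above, not the actual code from the project):
// Hypothetical GameField-like class: int** storage plus operator[] so cells
// can be accessed as field[i][j]; copy control is omitted to keep the sketch short.
class GameField {
public:
    GameField(int height, int width) : height_(height), width_(width) {
        field_ = new int*[height_];
        for (int i = 0; i < height_; ++i)
            field_[i] = new int[width_]();                 // zero-initialised, i.e. all DEAD
    }
    ~GameField() {
        for (int i = 0; i < height_; ++i) delete[] field_[i];
        delete[] field_;
    }
    int* operator[](int row) { return field_[row]; }       // enables field[i][j]
    bool isAvailableCell(int row, int col) const {
        return row >= 0 && row < height_ && col >= 0 && col < width_;
    }
private:
    int height_, width_;
    int** field_;
};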
Answer 1:
In both your serial version and your parallel version you are updating the game field as you iterate through it. This is a bug in both versions. Say you calculate the new state for gameField[0][0] and then update it. Now you move on to gameField[0][1] --- as part of calculating its new state, you look left to gameField[0][0], which already contains that cell's newly updated state, but the Game of Life's rules have to be applied to the first cell's previous state.
In other words, you should have a read-only (const) oldGameField and fill the new states into newGameField. Once you have calculated all cells, you can use the new field as the old field for the next game iteration.
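As a rough illustration of that idea (countLiveNeighbours() is a hypothetical stand-in for your eight explicit checks, and the field names are placeholders, not your actual API):
// Every read targets oldGameField; every write targets newGameField.
int countLiveNeighbours(const GameField& field, int i, int j);   // hypothetical helper, not shown

for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
        int n = countLiveNeighbours(oldGameField, i, j);          // reads the OLD state only
        if (oldGameField[i][j] == LIVE)
            newGameField[i][j] = (n == 2 || n == 3) ? LIVE : DEAD;
        else
            newGameField[i][j] = (n == 3) ? LIVE : DEAD;
    }
}
// After the whole pass, swap the two fields (e.g. std::swap(oldGameField, newGameField))
// so the new field becomes the old field for the next generation.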
Fixing that bug is actually an important part of sorting out your performance problem.
Multi-threading
Instead of thinking of 4 processors doing these updates, imagine 4 people doing it with pencil and paper. Because you will now treat oldGameField as read-only, it is safe to photocopy the old page and give a copy to each person. Each person knows that no-one else is going to change the old page, so they don't have to worry about their copy ever becoming out of date.
But you still only have one page for the newGameField. In your serial approach, there is only one person, who has the page and the pencil exclusively to themselves. Now you have four people trying to draw on the same page at the same time. They waste more time passing the pencil and page amongst themselves than the time they spend doing the calculations!
It literally takes four people longer to do the job than one person could have done it alone.
This is not meant to be an exact representation of what's going on inside hardware, but when we consider any locking and/or atomics OpenMP might be using, and things like the memory caches in your processor's cores --- it's pretty close.
So how do we fix it?
You and your three friends could decide that each of you would update one quarter of the field. Maybe you take the whole top quarter of the board. The next person takes the second quarter, and so on. And rather than fighting over one sheet of paper to draw the new game field, you each have your own piece of paper to draw just your quarter of the new states.
Once you are all done, you quickly paste your four pieces of paper together to make the new game field page.
The point here is to make sure that each thread is reading from memory that no-one is changing, and that each thread writes only to memory that no other thread is accessing. This means that cores do not block each other with memory writes, and do not have to flush their caches when they see other cores writing to shared memory.
Making sure that each thread's new game field memory isn't close enough to interfere with the memory used by some other thread is tricky. You generally need to know something about the size of the cache lines on your cores, whether or not your platform uses something called "NUMA", etc.
I don't know OpenMP - but perhaps it has some built-in support to help with this.
In summary
- Separate the old state from the new state
- Your threads can all happily share the old state (because no-one is changing it during an iteration)
- Break the work into chunks - one for each thread
- Give each thread its own piece of memory to store its results.
- Once all threads are finished, have your main thread bring all their results together into one combined new game field (a rough sketch of this scheme follows below).
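For illustration only, a rough sketch of this scheme using std::thread (I'm not using OpenMP here); the Field alias, stepRows() and the band sizes are assumptions made for the example, not your actual code. Each worker reads the shared, read-only old field and writes only to its own band of rows in the new field, so no two threads ever write to the same memory.
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

using Field = std::vector<std::vector<int>>;

// Hypothetical per-band update (not shown): fills newField rows [rowBegin, rowEnd)
// from oldField, exactly like the serial double-buffer loop sketched earlier.
void stepRows(const Field& oldField, Field& newField, int rowBegin, int rowEnd);

void stepParallel(const Field& oldField, Field& newField, int height, int numThreads) {
    std::vector<std::thread> workers;
    int rowsPerThread = (height + numThreads - 1) / numThreads;   // "one quarter each" for 4 threads

    for (int t = 0; t < numThreads; ++t) {
        int rowBegin = t * rowsPerThread;
        int rowEnd   = std::min(height, rowBegin + rowsPerThread);
        workers.emplace_back(stepRows, std::cref(oldField), std::ref(newField),
                             rowBegin, rowEnd);                   // each thread owns its own rows
    }
    for (auto& w : workers) w.join();   // "paste the pieces together": all bands are now filled
}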
Answer 2:
GoL is the perfect specimen for OpenMP parallelisation since it is embarrassingly parallel - computing the value of a cell in the current generation does not depend on the computations of the neighbouring cells. The problem here is that you are reading from and writing to the same array, which is wrong from an implementation point of view. In your sequential code, this simply results in wrongly computed cell states, but in parallel you run into problems such as false sharing that slow down the program significantly. Also, you have replaced the two nested loops with a single one and you compute the row and column indexes using modulo arithmetic, which is probably the biggest source of slowdown. Another reason for the slowdown is having the parallel region inside the generation loop - you are paying the price of activating the threads in the region once per generation.
What you need to do is use two arrays - one for the previous generation and one for the current one. Read from the former array and write into the latter. Once finished, swap the arrays and repeat. In pseudo-C++ with OpenMP, the solution looks like this:
#pragma omp parallel
{
    // Generations loop (1)
    for (int gen = 0; gen < NUM_GENERATIONS; gen++) {
        // Compute the new current generation (2)
        #pragma omp for
        for (int i = 0; i < height; i++) {
            for (int j = 0; j < width; j++) {
                // Count the number of live neighbours of current[i][j] (3)
                int neighbours = count_neighbours(current, i, j);
                // Update the state of the current cell (4)
                if (current[i][j] == DEAD)
                    next[i][j] = (neighbours == 3) ? LIVE : DEAD;
                else // current[i][j] == LIVE
                    next[i][j] = (neighbours < 2 || neighbours > 3) ? DEAD : LIVE;
            }
        }
        // The following block runs in the master thread (5)
        #pragma omp master
        {
            // Swap the current and next arrays
            std::swap(current, next);
            // Display the board state (if necessary)
            display(current);
        }
        // Synchronise the threads before the next iteration (6)
        #pragma omp barrier
    }
}
Things to notice (numbers correspond to those in the code comments):
1. The outer (generations) loop is inside the parallel region. This removes the overhead of activating and deactivating the region on each iteration.
2. The worksharing construct for is applied to the loop that runs over the rows of the board. That is enough to optimally parallelise the problem. If you can ensure that width times the sizeof the element type of next is a multiple of 64 bytes (the cache line size on most CPUs), the possibility of false sharing will be eliminated.
3. Counting the number of neighbours involves the values in the current array.
4. The new values go into the next array.
5. Once the next generation is fully computed, we need to swap the arrays and make next become current for the next iteration of the generations loop. This should be done by a single thread, and in this case the burden falls on the master thread. Notice that this swap is most efficient if both current and next are pointers to the actual arrays. Swapping array values element by element is slow; swapping two pointers to those arrays is crazy fast (a small sketch follows after this list). Using the master thread also gives you the opportunity to make GUI calls, e.g. display() (assuming that's the function that draws the board to the screen).
6. The master construct doesn't have an implicit barrier on exit, so we need to explicitly synchronise the threads, otherwise some thread(s) might start the next iteration before we've swapped the arrays.
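To illustrate the pointer swap in point 5 (the flat buffer layout here is just an assumption for the example): current and next are pointers to two separately allocated boards, so std::swap exchanges two addresses in constant time, no matter how large the board is.
#include <utility>
#include <vector>

int main() {
    const int height = 256, width = 768;
    std::vector<int> bufA(height * width, 0), bufB(height * width, 0);
    int* current = bufA.data();
    int* next    = bufB.data();

    // ... compute next[i * width + j] from current[i * width + j] ...

    std::swap(current, next);   // O(1): no cell values are copied
    return 0;
}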
If you are not going to display the intermediate board states, the following code:
// The following block runs in the master thread (5)
#pragma omp master
{
    // Swap the current and next arrays
    std::swap(current, next);
    // Display the board state (if necessary)
    display(current);
}
// Synchronise the threads before the next iteration (6)
#pragma omp barrier
can be replaced with:
#pragma omp single
std::swap(current, next);
The single construct has an implicit barrier on exit, so there is no need to add an explicit one.
I'm going to give you one more piece of unsolicited advice on speeding up the computation. Having all those conditionals
if (gamefield.isAvailableCell(UP(i), LEFT(j))) {
...
}
slows down your code, since modern CPUs do better when there are no conditionals. This code only serves to catch cells at the borders of the simulation array. So, instead of checking for each cell whether it has a neighbour in a given direction, simply make the board two cells wider (one column at the beginning and one at the end of each row) and two cells taller (one row at the top and one at the bottom) and keep the extra cells empty (DEAD). Then isAvailableCell() will always be true and you can get rid of the conditionals. Just remember to run the loops from 1 to width/height inclusive.
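A minimal sketch of the padded-board idea, using plain vectors instead of your GameField class (the 0/1 cell values and dimensions are assumptions for the example): the extra DEAD border row/column on every side means all eight neighbour reads are always in bounds, so no conditionals are needed.
#include <vector>

using Board = std::vector<std::vector<int>>;   // (height + 2) x (width + 2), 0 = DEAD, 1 = LIVE

void step(const Board& current, Board& next, int height, int width) {
    for (int i = 1; i <= height; ++i) {        // real cells live in rows 1..height
        for (int j = 1; j <= width; ++j) {     // and columns 1..width
            // All eight accesses are in bounds thanks to the ghost border.
            int n = current[i-1][j-1] + current[i-1][j] + current[i-1][j+1]
                  + current[i  ][j-1]                   + current[i  ][j+1]
                  + current[i+1][j-1] + current[i+1][j] + current[i+1][j+1];
            if (current[i][j] == 1)
                next[i][j] = (n == 2 || n == 3) ? 1 : 0;   // survival
            else
                next[i][j] = (n == 3) ? 1 : 0;             // birth
        }
    }
    // The ghost cells of `next` are never written, so they stay DEAD as long
    // as they were initialised to 0.
}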
Source: https://stackoverflow.com/questions/61393606/openmp-with-game-of-life-visualization-using-sfml