Monte Carlo simulation runs significantly slower than sequential

Submitted by 左心房为你撑大大i on 2021-01-28 19:42:07

Question


I'm new to the concept of concurrent and parallel programming in general. I'm trying to calculate Pi using the Monte Carlo method in C. Here is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void)
{
    long points;
    long m = 0;
    double coordinates[2];
    double distance;
    printf("Enter the number of points: ");
    scanf("%ld", &points);

    srand((unsigned long) time(NULL));
    for(long i = 0; i < points; i++)
    {
        coordinates[0] = ((double) rand() / (RAND_MAX));
        coordinates[1] = ((double) rand() / (RAND_MAX));
        distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));
        if(distance <= 1)
            m++;
    }

    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

When I try to make this program parallel using the OpenMP API, it runs almost 4 times slower.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{

    long total_points;              // Total number of random points which is given by the user
    volatile long total_m = 0;      // Total number of random points which are inside of the circle
    int threads = get_nprocs();     // This is needed so each thread knows how many random points it should generate
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel
    {
       double coordinates[2];          // Contains the x and y of each random point
       long m = 0;                     // Number of points that are in the circle for any particular thread
       long points = total_points / threads;   // Number of random points that each thread should generate
       double distance;                // Distance of the random point from the center of the circle, if greater than 1 then the point is outside of the circle
       srand((unsigned long) time(NULL));

        for(long i = 0; i < points; i++)
        {
           coordinates[0] = ((double) rand() / (RAND_MAX));    // Random x
           coordinates[1] = ((double) rand() / (RAND_MAX));    // Random y
           distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));   // Calculate the distance
          if(distance <= 1)
              m++;
       }

       #pragma omp critical
       {
           total_m += m;
       }
    }

    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);
}

I tried looking up the reason, but there were different answers for different algorithms.


Answer 1:


There are two sources of overhead in your code, namely the critical region and the call to rand(). Instead of rand(), use rand_r():

I think you're looking for rand_r(), which explicitly takes the current RNG state as a parameter. Each thread should then have its own copy of the seed data (whether you want each thread to start off with the same seed or different ones depends on what you're doing; here you want them to be different, or you'd get the same sequence again and again).

The critical region can be removed by using the OpenMP reduction clause. Moreover, you need neither the call to sqrt nor the manual division of the points among the threads (i.e., long points = total_points / threads;); you can use #pragma omp for for that. Your code would then look like the following:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long total_points; 
    long total_m = 0;
    int threads = get_nprocs();   
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel 
    {                  
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for reduction (+: total_m)
        for(long i = 0; i < total_points; i++){
            if(pow((double) rand_r(&myseed) / (RAND_MAX), 2) + pow((double) rand_r(&myseed) / (RAND_MAX), 2) <= 1)
               total_m++;
         }
     }
    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);

}

A quick test on my machine for an input of 1000000000:

sequential : 16.282835 seconds 
2 threads  :  8.206498 seconds  (1.98x faster)
4 threads  :  4.107366 seconds  (3.96x faster)
8 threads  :  2.728513 seconds  (5.96x faster)

Bear in mind that my machine has only 4 cores. Notwithstanding, for a more meaningful comparison, one should try to optimize the sequential code as much as possible and then compare it with the parallel versions. Naturally, if the sequential version is as optimized as possible, the speedups of the parallel version might drop. For instance, testing the current parallel version, without modifications, against the sequential version of the code provided by @user3666197 yields the following results:

sequential :  9.343118 seconds 
2 threads  :  8.206498 seconds  (1.13x faster)
4 threads  :  4.107366 seconds  (2.27x faster)
8 threads  :  2.728513 seconds  (3.42x faster)

However, one could also keep improving the parallel version, and so on and so forth. For instance, if one takes @user3666197's version, fixes the race condition in the update of the coordinates (which were shared among threads), and adds the OpenMP #pragma omp for, we get the following code:

int main(void)
{
    double start = omp_get_wtime();
    long points = 1000000000; //....................................... INPUT AVOIDED
    long m = 0;
    unsigned long HAUSNUMERO = 1;
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX;

    int threads = get_nprocs();
    omp_set_num_threads(threads);
    #pragma omp parallel reduction (+: m )
    {
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
        #pragma omp for nowait
        for(long i = 0; i < points; i++)
        {
            double x = rand_r( &aThreadSpecificSEED_x );
            double y = rand_r( &aThreadSpecificSEED_y );
            m += (1  >= ( x * x + y * y ) * DIV1byMAXbyMAX);
        }
    }
    double end = omp_get_wtime();
    printf("%f\n",end-start);
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

which yields the following results:

sequential :  9.160571 seconds 
2 threads  :  4.769141 seconds  (1.92 x faster)
4 threads  :  2.456783 seconds  (3.72 x faster)
8 threads  :  2.203758 seconds  (4.15 x faster)

I am compiling with the flags -O3 -std=c99 -fopenmp, using gcc version 4.9.3 (MacPorts gcc49 4.9.3_0).




Answer 2:


The problem you have is inherent to the use of the function rand(), which is not necessarily reentrant. When more than one thread enters this function, the threads compete to read and write its internal state in a non-thread-safe manner, and this competition leads to extremely slow behaviour. Instead of rand(), look for a similar function that is reentrant to get rid of this problem.




Answer 3:


You need to replace rand() with a thread-specific random number generator that accesses only a local variable. Otherwise the threads compete to synchronise the same cache line.




Answer 4:


Adding a few cents beyond the Amdahl's Law argument

Having only an utmost trivial amount of "useful" work in the loop, AVX-512 register-parallel and SIMD-aligned tricks will most probably outperform any heavyweight OpenMP processing preparations for points << 1E15+.

This answer was provided to show where else the code can achieve large savings, thanks to an analytically equivalent problem formulation (avoiding expensive SQRT-s and DIV-s where they add no value).
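The analytic equivalence being exploited is elementary: for non-negative values the square root is monotone, so it can be dropped from the comparison, and the division by RAND_MAX can be factored out of the loop as a single precomputed multiplier:

```latex
\sqrt{x^2 + y^2} \le 1 \;\iff\; x^2 + y^2 \le 1 \qquad (x, y \ge 0)
```

```latex
\left(\tfrac{x}{M}\right)^2 + \left(\tfrac{y}{M}\right)^2
  \;=\; \left(x^2 + y^2\right)\cdot\frac{1}{M^2},
  \qquad M = \texttt{RAND\_MAX}
```

This is exactly why the code below precomputes DIV1byMAXbyMAX = 1 / RAND_MAX / RAND_MAX and multiplies by it instead of dividing each sample.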

The code is available for any further online experimentation & profiling at Godbolt.org IDE.

Revised reduction code at Godbolt.org IDE for any further re-testing.

Proposing timed sections is left to @dreamcrash, to keep a level playing field for re-testing with meaningful comparisons:

#include <stdio.h> //............................. -O3 -fopenmp
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long points = 1000; //....................................... INPUT AVOIDED
    long m = 0;
//  double coordinates[2]; //.................................... OBVIOUS TO BE PUT IN PRIVATE PART
    unsigned long HAUSNUMERO = 1; //............................. AVOID SIN OF IRREPRODUCIBILITY
//  printf( "RAND_MAX is %ld on this platform\n", RAND_MAX );//.. 2147483647 PLATFORM SPECIFIC
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX; //......... PRECOMPUTE A STATIC VALUE

    int threads = get_nprocs();
    omp_set_num_threads(threads);

    #pragma omp parallel reduction (+: m )
    {
    //..............................SEED.x PRINCIPALLY INDEPENDENT FOR MUTUALLY RANDOM SEQ-[x,y]
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
    //..............................SEED.y PRINCIPALLY INDEPENDENT FOR MUTUALLY RANDOM SEQ-[x,y]
        double x, y;

        for(long i = 0; i < points / threads; i++)
        {   
            x = rand_r( &aThreadSpecificSEED_x );
            y = rand_r( &aThreadSpecificSEED_y );

            if( 1  >= ( x * x //................. NO INTERIM STORAGE NEEDED
                      + y * y //................. NO SQRT EVER NEEDED
                        ) * DIV1byMAXbyMAX //.... MUL is WAY FASTER THAN DIV
                   )
            m++;
        }
    }
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}


Source: https://stackoverflow.com/questions/65560634/monte-carlo-simulation-runs-significantly-slower-than-sequential
