May compiler optimizations be inhibited by multi-threading?

一整个雨季 2020-12-09 09:25

It happened to me a few times that I parallelized portions of programs with OpenMP, just to notice that in the end, despite the good scalability, most of the foreseen speed-up was lost because the single-threaded OpenMP build performed worse than the plain serial version.

4 Answers
  • 2020-12-09 10:08

    That's a good question, even if it's rather broad, and I'm looking forward to hearing from the experts. I think @JimCownie had a good comment about this in the following discussion: Reasons for omp_set_num_threads(1) slower than no openmp

    Auto-vectorization and parallelization, I think, are often a problem. If you turn on auto-parallelization in MSVC 2012 (auto-vectorization is on by default), they seem not to mix well together. Using OpenMP seems to disable MSVC's auto-vectorization. The same may be true for GCC with OpenMP and auto-vectorization, but I'm not sure.
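
    As an aside that is not in the original answer: newer OpenMP versions let you request both kinds of parallelism explicitly instead of hoping the auto-vectorizer survives. The sketch below is mine, with a made-up function name, and assumes a compiler with OpenMP 4.0 support (for example gcc 4.9 or later):

    /* Illustrative sketch (assumes OpenMP 4.0+): "parallel for simd" splits the
     * iterations across threads and asks the compiler to vectorize each thread's
     * chunk, rather than relying on an auto-vectorizer that some compilers turn
     * off inside OpenMP loops. */
    void saxpy(float a, const float * restrict x, float * restrict y, int n)
    {
        int i;

        #pragma omp parallel for simd
        for (i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }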

    I don't really trust auto-vectorization in the compiler anyway. One reason is that I'm not sure it does loop unrolling to eliminate loop-carried dependencies as well as scalar code. For this reason I try to do these things myself. I do the vectorization myself (using Agner Fog's vector class) and I unroll the loops myself. By doing this by hand I feel more confident that I maximize all the parallelism: TLP (e.g. with OpenMP), ILP (e.g. by removing data dependencies with loop unrolling), and SIMD (with explicit SSE/AVX code).
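
    As a concrete, hedged illustration of that combination (my sketch, not the answerer's code, and using plain C instead of Agner Fog's vector class): the four independent accumulators break the serial dependency chain of a naive "sum += ..." loop so the CPU can overlap the additions (ILP), the OpenMP reduction spreads the blocks across threads (TLP), and each thread's inner block stays easy for the compiler or hand-written intrinsics to turn into SIMD.

    #include <stddef.h>

    /* Dot product, unrolled by hand with four accumulators; the 4-way
     * unroll factor is an illustrative choice, not a recommendation. */
    double dot_unrolled(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t nblk = n / 4;
        size_t i;

        #pragma omp parallel for reduction(+:s0,s1,s2,s3)
        for (i = 0; i < nblk; ++i) {
            s0 += a[4*i]     * b[4*i];
            s1 += a[4*i + 1] * b[4*i + 1];
            s2 += a[4*i + 2] * b[4*i + 2];
            s3 += a[4*i + 3] * b[4*i + 3];
        }

        /* Leftover elements (when n is not a multiple of 4) are handled serially. */
        for (i = 4 * nblk; i < n; ++i)
            s0 += a[i] * b[i];

        return s0 + s1 + s2 + s3;
    }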

  • 2020-12-09 10:14

    There is a lot of good information in the above, but the proper answer is that some optimizations MUST be turned off when compiling OpenMP code. Some compilers, such as gcc, don't do that.

    The example program at the end of this answer is searching for the value 81 in four non-overlapping ranges of integers. It should always find that value. However, on all gcc versions up to at least 4.7.2, the program sometimes does not terminate with the correct answer. To see for yourself, do this:

    • Copy the program into a file parsearch.c
    • Compile it with gcc -fopenmp -O2 parsearch.c
    • Run it with OMP_NUM_THREADS=2 ./a.out
    • Run it a few more times (maybe 10); you will see two different answers coming out

    Alternatively, you can compile without -O2 (or with -O0), and see that the outcome is always correct.

    Given that the program is free of race conditions, this behaviour of the compiler under -O2 is incorrect.

    The behaviour is due to the global variable globFound. Please convince yourself that under expected execution, only one of the 4 threads in the parallel for writes to that variable. The OpenMP semantics define that if the global (shared) variable is written by only one thread, the value of the global variable after the parallel-for is the value that was written by that single thread. There is no communication between the threads through the global variable, and such would not be allowed as it gives rise to race conditions.

    What the compiler optimization does under -O2 is estimate that writing to a global variable in a loop is expensive, and therefore cache it in a register. This happens in the function findit, which, after optimization, will look like:

    int tempo = globFound ;
    for ( ... ) {
        if ( ... ) {
            tempo = i ;
        }
    }
    globFound = tempo ;


    But with this 'optimized' code, every thread does read and write globFound, and a race condition is introduced by the compiler itself.

    Compiler optimizations do need to be aware of parallel execution. Excellent material about this is published by Hans-J. Boehm, under the general topic of memory consistency.

    #include <stdio.h>
    #define BIGVAL  (100 * 1000 * 1000)

    int globFound ;

    /* Search [from, to) for the value whose square is 81 and record it. */
    void findit( int from, int to )
    {
        int i ;

        for( i = from ; i < to ; i++ ) {
            if( (long long) i * i == 81 ) {   /* the cast avoids signed int overflow for large i */
                globFound = i ;
            }
        }
    }

    int main( int argc, char **argv )
    {
        int p ;

        globFound = -1 ;

        /* Four non-overlapping ranges; only the thread that owns i == 9 ever writes. */
        #pragma omp parallel for
        for( p = 0 ; p < 4 ; p++ ) {
            findit( p * BIGVAL, (p+1) * BIGVAL ) ;
        }
        if( globFound == -1 ) {
            printf( ">>>>NO 81 TODAY<<<<\n\n" ) ;
        } else {
            printf( "Found! N = %d\n\n", globFound ) ;
        }
        return 0 ;
    }
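
    As a hedged aside that is not part of the answer above: if you have to live with such a compiler, one defensive workaround is to make the store explicit so the compiler cannot hoist globFound into a register. The variant below is my sketch and assumes OpenMP 3.1 or newer for the atomic write form.

    /* Defensive variant of findit(): the atomic write forces every hit to be
     * stored to globFound in memory, so the compiler may not cache the
     * variable in a register across the loop. */
    void findit_atomic( int from, int to )
    {
        int i ;

        for( i = from ; i < to ; i++ ) {
            if( (long long) i * i == 81 ) {
                #pragma omp atomic write
                globFound = i ;
            }
        }
    }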
    
  • 2020-12-09 10:23

    Apart from the explicit OpenMP pragmas, compilers just don't know that code can be executed by multiple threads. So they can make that code neither more nor less efficient.

    This has grave consequences in C++. It is particularly a problem for library authors: they cannot reasonably guess up front whether their code will be used in a program that uses threading. This is very visible when you read the source of a common C runtime or standard C++ library implementation: such code tends to be peppered with little locks all over the place to ensure it still operates correctly when used from multiple threads. You pay for this even if you don't actually use that code in a threaded way. A good example is std::shared_ptr<>: you pay for the atomic update of the reference count even if the smart pointer is only ever used in one thread, and the standard doesn't provide a way to ask for non-atomic updates; a proposal to add that feature was rejected.

    And it is very much detrimental the other way as well: there isn't anything the compiler can do to ensure that your own code is thread-safe. That is entirely up to you. It is hard to do, and it goes wrong in subtle and very difficult-to-diagnose ways all the time.

    Big problems, not simple to solve. Maybe that's a good thing, otherwise everybody could be a programmer ;)

  • 2020-12-09 10:30

    I think this answer describes the reason sufficiently, but I'll expand a bit here.

    Before that, however, here's gcc 4.8's documentation on -fopenmp:

    -fopenmp
    Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.

    Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.

    The reason, however, why OpenMP with 1 thread has overhead with respect to no OpenMP at all is that the compiler needs to convert the code, outlining the loop into a separate function so that it is ready for the case of n > 1 threads. So let's think of a simple example:

    int *b = ...
    int *c = ...
    int a = 0;
    int i;

    #pragma omp parallel for reduction(+:a)
    for (i = 0; i < 100; ++i)
        a += b[i] + c[i];


    This code should be converted to something like this:

    struct __omp_func1_data
    {
        int start;
        int end;
        int *b;
        int *c;
        int a;
    };
    
    void *__omp_func1(void *data)
    {
        struct __omp_func1_data *d = data;
        int i;
    
        d->a = 0;
        for (i = d->start; i < d->end; ++i)
            d->a += d->b[i] + d->c[i];
    
        return NULL;
    }
    
    ...
    for (t = 1; t < nthreads; ++t) {
        /* create a worker thread that runs __omp_func1 on its chunk */
    }
    /* the master thread handles its own chunk directly, without creating a thread */
    struct __omp_func1_data md = {
        .start = /*...*/,
        .end = /*...*/,
        .b = b,
        .c = c
    };

    __omp_func1(&md);
    a += md.a;
    for (t = 1; t < nthreads; ++t)
    {
        /* join with the worker thread */
        /* add that thread's ->a to a */
    }
    

    Now if we run this with nthreads==1, the code effectively gets reduced to:

    struct __omp_func1_data
    {
        int start;
        int end;
        int *b;
        int *c;
        int a;
    };
    
    void *__omp_func1(void *data)
    {
        struct __omp_func1_data *d = data;
        int i;
    
        d->a = 0;
        for (i = d->start; i < d->end; ++i)
            d->a += d->b[i] + d->c[i];
    
        return NULL;
    }
    
    ...
    struct __omp_func1_data md = {
        .start = 0,
        .end = 100,
        .b = b,
        .c = c
    };
    
    __omp_func1(&md);
    a += md.a;
    

    So what are the differences between the no openmp version and the single threaded openmp version?

    One difference is that there is extra glue code. The variables that need to be passed to the function created by OpenMP need to be put together to form one argument, so there is some overhead in preparing for the function call (and later retrieving the data).

    More importantly, however, the code is no longer in one piece. Cross-function optimization is not so advanced yet, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize, as the sketch below illustrates.
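
    Here is a hedged sketch of that point (mine, not the answerer's; the names sum_inline, sum_outlined and struct chunk are made up). With the loop written inline and constant bounds, the compiler can fully unroll, vectorize or partially evaluate it; once the same loop is outlined behind a struct pointer, roughly what -fopenmp does, the bounds are only known at run time and those optimizations are much harder to apply across the call boundary.

    /* Inline version: the trip count (100) is a compile-time constant. */
    int sum_inline(const int *b, const int *c)
    {
        int a = 0;
        int i;

        for (i = 0; i < 100; ++i)
            a += b[i] + c[i];
        return a;
    }

    /* Outlined version, shaped like the compiler-generated __omp_func1:
     * the bounds arrive through a struct at run time, so the optimizer
     * knows much less about the loop. */
    struct chunk { int start; int end; };

    int sum_outlined(const int *b, const int *c, const struct chunk *d)
    {
        int a = 0;
        int i;

        for (i = d->start; i < d->end; ++i)
            a += b[i] + c[i];
        return a;
    }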


    To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3)

    Running gcc -Q -v some_file.c gives this (relevant) output:

    GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
    options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486
     -fstack-protector
    options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
     -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
     -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
     -finline-functions-called-once -fira-share-save-slots
     -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
     -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
     -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
     -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
     -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
     -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
     -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
     -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
     -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
     -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
     -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
     -mpush-args -msahf -mtls-direct-seg-refs
    

    and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:

    GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
    options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic
     -march=i486 -fopenmp -fstack-protector
    options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
     -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
     -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
     -finline-functions-called-once -fira-share-save-slots
     -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
     -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
     -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
     -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
     -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
     -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
     -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
     -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
     -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
     -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
     -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
     -mpush-args -msahf -mtls-direct-seg-refs
    

    Taking a diff, we can see that the only difference is that with -fopenmp, -D_REENTRANT is defined (and of course -fopenmp is enabled). So, rest assured, gcc wouldn't produce worse code. It's just that it needs to add preparation code for the case where the number of threads is greater than 1, and that has some overhead.


    Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the output of the same commands with -O3 added gives the same difference. So, even with -O3, no optimizations are disabled.
