Fetch-and-add using OpenMP atomic operations

谁都会走 提交于 2019-12-03 13:44:25

As of openmp 3.1 there is support for capturing atomic updates, you can capture either the old value or the new value. Since we have to bring the value in from memory to increment it anyways, it only makes sense that we should be able to access it from say, a CPU register and put it into a thread-private variable.

There's a nice work-around if you're using gcc (or g++), look up atomic builtins: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

It think Intel's C/C++ compiler also has support for this but I haven't tried it.

For now (until openmp 3.1 is implemented), I've used inline wrapper functions in C++ where you can choose which version to use at compile time:

template <class T>
inline T my_fetch_add(T *ptr, T val) {
  #ifdef GCC_EXTENSION
  return __sync_fetch_and_add(ptr, val);
  #endif
  #ifdef OPENMP_3_1
  T t;
  #pragma omp atomic capture
  { t = *ptr; *ptr += val; }
  return t;
  #endif
}

Update: I just tried Intel's C++ compiler, it currently has support for openmp 3.1 (atomic capture is implemented). Intel offers free use of its compilers in linux for non-commercial purposes:

http://software.intel.com/en-us/articles/non-commercial-software-download/

GCC 4.7 will support openmp 3.1, when it eventually is released... hopefully soon :)

If you want to get old value of x and a is not changed, use (x-a) as old value:

fetch_and_add(int *x, int a) {
 #pragma omp atomic
 *x += a;

 return (*x-a);
}

UPDATE: it was not really an answer, because x can be modified after atomic by another thread. So it's seems to be impossible to make universal "Fetch-and-add" using OMP Pragmas. As universal I mean operation, which can be easily used from any place of OMP code.

You can use omp_*_lock functions to simulate an atomics:

typedef struct { omp_lock_t lock; int value;} atomic_simulated_t;

fetch_and_add(atomic_simulated_t *x, int a)
{
  int ret;
  omp_set_lock(x->lock);
  x->value +=a;
  ret = x->value;
  omp_unset_lock(x->lock);
}

This is ugly and slow (doing a 2 atomic ops instead of 1). But If you want your code to be very portable, it will be not the fastest in all cases.

You say "as the following (only non-locking)". But what is the difference between "non-locking" operations (using CPU's "LOCK" prefix, or LL/SC or etc) and locking operations (which are implemented itself with several atomic instructions, busy loop for short wait of unlock and OS sleeping for long waits)?

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!