GCC Hinting at Vectorization

问题

I would like GCC to vectorize the below code.-fopt-info tells me that GCC is not currently. I believe the problem is the strided access of W or possible the backward incrementing of k. Note that height and width are constants and index_type is set to unsigned long currently.

I removed some comments

114  for (index_type k=height-1;k+1>0;k--) {
116    for (index_type i=0;i<width;i++) {
117      Yp[k*width + i] = 0.0;                                                            
119      for (index_type j=0;j<width;j++) {                                                                            
121        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122      }
123      Yp[k*width + i] *= DF(Ap[k*width + i]);
124    }
125  }

I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm

Is there a good way to make this vectorize? Can you refer me to further information?

Is my method for indexing a bad idea (ie the k*width*width + ...)? I need to dynamically allocate, and I thought that keeping things close in memory would better enable optimizations.

EDIT: This might be useful

The -fopt-info-missed output for these lines

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.

EDIT:

Minimal example is HERE

I am trying it with BLAS. In the minimal example, it goes faster, but on the whole code it is slower ... not sure why

EDIT:

Compiler was optimizing out code. Fixed. BLAS is now faster. The fix was on the whole code, not the minimal example.

EDIT:

Same code as in the link from the previous edit

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>

typedef float value_type;
typedef unsigned long index_type;

static value_type F(value_type v) {
  return 1.0/(1.0 + exp(-v));
}

static value_type DF(value_type v) {
  const value_type Ev = exp(-v);
  return Ev/((1.0 + Ev)*(1.0 + Ev));
}

#ifndef WITH_BLAS

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    for (index_type i=0;i<width;i++) {
      Yp[k*width + i] = 0.0;
      for (index_type j=0;j<width;j++) {
    Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
      }
      Yp[k*width + i] *= DF(Ap[k*width + i]);
    }
  }
}

#else

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
        W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
    for (index_type i=0;i<width;i++)
      Yp[k*width + i] *= DF(Ap[k*width + i]);
  }
}

#endif

int main() {
  const index_type height=10, width=10000;

  value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
    *W=malloc(height*width*width*sizeof(value_type)),
    *Yp=malloc((height+1)*width*sizeof(value_type)),
    *Dp=malloc(width*sizeof(value_type));

  get_Yp(Ap, W, Yp, Dp, height, width);
  printf("Done %f\n", Yp[3]);

  return 0;
}

回答1:

j-loop is well vectorizable SIMD reduction loop with constant stride of "width"-elements. You can vectorize it using modern compilers. This code is vectorizable with Intel Compiler and should be vectorizable by GCC under some conditions.
First of all, Reduction is a special case of "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless "reduction" pattern is (a) auto-recognized by Compiler (not so easy and strictly speaking not so valid/expected behavior) or (b) communicated by developer to compiler explicitely using OpenMP or similar standards.

To "communicate" to compiler that there is a reduction - you need to use #pragma omp simd reduction (+ : variable_name) before j-loop.

This is only supported starting from OpenMP4.0. So you have to use GCC version which supports OpenMP4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran"

I would also use temporarily local variable for accumulating reduction ( OpenMP4.0 requires reduction variable to be used that way):

 tmpSUM = 0; 
 #pragma omp simd reduction (+: tmpSUM) 
 for (index_type j=0;j<width;j++) {                                                                            
        tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
      }
 Yp[k*width + i] = tmpSUM

I'd also suggest to use signed int instead of unsigned, because unsinged induction variables are pretty bad for all modern vectorizers, at least by introducing extra overhead. I would not be surpised if using unsigned was one of the main reasons, "confused" GCC.
Now, you could be not-satisfied with my reply, because it tells about how it should work (and how it works in ICC/ICPC). It doesn't take into account specific nuances of GCC (which for reduction seemed to do oppostie), as one can see in GCC optimization report.

So, if you are still limited to GCC, I would suggest:

Make sure it's fresh enough GCC (4.9 or higer)
Use signed induction variable and still try omp simd reduction on temporary local tmp SUM(because it should enable more advanced vectorization techniques anyway)
If none above help, take a look at "weird" things like described here (very similar to your case): What do gcc's auto-vectorization messages mean? or consider using other compiler/compiler version.
1. Last comment: is access pattern in your code and more generally const-stride so bad? Answer: for some platforms/compilers const-stride will not kill your performance. But still, ideally you need more cache-friendly algorithm. Check Not sure how to explain some of the performance results of my parallelized matrix multiplication code . One more option is considering MKL or BLAS (like suggested by other reply) if your code is really memory-bound and if you don't have time to deal with memory optimizations yourself.

回答2:

I tested the following code in GCC 5.3.1, Clang 3.7.1 , ICC 13.0.1, and MSVC 2015

void foo(float *x) 
    unsigned i;
    for (i=0; i<1024; i++) x[0] += x[1024 + i];
}

I used -Ofast for GCC, Clang, and ICC and /O2 /fp:fast with MSVC. Looking at the assembly shows that only ICC managed to vectorize the loop.

However, with the same compile options all the compilers vectorized the following code

void foo2(float *x) {
    float sum = 0.0f;
    unsigned i;
    for (i=0; i<1024; i++) sum += x[1024 + i];
    x[0] = sum;
}

I'm not sure why only ICC vectorized foo. It seems clear to me that there is no dependency.

GCC does not unroll the loop (and -funroll-loops does not help). MSVC unroll two times, Clang unrolls four times, and ICC eight times. Intel processors since at et least Core2 have a latency of at least 3 cycles for addition so unrolling four or more times is going to work better than twice or none at all.

In any case using

value_type sum = 0.0;
for (index_type j=0;j<width;j++) {
    sum += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = sum*DF(Ap[k*width + i]);

vectorizes your loop.

回答3:

In my experience, demanding GCC to vectorize properly is toublesome. Especially if you hope to fully utilize modern hardware (e.g. AVX2). I deal a lot with vectorization in my code. My solution: Do not attempt to use GCC for vectorization at all. Rather formulate all operations you'd like to be vectorized in terms of BLAS (basic linear algebra subprograms) calls, and link with the OpenBLAS library which implements BLAS for modern hardware. OpenBLAS also offers parallelization on shared memory nodes (either with pthreads or OpenMP), which - depending on your application - can be very important to further increase execution speed: in this way you can combine vectorization with (shared memory) parallelization.

Furthermore, I suggest that you align your memory to fully utilize AVX and/or AVX2 etc. . I.e. do not allocate memory with malloc or new, use memalign or aligned_alloc (depending on what your system supports). For example, in case you intend to use AVX2 you should align your allocation so that the address is a multiple of 64 bytes (8 * 8 doubles).

来源：https://stackoverflow.com/questions/34860918/gcc-hinting-at-vectorization

标签

gcc

vectorization

auto-vectorization