segmentation fault for `vmovaps'

I wrote a code to add two arrays using KNC instructions with (512bit long vectors) on Xeon Phi intel coprocessor. However I've got segmentation part in the inline assembly part.

Here it is my code:

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length / 16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
            A[i] = 1;
            B[i] = 2;
    }

    float * pA = A;
    float * pB = B;
    float * pC = C;
    for(i=0; i<AVXLength; i++ ){
         __asm__("vmovaps %1,%%zmm0\n"
                    "vmovaps %2,%%zmm1\n"
                    "vaddps %%zmm0,%%zmm0,%%zmm1\n"
                    "vmovaps %%zmm0,%0;"
            : "=m" (pC) : "m" (pA), "m" (pB));

            pA += 512;
            pB += 512;
            pC += 512;
    }
    return 0;
}

I am using gcc as a compiler (because I don't have money to buy intel compiler). And this is my command line to compile this code:

k1om-mpss-linux-gcc add.c -o add.out

The problem was in the inline assembly. The following inline assembly fixed it.

__asm__("vmovaps %1,%%zmm1\n"
        "vmovaps %2,%%zmm2\n"
        "vaddps %%zmm1,%%zmm2,%%zmm3\n"
        "vmovaps %%zmm3,%0;"
        : "=m" (*pC) : "m" (*pA), "m" (*pB));

Z boson

As already explained, Knights Corner (KNC) does not have AVX512. However, it does have something similar. It turns out that the KNC vs AVX512 issue is a red herring here. The problem is in the OPs inline assembly.

Instead of using inline assembly I suggest you use intrinsics. The KNC intrinsics are described at the Intel Intrinsic Guide online.

Additionally, Przemysław Karpiński at CERN extend Agner Fog's Vector Class Library to use KNC. You can find the git repository here. If you look in the file vectorf512_mic.h you can learn a lot about the KNC intrinsics.

I converted your code to use these intrinsics (which turn out in this case to be the same as the AVX512 intrinsics):

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length /16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    for(i=0; i<AVXLength; i++ ){
        __m512 a16 = _mm512_load_ps(&A[16*i]);
        __m512 b16 = _mm512_load_ps(&B[16*i]);
        __m512 s16 = _mm512_add_ps(a16,b16);
        _mm512_store_ps(&C[16*i], s16);
    }
    return 0;
}

The KNC intrinsics are only supported by ICC. However, KNC comes with the Manycore Platform Software Stack (MCSS) which comes with a special version of gcc, k1om-mpss-linux-gcc, which can use the AVX512 like features of KNC using inline assembly.

The mnemoncis for KNC and AVX512 are the same in this case. Therefore we can use AVX512 intrinsics to discover the assembly to use

void foo(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16,b16);
    _mm512_store_epi32(C, s16);
}

and gcc -O3 -mavx512 knc.c produces

vmovaps (%rdi), %zmm0
vaddps  (%rsi), %zmm0, %zmm0
vmovaps %zmm0, (%rdx)

From this one solution using inline assembly would be

__asm__("vmovaps   (%1), %%zmm0\n"
        "vpaddps   (%2), %%zmm0, %%zmm0\n"
        "vmovaps   %%zmm0, (%0)"
        :
        : "r" (pC), "r" (pA), "r" (pB)
        :
);

With the previous code GCC generates add instructions for each array. Here is a better solution using an index register which only generates one add.

for(i=0; i<length; i+=16){
    __asm__ __volatile__ (
            "vmovaps   (%1,%3,4), %%zmm0\n"
            "vpaddps   (%2,%3,4), %%zmm0, %%zmm0\n"
            "vmovaps   %%zmm0, (%0,%3,4)"
            :
            : "r" (C), "r" (A), "r" (B), "r" (i)
            : "memory"
     );
 }

The latest version of the MPSS (3.6) includes GCC 5.1.1 which supports AVX512 intrinsics. So I think you can use AVX512 intrinsics whenever they are the same as the KNC intrinsics and only use inline assembly when they disagree. Looking at the Intel Intrinsic guide shows that they are the same in most cases.

来源：https://stackoverflow.com/questions/34148636/segmentation-fault-for-vmovaps

标签

Linux

gcc

inline-assembly

xeon-phi