segmentation fault for `vmovaps'

Submitted by 戏子无情 on 2019-12-02 10:04:35
Z boson

As already explained, Knights Corner (KNC) does not have AVX512, although it does have something similar. It turns out, however, that the KNC vs. AVX512 issue is a red herring here: the problem is in the OP's inline assembly.

Instead of using inline assembly, I suggest you use intrinsics. The KNC intrinsics are described in the online Intel Intrinsics Guide.

Additionally, Przemysław Karpiński at CERN extended Agner Fog's Vector Class Library to support KNC. You can find the git repository here. The file vectorf512_mic.h in particular teaches a lot about the KNC intrinsics.

I converted your code to use these intrinsics (which turn out in this case to be the same as the AVX512 intrinsics):

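#include <immintrin.h>

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length / 16;   /* 16 floats per 512-bit vector */
    /* _mm_malloc returns memory aligned to 64 bytes, as required by the
       aligned load/store intrinsics used below */
    float *A = (float*) _mm_malloc(length * sizeof(float), 64);
    float *B = (float*) _mm_malloc(length * sizeof(float), 64);
    float *C = (float*) _mm_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    /* C[16*i .. 16*i+15] = A[...] + B[...], one vector per iteration */
    for(i=0; i<AVXLength; i++ ){
        __m512 a16 = _mm512_load_ps(&A[16*i]);
        __m512 b16 = _mm512_load_ps(&B[16*i]);
        __m512 s16 = _mm512_add_ps(a16,b16);
        _mm512_store_ps(&C[16*i], s16);
    }
    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
    return 0;
}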
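
The aligned load and store intrinsics used above expect 64-byte-aligned addresses, so it can be worth checking the pointers and the result explicitly. The following is a minimal sketch in portable C, not part of the original answer: it uses C11 `aligned_alloc` rather than the allocator in the listing above, and the helper names `is_aligned_64`, `alloc_floats_64` and `count_mismatches` are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p lies on a 64-byte boundary, 0 otherwise. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}

/* Hypothetical helper: allocates room for n floats on a 64-byte boundary
   (release with free). C11 aligned_alloc wants the size to be a multiple
   of the alignment, so the byte count is rounded up to a multiple of 64. */
float *alloc_floats_64(size_t n)
{
    size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    return (float *)aligned_alloc(64, bytes);
}

/* Counts the elements of c[0..n-1] that differ from expected. */
size_t count_mismatches(const float *c, size_t n, float expected)
{
    size_t bad = 0;
    assert(c != NULL);
    for (size_t i = 0; i < n; i++) {
        if (c[i] != expected)
            bad++;
    }
    return bad;
}
```

In the program above, `count_mismatches(C, length, 3.0f)` would be expected to return 0 after the vector loop, since every element is 1 + 2.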

The KNC intrinsics are only supported by ICC. However, KNC comes with the Manycore Platform Software Stack (MPSS), which ships with a special version of gcc, k1om-mpss-linux-gcc, that can use the AVX512-like features of KNC through inline assembly.


The mnemonics for KNC and AVX512 are the same in this case. Therefore we can use AVX512 intrinsics to discover which assembly to use:

void foo(float *A, float *B, float *C) {
    __m512 a16 = _mm512_load_ps(A);
    __m512 b16 = _mm512_load_ps(B);
    __m512 s16 = _mm512_add_ps(a16,b16);
    _mm512_store_ps(C, s16);
}

and gcc -O3 -mavx512f -S knc.c produces

vmovaps (%rdi), %zmm0
vaddps  (%rsi), %zmm0, %zmm0
vmovaps %zmm0, (%rdx)

From this, one solution using inline assembly would be:

__asm__("vmovaps   (%1), %%zmm0\n"
        "vaddps    (%2), %%zmm0, %%zmm0\n"
        "vmovaps   %%zmm0, (%0)"
        :
        : "r" (pC), "r" (pA), "r" (pB)
        : "memory"   /* the asm stores to *pC behind the compiler's back */
);

With the previous code, GCC generates a separate add instruction for each array pointer. Here is a better solution that uses an index register, so that only one add is generated:

for(i=0; i<length; i+=16){
    __asm__ __volatile__ (
            "vmovaps   (%1,%3,4), %%zmm0\n"
            "vaddps    (%2,%3,4), %%zmm0, %%zmm0\n"
            "vmovaps   %%zmm0, (%0,%3,4)"
            :
            /* the index must be 64 bits wide to match the base registers */
            : "r" (C), "r" (A), "r" (B), "r" ((long) i)
            : "memory"
    );
}
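
In AT&T syntax, an operand such as `(%1,%3,4)` addresses base + index × 4, which is why one counter can serve all three arrays. As a rough illustration only, the portable C sketch below models that addressing and the 16-floats-per-iteration loop with scalar code. The names `effective_address` and `add_indexed` are hypothetical, and the inner loop merely stands in for a single 512-bit instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address computed by an x86 memory operand written as (base,index,scale). */
uintptr_t effective_address(const void *base, size_t index, size_t scale)
{
    return (uintptr_t)base + (uintptr_t)(index * scale);
}

/* Scalar model of the loop: a single index i advances by 16 floats and is
   shared by A, B and C. Assumes length is a multiple of 16. */
void add_indexed(float *C, const float *A, const float *B, size_t length)
{
    assert(length % 16 == 0);
    for (size_t i = 0; i < length; i += 16) {
        const float *a = (const float *)effective_address(A, i, sizeof(float));
        const float *b = (const float *)effective_address(B, i, sizeof(float));
        float *c = (float *)effective_address(C, i, sizeof(float));
        /* stands in for one 16-lane vector add */
        for (size_t lane = 0; lane < 16; lane++)
            c[lane] = a[lane] + b[lane];
    }
}
```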

The latest version of MPSS (3.6) includes GCC 5.1.1, which supports AVX512 intrinsics. So I think you can use AVX512 intrinsics wherever they match the KNC intrinsics and fall back to inline assembly only where they differ. The Intel Intrinsics Guide shows that they are the same in most cases.
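
One way to follow this advice is to keep the intrinsics path behind a preprocessor check and provide a plain C fallback. The sketch below is only an assumption about how that might look. `__AVX512F__` is the macro that GCC, Clang and ICC define when AVX512F code generation is enabled; whether a particular KNC toolchain defines it is something to verify rather than take for granted. The function name `add_f32` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Adds a[i] + b[i] into c[i] for i in [0, n). */
void add_f32(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && c != NULL);

#if defined(__AVX512F__)
    /* The aligned intrinsics need 64-byte-aligned addresses, so the vector
       path is taken only when all three pointers qualify. */
    if ((((uintptr_t)a | (uintptr_t)b | (uintptr_t)c) % 64) == 0) {
        for (; i + 16 <= n; i += 16) {
            __m512 va = _mm512_load_ps(a + i);
            __m512 vb = _mm512_load_ps(b + i);
            _mm512_store_ps(c + i, _mm512_add_ps(va, vb));
        }
    }
#endif

    /* Scalar loop: handles everything when the vector path is not compiled
       in or not taken, and the leftover tail otherwise. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Only the three intrinsics that the answer already uses appear here; anything that differs between KNC and AVX512 would still need its own inline assembly branch.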
