SIMD alignment issue with PPL Combinable

问题

I'm trying to sum the elements of array in parallel with SIMD. To avoid locking I'm using combinable thread local which is not always aligned on 16 bytes because of that _mm_add_epi32 is throwing exception

concurrency::combinable<__m128i> sum_combine;

int length = 40; // multiple of 8
concurrency::parallel_for(0, length , 8, [&](int it)
{

    __m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
    __m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));

    auto temp = _mm_add_epi32(v1, v2);

    auto &sum = sum_combine.local();   // here is the problem 


    TRACE(L"%d\n", it);
    TRACE(L"add %x\n", &sum);

    ASSERT(((unsigned long)&sum & 15) == 0);

    sum = _mm_add_epi32(temp, sum);
}
);

here is defination of combinable from ppl.h

template<typename _Ty>
class combinable
{
private:

// Disable warning C4324: structure was padded due to __declspec(align())
// This padding is expected and necessary.
#pragma warning(push)
#pragma warning(disable: 4324)
    __declspec(align(64))
    struct _Node
    {
        unsigned long _M_key;
        _Ty _M_value;                   // this might not be aligned on 16 bytes
        _Node* _M_chain;

        _Node(unsigned long _Key, _Ty _InitialValue)
            : _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
        {
        }
    };

sometimes alignment is ok and the code works fine, but most of time its not working

I have tried to used the following, but this doesn't compile

union combine 
{
        unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
        __m128i y;
};

concurrency::combinable<combine> sum_combine;
then auto &sum = sum_combine.local().y;

Any suggestions for correcting the alignment issue, still using combinable.

On x64 it works fine bcause of default 16 bytes alignment. On x86 sometimes alignment problems exists.

回答1:

Just loaded sum using unaligned load

auto &sum = sum_combine.local();


#if !defined(_M_X64) 

if (((unsigned long)&sum & 15) != 0)
{
    // just for breakpoint means, sum  is unaligned.
    int a = 5;
}
auto sum_temp = _mm_loadu_si128(&sum);
sum = _mm_add_epi32(temp, sum_temp);

#else

sum = _mm_add_epi32(temp, sum);

#endif

回答2:

Since the sum variable being used with _mm_add_epi32 is not aligned you need to explicitly load/store sum using unaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128). Change:

sum = _mm_add_epi32(temp, sum);

to:

__m128i v2 = _mm_loadu_si128((__m128i *)&sum);
v2 = _mm_add_epi32(v2, temp);
_mm_storeu_si128((__m128i *)&sum, v2);

来源：https://stackoverflow.com/questions/31369019/simd-alignment-issue-with-ppl-combinable

标签

c++

simd

ppl