Question
I am loading elements from memory using SIMD load instructions, let's say with AltiVec, assuming aligned addresses:
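float X[SIZE];
vector float V0;
unsigned FLOAT_VEC_SIZE = sizeof(vector float); /* 16 bytes, i.e. 4 floats */
/* vec_ld takes a byte offset, so the loop bound is expressed in bytes too */
for (unsigned load_index = 0; load_index < SIZE * sizeof(float); load_index += FLOAT_VEC_SIZE)
{
    V0 = vec_ld(load_index, X);
    /* some computation involving V0 */
}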
Now, if SIZE is not a multiple of the number of floats in a vector (4 for vector float), V0 may contain some invalid memory elements in the last loop iteration. One way to avoid that is to reduce the loop by one iteration; another is to mask off the potentially invalid elements. Is there any other useful trick here? Note that the loop above is the innermost of a set of nested loops, so any additional non-SIMD instruction will come with a performance penalty!
Answer 1:
Ideally you should pad your array to a multiple of vec_step(vector float) (i.e. a multiple of 4 elements) and then either mask out any unwanted values in the SIMD processing or use scalar code to deal with the last few elements, e.g.
const int VF_ELEMS = vec_step(vector float); // number of floats per vector (4)
const int VEC_SIZE = (SIZE + VF_ELEMS - 1) / VF_ELEMS; // number of vectors in X, rounded up
vector float VX[VEC_SIZE]; // padded array with 16 byte alignment
float *X = (float *)VX; // float * pointer to base of array
vector float V0;
int i; // declared outside the loop so it is still in scope for the tail check
for (i = 0; i <= SIZE - VF_ELEMS; i += VF_ELEMS)
{ // for each full SIMD vector
    V0 = vec_ld(0, &X[i]);
    /* some computation involving V0 */
}
if (i < SIZE) // if we have a partial vector at the end
{
#if 1 // either use SIMD and mask out the unwanted values
    V0 = vec_ld(0, &X[i]);
    /* some SIMD computation involving partial V0 */
#else // or use a scalar loop for the remaining 1..3 elements
    /* small scalar loop to handle remaining points */
#endif
}
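The split between full vectors and a scalar tail can be sketched in portable C. This is only an illustration under stated assumptions: the inner four-lane loop stands in for one AltiVec operation so the code runs without AltiVec support, and the names `padded_vectors` and `sum_with_scalar_tail` (and the choice of a sum as the computation) are hypothetical, not from the answer.

```c
#include <assert.h>
#include <stddef.h>

#define VF_ELEMS 4 /* floats per 16-byte vector; assumed to match vec_step(vector float) */

/* Number of vectors needed to hold n floats, rounded up (as VEC_SIZE above). */
size_t padded_vectors(size_t n)
{
    return (n + VF_ELEMS - 1) / VF_ELEMS;
}

/* Hypothetical kernel: sums n floats, four lanes at a time, then a scalar tail. */
float sum_with_scalar_tail(const float *x, size_t n)
{
    float lanes[VF_ELEMS] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    assert(x != NULL || n == 0);

    /* Full vectors only: this loop never reads an element at index >= n. */
    for (; i + VF_ELEMS <= n; i += VF_ELEMS)
        for (size_t k = 0; k < VF_ELEMS; ++k) /* stands in for one SIMD add */
            lanes[k] += x[i + k];

    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail: the remaining 0..3 elements. */
    for (; i < n; ++i)
        total += x[i];

    return total;
}
```

Writing the loop condition as `i + VF_ELEMS <= n` rather than `i <= n - VF_ELEMS` keeps the comparison safe when the index type is unsigned and n is smaller than one vector.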
Answer 2:
Sometimes zero-padding is not an option, for example when the array is const. Adding scalar code, on the other hand, can result in intermixing vector and scalar results, for example when writing computation results back, so masking off the unwanted values looks like a better solution. Note that this assumes addresses with 16-byte alignment. A toy example that clears the last three elements of a SIMD vector:
vector bool int V_MASK = (vector bool int) {0,0,0,0};
unsigned int all_ones = 0xFFFFFFFF; // 32 bits, the width of one vector element
unsigned int * ptr_mask = (unsigned int *) &V_MASK;
ptr_mask[0] = all_ones; // keep element 0; elements 1..3 stay zero
vector float XV = vec_ld(0,some_float_ptr);
XV = vec_and(XV,V_MASK); // elements 1..3 of XV are now 0.0f
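The toy example hard-codes a single valid element. A slightly more general sketch, assuming the number of valid elements in the last vector (SIZE % 4) is known, builds the mask from that count. The helper names `make_tail_mask` and `apply_mask` are hypothetical, a 32-bit float is assumed, and the plain C loop only emulates what `vec_and` does per element so the code runs without AltiVec.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VF_ELEMS 4 /* elements per vector; assumed, as for vector float */

/* Fill mask[] so the first `valid` elements are all ones and the rest are zero. */
void make_tail_mask(unsigned valid, uint32_t mask[VF_ELEMS])
{
    assert(valid <= VF_ELEMS);
    for (unsigned k = 0; k < VF_ELEMS; ++k)
        mask[k] = (k < valid) ? 0xFFFFFFFFu : 0u;
}

/* Emulates vec_and(XV, V_MASK): bitwise AND of each float with its mask word. */
void apply_mask(float v[VF_ELEMS], const uint32_t mask[VF_ELEMS])
{
    for (unsigned k = 0; k < VF_ELEMS; ++k) {
        uint32_t bits;
        memcpy(&bits, &v[k], sizeof bits); /* reinterpret the float's bits */
        bits &= mask[k];
        memcpy(&v[k], &bits, sizeof bits);
    }
}
```

Masked-off elements become 0.0f, which is neutral for additions but not for every operation (a product or a minimum, for instance, would need a different fill value).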
Source: https://stackoverflow.com/questions/13029427/avoiding-invalid-memory-load-with-simd-instructions