Porting MMX/SSE instructions to AltiVec

不问归期 提交于 2019-12-05 18:36:08

You're not far off - I fixed a few minor problems, cleaned up the code a little, added a test harness, and it seems to work OK now:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <altivec.h>

static int convolve_ref(const short *a, const short *b, int n)
{
    int out = 0;
    int i;

    for (i = 0; i < n; ++i)
    {
        out += a[i] * b[i];
    }

    return out;
}

static inline int convolve_altivec(const short *a, const short *b, int n)
{
    int out = 0;
    union {
        vector signed int m128;
        int i32[4];
    } tmp;

    const vector signed int zero = {0, 0, 0, 0};

    assert(((unsigned long)a & 15) == 0);
    assert(((unsigned long)b & 15) == 0);

    tmp.m128 = zero;

    while (n >= 8)
    {
        tmp.m128 = vec_msum(*((vector signed short *)a),
                            *((vector signed short *)b), tmp.m128);

        a += 8;
        b += 8;
        n -= 8;
    }

    out = tmp.i32[0] + tmp.i32[1] + tmp.i32[2] + tmp.i32[3];

    while (n --)
        out += (*(a++)) * (*(b++));

    return out;
}

int main(void)
{
    const int n = 100;

    vector signed short _a[n / 8 + 1];
    vector signed short _b[n / 8 + 1];

    short *a = (short *)_a;
    short *b = (short *)_b;

    int sum_ref, sum_test;

    int i;

    for (i = 0; i < n; ++i)
    {
        a[i] = rand();
        b[i] = rand();
    }

    sum_ref = convolve_ref(a, b, n);
    sum_test = convolve_altivec(a, b, n);

    printf("sum_ref = %d\n", sum_ref);
    printf("sum_test = %d\n", sum_test);

    printf("%s\n", sum_ref == sum_test ? "PASS" : "FAIL");

    return 0;
}

(Warning: all of my Altivec experience comes from working on Xbox360/PS3 - I'm not sure how different they are from other Altivec platforms).

First off, you should check your pointer alignment. Most vector loads (and stores) operations are expected to be from 16-byte aligned addresses. If they aren't, things will usually carry on without warning, but you won't get the data you were expecting.

It's possible (but slower) to do unaligned loads, but you basically have to read a bit before and after your data and combine them. See Apple's Altivec page. I've also done it before using an lvlx and lvrx load instructions, and then ORing them together.


Next up, I'm not sure your multiplies and adds are the same. I've never used either _mm_madd_pi16 or vec_msum, so I'm not positive they're equivalent. You should step through in a debugger and make sure they give you the same output for the same input data. Another possible difference is that they may treat overflow differently (e.g. modular vs. saturate).


Last but not least, you're computing 4 ints at a time instead of 2. So your union should hold 4 ints, and you should sum all 4 of them at the end.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!