I need a fast memory transpose algorithm for my Gaussian convolution function in C/C++. What I do now is:
convolute_1D
transpose
convolute_1D
transpose
FWIW, on a 3-year-old Core i7 M laptop CPU, this naive 4x4 transpose was barely slower than your SSE version, and it was almost 40% faster than the SSE version on a newer Intel Xeon E5-2630 v2 @ 2.60GHz desktop CPU.
inline void transpose4x4_naive(float *A, float *B, const int lda, const int ldb) {
const float r0[] = { A[0], A[1], A[2], A[3] }; // memcpy instead?
A += lda;
const float r1[] = { A[0], A[1], A[2], A[3] };
A += lda;
const float r2[] = { A[0], A[1], A[2], A[3] };
A += lda;
const float r3[] = { A[0], A[1], A[2], A[3] };
B[0] = r0[0];
B[1] = r1[0];
B[2] = r2[0];
B[3] = r3[0];
B += ldb;
B[0] = r0[1];
B[1] = r1[1];
B[2] = r2[1];
B[3] = r3[1];
B += ldb;
B[0] = r0[2];
B[1] = r1[2];
B[2] = r2[2];
B[3] = r3[2];
B += ldb;
B[0] = r0[3];
B[1] = r1[3];
B[2] = r2[3];
B[3] = r3[3];
}
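To use a 4x4 kernel like this on a whole matrix, you walk the source in 4x4 tiles and write each tile to its mirrored position in the destination. Here is a minimal sketch, assuming row-major storage, separate source and destination buffers, and dimensions that are multiples of 4. The names `transpose4x4_tile` and `transpose_blocked` are made up for this example, and the tile kernel is a compact loop-based stand-in for the unrolled function above.

```cpp
#include <cassert>

// Hypothetical compact stand-in for the unrolled 4x4 kernel: reads a 4x4 tile
// from A (row stride lda) and writes its transpose to B (row stride ldb).
inline void transpose4x4_tile(const float *A, float *B, int lda, int ldb) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            B[j * ldb + i] = A[i * lda + j];
}

// Transposes the rows x cols row-major matrix `src` into the cols x rows
// row-major matrix `dst`, one 4x4 tile at a time.
// Assumptions: rows and cols are multiples of 4, and src and dst do not overlap.
void transpose_blocked(const float *src, float *dst, int rows, int cols) {
    assert(rows % 4 == 0 && cols % 4 == 0);
    for (int i = 0; i < rows; i += 4)
        for (int j = 0; j < cols; j += 4)
            // The tile at (i, j) in src lands at (j, i) in dst.
            transpose4x4_tile(src + i * cols + j, dst + j * rows + i, cols, rows);
}
```

Matrices whose sizes are not multiples of 4 would need a scalar cleanup loop for the leftover rows and columns, which this sketch leaves out.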
Strangely enough, the older laptop CPU is faster than the dual E5-2630 v2 desktop with twice the cores, but that's a different story :)
Otherwise, you might also be interested in these Colfax Research links on multithreaded matrix transposition:
http://research.colfaxinternational.com/file.axd?file=2013%2F8%2FColfax_Transposition-7110P.pdf
http://colfaxresearch.com/multithreaded-transposition-of-square-matrices-with-common-code-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors/ (requires login now...)
I'd guess that your best bet would be to try and combine the convolution and the transpose - i.e. write out the results of the convolve transposed as you go. You're almost certainly memory bandwidth limited on the transpose so reducing the number of instructions used for the transpose isn't really going to help (hence the lack of improvement from using AVX). Reducing the number of passes over your data is going to give you the best performance improvements.
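A rough sketch of what that fusion could look like, assuming a row-major float image and an odd-length 1D kernel. The function names, the scratch buffer `tmp`, and the clamp-to-edge border handling are all assumptions made for this example, not taken from the question's code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: convolves each row of the rows x cols row-major image
// `src` with `kernel` and stores the result transposed, so `dst` is laid out
// as cols x rows. Borders clamp to the nearest valid sample (an assumption).
void convolve_rows_transposed(const float *src, float *dst, int rows, int cols,
                              const std::vector<float> &kernel) {
    assert(kernel.size() % 2 == 1);
    const int radius = static_cast<int>(kernel.size() / 2);
    for (int y = 0; y < rows; ++y) {
        const float *row = src + y * cols;
        for (int x = 0; x < cols; ++x) {
            float acc = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                int xx = x + k;
                if (xx < 0) xx = 0;
                if (xx >= cols) xx = cols - 1;
                acc += row[xx] * kernel[k + radius];
            }
            dst[x * rows + y] = acc; // transposed store: (y, x) goes to (x, y)
        }
    }
}

// Separable 2D blur as two fused passes, with no standalone transpose.
// `tmp` must hold rows * cols floats; the result ends up back in `img`.
void blur_separable_fused(float *img, float *tmp, int rows, int cols,
                          const std::vector<float> &kernel) {
    convolve_rows_transposed(img, tmp, rows, cols, kernel); // tmp is cols x rows
    convolve_rows_transposed(tmp, img, cols, rows, kernel); // img is rows x cols again
}
```

This turns four passes over the data into two. The transposed store is strided, though, so the writes are not cache-friendly as written. Processing a small block of rows at a time, so that several adjacent outputs are written together, would probably be the next thing to try.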
Consider this 4x4 transpose.
#include <immintrin.h> // AVX intrinsics: __m128, __m256, _mm256_*
struct MATRIX {
union {
float f[4][4];
__m128 m[4];
__m256 n[2];
};
};
MATRIX myTranspose(MATRIX in) {
// This takes 15 assembler instructions (when compiled as a non-inlined function),
// and is faster than XMTranspose
// Comes in like this 1 2 3 4 5 6 7 8
// 9 10 11 12 13 14 15 16
//
// Want the result 1 5 9 13 2 6 10 14
// 3 7 11 15 4 8 12 16
__m256 t0, t1, t2, t3, t4, t5, n0, n1;
MATRIX result;
n0 = in.n[0]; // n0 = 1, 2, 3, 4, 5, 6, 7, 8
n1 = in.n[1]; // n1 = 9, 10, 11, 12, 13, 14, 15, 16
t0 = _mm256_unpacklo_ps(n0, n1); // t0 = 1, 9, 2, 10, 5, 13, 6, 14
t1 = _mm256_unpackhi_ps(n0, n1); // t1 = 3, 11, 4, 12, 7, 15, 8, 16
t2 = _mm256_permute2f128_ps(t0, t1, 0x20); // t2 = 1, 9, 2, 10, 3, 11, 4, 12
t3 = _mm256_permute2f128_ps(t0, t1, 0x31); // t3 = 5, 13, 6, 14, 7, 15, 8, 16
t4 = _mm256_unpacklo_ps(t2, t3); // t4 = 1, 5, 9, 13, 3, 7, 11, 15
t5 = _mm256_unpackhi_ps(t2, t3); // t5 = 2, 6, 10, 14, 4, 8, 12, 16
result.n[0] = _mm256_permute2f128_ps(t4, t5, 0x20); // n[0] = 1, 5, 9, 13, 2, 6, 10, 14
result.n[1] = _mm256_permute2f128_ps(t4, t5, 0x31); // n[1] = 3, 7, 11, 15, 4, 8, 12, 16
return result;
}