Like I said in comments, your biggest error is benchmarking with -O0. I discussed exactly why testing with -O0 is a terrible idea in the first part of another post.
Benchmarks should be done with at least -O2, preferably with the same optimizations your full project will build with, if you're trying to test what source makes the fastest asm.
-O0 explains inline asm being way faster than C with intrinsics (or regular compiled C, for a C strlen implementation borrowed from glibc). IDK if -O0 would still optimize away a loop that repeatedly discards the result of the library strlen, or if it somehow just avoided some other huge performance pitfall. It's not interesting to guess about exactly what happened in such a flawed test.
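If you do re-run the benchmark at -O2, you need to stop the compiler from hoisting or deleting the strlen calls whose results you discard. A minimal sketch of one way to do that (my example, not your code; the function name and repeat count are made up) is an empty asm statement used as a sink and memory clobber:

#include <string.h>

size_t strlen_repeat_benchmark(const char *s)
{
    size_t len = 0;
    for (long i = 0; i < 1000000; i++) {   // arbitrary repeat count
        len = strlen(s);
        // Empty asm: pretends to read/modify len and to touch memory, so the
        // compiler can't delete the call or hoist it out of the loop at -O2.
        asm volatile("" : "+r"(len) : "r"(s) : "memory");
    }
    return len;
}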
I tightened up your SSE2 inline-asm version. Mostly just because I've been playing with gcc inline asm input/output constraints recently, and wanted to see what it would look like if I wrote it to let the compiler choose which registers to use for temporaries, and avoided unneeded instructions.
The same inline asm works for 32 and 64-bit x86 targets; see this compiled for both on the Godbolt compiler explorer. When compiling to a stand-alone function, it doesn't have to save/restore any registers, even in 32bit mode:
WARNING: it can read past the end of the string by up to 15 bytes. This could segfault. See Is it safe to read past the end of a buffer within the same page on x86 and x64? for details on avoiding that: get to an alignment boundary, then use aligned loads because that's always safe if the vector contains at least 1 byte of string data. I left the code unchanged because it's interesting to discuss the effect of aligning pointers for SSE vs. AVX. Aligning pointers also avoids cache-line splits, and 4k page-splits (which are a performance pothole before Skylake).
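For reference, the align-first approach looks roughly like this with intrinsics (my sketch, not the code under review): round the pointer down to a 16B boundary, do one aligned load and mask off the bytes before the real start, then loop with aligned loads, which can never cross into the next page.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

size_t strlen_sse2_aligned(const char *src)
{
    const char *aligned = (const char *)((uintptr_t)src & ~(uintptr_t)15);
    __m128i zero = _mm_setzero_si128();

    // First, possibly-partial vector: discard match bits for bytes before src.
    __m128i v = _mm_load_si128((const __m128i *)aligned);
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
    mask &= 0xFFFFu << (src - aligned);
    while (mask == 0) {
        aligned += 16;
        v = _mm_load_si128((const __m128i *)aligned);
        mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
    }
    return (size_t)(aligned + __builtin_ctz(mask) - src);
}

Anyway, here's the tightened-up inline-asm version: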
#include <immintrin.h>

size_t strlen_sse2_asm(const char* src){
    // const char *orig_src = src;  // for a pointer-increment with a "+r" (src) output operand
    size_t result = 0;
    unsigned int tmp1;
    __m128i zero = _mm_setzero_si128(), vectmp;

    // A pointer-increment may perform better than an indexed addressing mode
    asm(
        "\n.Lloop:\n\t"
        "movdqu   (%[src], %[res]), %[vectmp]\n\t"  // result reg is used as the loop counter
        "pcmpeqb  %[zerovec], %[vectmp]\n\t"
        "pmovmskb %[vectmp], %[itmp]\n\t"
        "add      $0x10, %[res]\n\t"
        "test     %[itmp], %[itmp]\n\t"
        "jz  .Lloop\n\t"

        "sub      $0x10, %[res]\n\t"      // undo the final add: res = start of the chunk holding the 0
        "bsf      %[itmp], %[itmp]\n\t"
        "add      %q[itmp], %q[res]\n\t"  // q modifier to get quadword register.
        // (add %edx, %rax doesn't work).  But in 32bit mode, q gives a 32bit reg, so the same code works
        : [res] "+r"(result), [vectmp] "=&x" (vectmp), [itmp] "=&r" (tmp1)
        : [zerovec] "x" (zero)  // There might already be a zeroed vector reg when inlining
          , [src] "r"(src)
          , [dummy] "m" (*(const char (*)[])src)  // this reads the whole object, however long gcc thinks it is
        : //"memory"  // not needed because of the dummy input
    );
    return result;
    // return result + tmp1;  // doing the add outside the asm makes gcc sign or zero-extend tmp1.
    // No benefit anyway, since gcc doesn't know that tmp1 is the offset within a 16B chunk or anything.
}
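(If you want a quick sanity check before benchmarking, something like this works when compiled together with the function above; the test lengths are arbitrary:)

#include <assert.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    char buf[256];
    memset(buf, 'a', sizeof(buf));
    for (size_t len = 0; len < 200; len++) {  // stays well inside buf despite the 16B over-read
        buf[len] = '\0';
        assert(strlen_sse2_asm(buf) == len);
        buf[len] = 'a';
    }
    puts("ok");
}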
Note the dummy input, as an alternative to a "memory" clobber, to tell the compiler that the inline asm reads the memory pointed to by src, as well as the value of src itself. (The compiler doesn't know what the asm does; for all it knows, the asm just aligns a pointer with an and or something, so assuming that all input pointers are dereferenced would lead to missed optimizations from reordering / combining loads and stores across the asm. Also, this lets the compiler know we only read the memory, not modify it.) The GCC manual uses an example with this unspecified-length array syntax, "m" (*(const char (*)[])src).
It should keep register pressure to a minimum when inlining, and doesn't tie up any special-purpose registers (like ecx, which is needed for variable-count shifts).
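To illustrate the dummy-operand vs. clobber difference with a toy example (something I'm making up here, not part of the strlen code): both of these tell gcc that the asm reads the byte p points to, but the "memory" clobber is much more heavy-handed than the dummy "m" operand.

#include <stddef.h>

size_t first_byte_clobber(const char *p) {
    size_t b;
    // "memory" clobber: gcc must assume the asm may read or write ANY memory,
    // so it can't reorder or combine unrelated loads/stores across it.
    asm("movzbl (%1), %k0" : "=r"(b) : "r"(p) : "memory");
    return b;
}

size_t first_byte_dummy(const char *p) {
    size_t b;
    // Dummy "m" input: only the object p points to counts as read,
    // so unrelated memory operations can still move across the asm.
    asm("movzbl (%1), %k0"
        : "=r"(b)
        : "r"(p), "m"(*(const char (*)[])p));
    return b;
}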
If you could shave another uop out of the inner loop, it would be down to 4 uops that could issue at one per cycle. As it is, 5 uops means each iteration may take 2 cycles to issue from the frontend, on Intel SnB CPUs. (Or 1.25 cycles on later CPUs like Haswell, and maybe on SnB if I was wrong about the whole-number behaviour.)
Using an aligned pointer would allow the load to fold into a memory operand for pcmpeqb. (As well as being necessary for correctness if the string start is unaligned and the end is near the end of a page). Interestingly, using the zero-vector as the destination for pcmpeqb is ok in theory: you don't need to re-zero the vector between iterations, because you exit the loop if it's ever non-zero. It has 1-cycle latency, so turning the zero vector into a loop-carried dependency is only a problem when cache-misses delay an old iteration. Removing this loop-carried dependency chain might help in practice, though, by letting the back end go faster when catching up after a cache miss that delayed an old iteration.
AVX solves the problem completely (except for correctness if the string ends near the end of a page). AVX allows the load to be folded even without doing an alignment check first, and 3-operand non-destructive vpcmpeqb avoids turning the zero vector into a loop-carried dependency. AVX2 would allow checking 32B at once.
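An AVX2 version with intrinsics might look something like this (my sketch; it has the same read-past-the-end caveat as the SSE2 version unless you align first):

#include <immintrin.h>
#include <stddef.h>

size_t strlen_avx2(const char *src)
{
    __m256i zero = _mm256_setzero_si256();
    size_t offset = 0;
    for (;;) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + offset));
        // vpcmpeqb is non-destructive here: `zero` stays zero, no loop-carried dep.
        unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
        if (mask)
            return offset + __builtin_ctz(mask);
        offset += 32;
    }
}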
Unrolling will help either way, but helps more without AVX. Align to a 64B boundary or something, and then load the whole cache line into four 16B vectors. Doing a combined check on the result of PORing them all together may be good, since pmovmsk + compare-and-branch is 2 uops.
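A rough intrinsics sketch of that unrolled check (my code; it assumes the caller has already scanned from the real string start up to a 64B boundary, and passes both pointers so the length comes out relative to start):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static size_t strlen_from_aligned64(const char *start, const char *p)
{
    __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i v0 = _mm_load_si128((const __m128i *)p + 0);
        __m128i v1 = _mm_load_si128((const __m128i *)p + 1);
        __m128i v2 = _mm_load_si128((const __m128i *)p + 2);
        __m128i v3 = _mm_load_si128((const __m128i *)p + 3);
        __m128i c0 = _mm_cmpeq_epi8(v0, zero);
        __m128i c1 = _mm_cmpeq_epi8(v1, zero);
        __m128i c2 = _mm_cmpeq_epi8(v2, zero);
        __m128i c3 = _mm_cmpeq_epi8(v3, zero);
        // One test per cache line: OR the compare results together.
        __m128i any = _mm_or_si128(_mm_or_si128(c0, c1), _mm_or_si128(c2, c3));
        if (_mm_movemask_epi8(any)) {
            // Only reached once: figure out which 16B chunk held the zero byte.
            unsigned m0 = _mm_movemask_epi8(c0), m1 = _mm_movemask_epi8(c1);
            unsigned m2 = _mm_movemask_epi8(c2), m3 = _mm_movemask_epi8(c3);
            uint64_t mask = m0 | ((uint64_t)m1 << 16) | ((uint64_t)m2 << 32) | ((uint64_t)m3 << 48);
            return (size_t)(p - start) + __builtin_ctzll(mask);
        }
        p += 64;
    }
}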
Using SSE4.1 PTEST doesn't help (compared to pmovmsk / test / jnz) because it's 2 uops and can't macro-fuse the way test can. PTEST can directly test for the whole 16B vector being all-zero or all-ones (using the ANDNOT -> CF part), but not whether one of the byte-elements is zero. (So we can't avoid pcmpeqb.)
Have a look at Agner Fog's guides for optimizing asm, and the other links on the x86 wiki. Most of the optimization guides (Agner Fog's, and Intel's and AMD's) mention optimizing memcpy and strlen specifically, IIRC.