Background: I'm trying to create a pure D language implementation of functionality that's roughly equivalent to C's memchr but uses arrays and indices i
memchr, like memset and memcpy, generally reduces to a fairly small amount of machine code. You are unlikely to reproduce that kind of speed without inlining similar assembly code. One major issue to consider in an implementation is data alignment.
One generic technique you may be able to use is to place a sentinel (a copy of the byte you are searching for) just past the end of the data being searched, which guarantees that the search will find a match. That lets you move the end-of-data test out of the loop and perform it once, after the loop.
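A minimal sketch of the sentinel idea in C, returning an index rather than a pointer. The function name `find_byte_sentinel` is hypothetical, and the sketch assumes the caller owns the buffer and that `buf[len]` is a writable spare byte; if you cannot guarantee that, this technique does not apply.

```c
#include <stddef.h>

/*
 * Sentinel search sketch (hypothetical helper, not a library function).
 * Assumption: buf holds at least len + 1 writable bytes, so buf[len]
 * can be overwritten temporarily.
 * Returns the index of the first byte equal to c, or len if none is found.
 */
size_t find_byte_sentinel(unsigned char *buf, size_t len, unsigned char c)
{
    unsigned char saved = buf[len];   /* remember the spare byte */
    size_t i = 0;

    buf[len] = c;                     /* plant the sentinel so the loop must stop */
    while (buf[i] != c)               /* no bounds check inside the loop */
        i++;
    buf[len] = saved;                 /* restore the spare byte */

    return i;                         /* i == len means only the sentinel matched */
}
```

The loop body is a single compare and increment; the "did we run off the end" question is answered once, by comparing the result with `len`.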
GNU libc definitely uses an assembly version of memchr() (on any common Linux distro). This is why it is so unbelievably fast.
For example, counting the lines in an 11 GB file (as "wc -l" does) takes around 2.5 seconds with the assembly version of memchr() from GNU libc. If we replace that call with, for example, the C implementation of memchr() from FreeBSD, the time increases to about 30 seconds.
That is equivalent to replacing memchr() with a plain while loop that compares one char after another.
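For reference, a line counter built on memchr() might look roughly like the sketch below. The name `count_newlines` is hypothetical, and the sketch assumes the data is already in memory; a real tool would read the file in chunks and call something like this on each chunk.

```c
#include <stddef.h>
#include <string.h>

/* Count '\n' bytes in buf[0..len) by repeatedly calling memchr(). */
size_t count_newlines(const char *buf, size_t len)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        /* Jump straight to the next newline, if there is one. */
        const char *hit = memchr(p, '\n', (size_t)(end - p));
        if (hit == NULL)
            break;
        count++;
        p = hit + 1;                  /* resume just past the newline */
    }
    return count;
}
```

Almost all of the time is spent inside memchr(), which is why swapping its implementation changes the total run time so much.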
Here is FreeBSD's (BSD-licensed) memchr() from memchr.c. FreeBSD's online source code browser is a good reference for time-tested, BSD-licensed code examples.
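#include <stddef.h>	/* size_t, NULL */

void *
memchr(const void *s, int c, size_t n)
{
	if (n != 0) {
		const unsigned char *p = s;

		/* Compare one byte per iteration; c is converted to unsigned char. */
		do {
			if (*p++ == (unsigned char)c)
				return ((void *)(p - 1));
		} while (--n != 0);
	}
	return (NULL);	/* no match in the first n bytes */
}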
I would suggest taking a look at GNU libc's source. As for most functions, it will contain both a generic optimized C version of the function, and optimized assembly language versions for as many supported architectures as possible, taking advantage of machine specific tricks.
The x86-64 SSE2 version combines the results from pcmpeqb on a whole cache-line of data at once (four 16B vectors), to amortize the overhead of the early-exit pmovmskb/test/jcc.
gcc and clang are currently incapable of auto-vectorizing loops with if() break early-exit conditions, so they make naive byte-at-a-time asm from the obvious C implementation.
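To make the four-vector idea concrete, here is a rough sketch using SSE2 intrinsics from C. It is not glibc's code: the name `find_byte_sse2` is hypothetical, it uses unaligned loads instead of aligning the pointer first, and once a 64-byte block is known to contain a match it simply falls back to a byte loop to locate it. On targets without SSE2 the guarded part is skipped and only the byte loop runs.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first byte equal to c in buf[0..len), or len. */
size_t find_byte_sse2(const unsigned char *buf, size_t len, unsigned char c)
{
    size_t i = 0;

#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)c);   /* c in all 16 lanes */

    for (; i + 64 <= len; i += 64) {
        const __m128i *v = (const __m128i *)(buf + i);
        /* pcmpeqb on four 16-byte vectors */
        __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 0), needle);
        __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 1), needle);
        __m128i e2 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 2), needle);
        __m128i e3 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 3), needle);
        /* OR the results so one pmovmskb + branch covers 64 bytes */
        __m128i any = _mm_or_si128(_mm_or_si128(e0, e1),
                                   _mm_or_si128(e2, e3));
        if (_mm_movemask_epi8(any) != 0)
            break;                    /* a match lies in buf[i..i+64) */
    }
#endif

    /* Locate the match inside the flagged block, or scan the tail. */
    for (; i < len; i++)
        if (buf[i] == c)
            return i;
    return len;
}
```

A tuned version would use the per-vector masks and a bit-scan instruction to find the exact position instead of the final byte loop.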
This implementation of memchr from newlib is one example of an optimized memchr: it reads and tests 4 bytes at a time (the other functions in the newlib library are available in the same source tree).
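The 4-bytes-at-a-time idea can be sketched as follows. This is not newlib's code, only an illustration of the same kind of trick: the name `find_byte_word` is hypothetical, and it uses memcpy() for the word load to sidestep alignment instead of aligning the pointer first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the index of the first byte equal to c in buf[0..len), or len. */
size_t find_byte_word(const unsigned char *buf, size_t len, unsigned char c)
{
    const uint32_t ones    = 0x01010101u;
    const uint32_t highs   = 0x80808080u;
    const uint32_t pattern = ones * c;    /* c replicated into all 4 bytes */
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, buf + i, 4);           /* alignment-safe 4-byte load */
        w ^= pattern;                     /* bytes equal to c become 0x00 */
        if (((w - ones) & ~w & highs) != 0)
            break;                        /* some byte of w is zero */
    }

    /* Pin down the exact byte in the flagged word, or scan the tail. */
    for (; i < len; i++)
        if (buf[i] == c)
            return i;
    return len;
}
```

The expression `(w - ones) & ~w & highs` is nonzero exactly when at least one byte of `w` is zero, so the word loop never skips a match.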
Incidentally, most of the source code for the MSVC run-time library is available as an optional part of the MSVC installation, so you could look at that too.