I am searching for a faster method of accomplishing this:
int is_empty(char * buf, int size)
{
int i;
for(i = 0; i < size; i++) {
if(buf[
On many architectures, comparing 1 byte takes the same amount of time as 4 or 8, or sometimes even 16. 4 bytes is normally easy (either int or long), and 8 is too (long or long long). 16 or higher probably requires inline assembly to e.g., use a vector unit.
Also, branch mispredictions really hurt, so it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bitwise-OR the blocks together and test the final result.
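A minimal sketch of the OR-everything-together idea, working one byte at a time to keep it simple. The name is_empty_or and the size_t parameter are my own choices for illustration, not something from the question. There is no early exit, so this only pays off when the buffer is usually empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte in buf[0..size) is zero.
   Every byte is OR-ed into an accumulator and the result is tested once,
   so the loop body contains no data-dependent branch. */
int is_empty_or(const char *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    unsigned char acc = 0;
    for (size_t i = 0; i < size; i++)
        acc |= (unsigned char)buf[i];
    return acc == 0;
}
```

A loop of this shape is also a good candidate for the compiler's auto-vectorizer, though whether that actually happens depends on the compiler and flags.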
Expressing this is difficult in portable C: casting a char* to long* violates strict aliasing. Fortunately, you can use memcpy to portably express an unaligned multi-byte load that can alias anything. Compilers will optimize it to the asm you want.
For example, this work-in-progress implementation (https://godbolt.org/z/3hXQe7) on the Godbolt compiler explorer shows that you can get a good inner loop (with some startup overhead) from loading two consecutive uint_fast32_t vars (often 64-bit) with memcpy and then checking tmp1 | tmp2, because many CPUs will set flags according to an OR result, so this lets you check two words for the price of one.
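The following is a sketch of that kind of loop, not the code behind the Godbolt link. The function name is_empty_words and the byte-wise tail loop are assumptions made for illustration; tmp1 and tmp2 follow the naming used above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every byte in buf[0..size) is zero. */
int is_empty_words(const char *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    const size_t w = sizeof(uint_fast32_t);
    size_t i = 0;

    /* Main loop: load two words per iteration and test their OR once. */
    for (; size - i >= 2 * w; i += 2 * w) {
        uint_fast32_t tmp1, tmp2;
        memcpy(&tmp1, buf + i, w);      /* unaligned-safe, aliasing-safe load */
        memcpy(&tmp2, buf + i + w, w);
        if (tmp1 | tmp2)
            return 0;
    }

    /* Tail: fewer than 2*w bytes remain, so check them one at a time. */
    for (; i < size; i++)
        if (buf[i])
            return 0;
    return 1;
}
```

On targets with cheap unaligned loads, compilers typically turn each memcpy into a single load instruction, but it is worth checking the generated asm for your own compiler and flags.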
Getting it to compile efficiently for targets without efficient unaligned loads requires some manual alignment in the startup code, and even then gcc may not inline the memcpy for loads where it can't prove alignment.
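One way the manual alignment could look is sketched below. The name is_empty_aligned is hypothetical, and using sizeof(uint_fast32_t) as the alignment boundary is a conservative assumption. Nothing here tells the compiler that the pointer is aligned in the main loop, so whether the memcpy is inlined still depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every byte in buf[0..size) is zero. */
int is_empty_aligned(const char *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    const size_t w = sizeof(uint_fast32_t);
    size_t i = 0;

    /* Startup: step byte by byte until buf + i sits on a word boundary. */
    while (i < size && ((uintptr_t)(buf + i) % w) != 0) {
        if (buf[i])
            return 0;
        i++;
    }

    /* Main loop: whole words, each load starting at an aligned address. */
    for (; size - i >= w; i += w) {
        uint_fast32_t tmp;
        memcpy(&tmp, buf + i, w);
        if (tmp)
            return 0;
    }

    /* Cleanup: the last few bytes that do not fill a whole word. */
    for (; i < size; i++)
        if (buf[i])
            return 0;
    return 1;
}
```

GCC and Clang offer __builtin_assume_aligned as an extension for passing the alignment fact on to the optimizer; it is left out here to keep the sketch portable.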
The Hacker's Delight book and site are all about optimized C and assembly. The site also has lots of good references and is fairly up to date (AMD64 and NUMA techniques too).
int is_empty(char * buf, int size)
{
    int i, content=0;
    /* stop as soon as a non-zero byte has been OR-ed in */
    for(i = 0; !content && i < size; i++)
    {
        content=content | buf[i]; // bitwise or
    }
    return (content==0);
}
For something so simple, you'll need to see what code the compiler is generating.
$ gcc -S -O3 -o empty.s empty.c
And the contents of the assembly:
.text
.align 4,0x90
.globl _is_empty
_is_empty:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %edx ; edx = size (second argument)
movl 8(%ebp), %ecx ; ecx = pointer to buffer (first argument)
testl %edx, %edx
jle L3
xorl %eax, %eax
cmpb $0, (%ecx)
jne L5
.align 4,0x90
L6:
incl %eax ; real guts of the loop are in here
cmpl %eax, %edx
je L3
cmpb $0, (%ecx,%eax) ; compare byte-by-byte of buffer
je L6
L5:
leave
xorl %eax, %eax
ret
.align 4,0x90
L3:
leave
movl $1, %eax
ret
.subsections_via_symbols
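This is very optimized. The loop does three things: it increments the offset, compares the offset to the size, and compares the byte at base+offset to 0.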
It could be optimized slightly more by comparing at a word-by-word basis, but then you'd need to worry about alignment and such.
When all else fails, measure first, don't guess.
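A rough sketch of what measuring could look like with nothing but the standard clock() function. All names here (is_empty_fn, is_empty_bytes, time_is_empty) are made up for illustration, and a serious benchmark would also need warm-up runs, realistic buffer contents, and a check that the optimizer has not removed the work.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the candidate implementations. */
typedef int (*is_empty_fn)(const char *, size_t);

/* Byte-by-byte baseline with early exit. */
int is_empty_bytes(const char *buf, size_t size)
{
    for (size_t i = 0; i < size; i++)
        if (buf[i])
            return 0;
    return 1;
}

/* Calls fn(buf, size) reps times and returns the CPU seconds used.
   *empty_count receives how many calls reported "empty"; accumulating
   through a volatile makes it harder for the optimizer to drop calls. */
double time_is_empty(is_empty_fn fn, const char *buf, size_t size,
                     int reps, int *empty_count)
{
    assert(fn != NULL && empty_count != NULL && reps > 0);
    volatile int sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += fn(buf, size);
    clock_t end = clock();
    *empty_count = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_is_empty once per candidate with the same buffer and the same reps gives numbers that can be compared directly; the returned count doubles as a sanity check that every variant agrees on the answer.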
The initial C algorithm is pretty much as slow as it can be in VALID C. If you insist on using C then try a "while" loop instead of "for":
int i = 0;
while (i< MAX)
{
// operate on the string
i++;
}
This is pretty much the fastest one-dimensional string operation loop you can write in C, unless you can force the compiler to keep i in a register with the "register" keyword, though I am told that modern compilers almost always ignore it.
Also, searching a constant-sized array to check whether it is empty is very wasteful, and 0 is not empty: it is a value in the array.
A better solution for speed would be to use a dynamic array (int* piBuffer) and a variable that stores the current size (unsigned int uiBufferSize). When the array is empty, the pointer is NULL and uiBufferSize is 0. Make a class with these two as protected member variables. One could also easily write a template for dynamic arrays that stores 32-bit values, either primitive types or pointers. For primitive types there is not really any way to test for "empty" (I interpret this as "undefined"), but you can of course define 0 to represent an available entry. For an array of pointers you should initialize all entries to NULL, and set an entry to NULL as soon as you have deallocated its memory. NULL DOES mean "points at nothing", so this is a very convenient way to represent empty.
One should not use dynamically resized arrays in really complicated algorithms, at least not in the development phase; there are simply too many things that can go wrong. One should at least first implement the algorithm using an STL container (or a well-tested alternative), and then, once the code works, swap the tested container for a simple dynamic array (and if you can avoid resizing the array too often, the code will be both faster and more fail-safe).
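Since the code in this thread is C, here is a sketch of the pointer-plus-size idea as a plain struct rather than a class. The field names piBuffer and uiBufferSize come from the description above; the IntArray type and the int_array_* functions are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int *piBuffer;             /* NULL while the array is empty */
    unsigned int uiBufferSize; /* number of stored elements, 0 when empty */
} IntArray;

void int_array_init(IntArray *a)
{
    assert(a != NULL);
    a->piBuffer = NULL;
    a->uiBufferSize = 0;
}

/* The emptiness test is a single comparison; no scan of the data. */
int int_array_is_empty(const IntArray *a)
{
    assert(a != NULL);
    return a->uiBufferSize == 0;
}

/* Appends one value. Returns 1 on success, 0 if allocation failed
   (in which case the array is left unchanged). */
int int_array_push(IntArray *a, int value)
{
    assert(a != NULL);
    int *p = realloc(a->piBuffer, (a->uiBufferSize + 1u) * sizeof *p);
    if (p == NULL)
        return 0;
    p[a->uiBufferSize++] = value;
    a->piBuffer = p;
    return 1;
}

/* Frees the storage and returns the array to the empty state. */
void int_array_clear(IntArray *a)
{
    assert(a != NULL);
    free(a->piBuffer);
    int_array_init(a);
}
```

Growing by one element per push keeps the sketch short but resizes on every call; growing the capacity in larger steps would be the usual way to avoid resizing too often.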
A better solution for complicated and cool code is to use either std::vector or std::map (or any container class: STL, homegrown or 3rd party) depending on your needs, but looking at your code I would say that std::vector is enough. The STL containers are templates, so they should be pretty fast too. As a general rule, use an STL container to store object pointers (always store object pointers and not the actual objects; copying entire objects for every entry will really hurt your execution speed) and dynamic arrays for more basic data (bitmaps, sound, etc.), i.e. primitive types.
I came up with the REPE SCASW solution independently by studying x86 assembly language manuals, and I agree that the example using this string operation instruction is the fastest. The other assembly example, which has separate compare, jump etc. instructions, is almost certainly slower (but still much faster than the initial C code, so still a good post), as the string operations are among the most highly optimized on all modern CPUs. They may even have their own logic circuitry (does anyone know?).
The REPE SCASD does not need to fetch a new instruction or increase the instruction pointer, and that is just the stuff an assembly novice like me can come up with. On top of that comes the hardware optimization: string operations are critical for almost all kinds of modern software, in particular multimedia applications (copying PCM sound data, uncompressed bitmap data, etc.), so optimizing these instructions must have been a very high priority every time a new 80x86 chip was being designed. I use it for a novel 2D sprite collision algorithm.
It says that I am not allowed to have an opinion, so consider the following an objective assessment: modern compilers (UNMANAGED C/C++; pretty much everything else is managed code and is slow as hell) are pretty good at optimizing, but it cannot be avoided that for VERY specific tasks the compiler generates redundant code. One could look at the assembly that the compiler outputs so that one does not have to translate a complicated algorithm entirely from scratch, even though that is very fun to do (for some) and it is much more rewarding to write code the hard way. Anyway, algorithms using "for" loops, in particular for string operations, can often be optimized very significantly, as the for loop generates a lot of code that is often not needed. Example: for (int i = 1000; i>0; i--) DoSomething(); This line generates 6-10 lines of assembly if the compiler is not very clever (it might be), but the optimized assembly version CAN be:
mov cx, 1000
_DoSomething:
// loop code....or call Func, slower but more readable
loop _DoSomething
That was 2 lines, and it does exactly the same as the C line (it uses registers instead of memory addresses, which is MUCH faster; arguably this is not EXACTLY the same as the C line, but that is semantics). How much of an optimization this example is depends on how well modern compilers optimize, which I have no clue about, but analyzing an algorithm with the goal of implementing it in the fewest and fastest assembly lines often works well. I have had very good results with first implementing the algorithm in C/C++ without caring about optimization and then translating and optimizing it in assembly. The fact that each C line becomes many assembly lines often makes some optimizations very obvious, and some instructions are also faster than others:
INC DX ; is faster than:
ADD DX,1 ;if ADD DX,1 is not just replaced with INC DX by the assembler or the CPU
LOOP ; is faster than manually decreasing, comparing and jumping
REPxx STOSx/MOVSx/LODSx is faster than using cmp, je/jne/jae etc and loop
JMP or conditional jumping is faster than using CALL, so in a loop that is executed VERY frequently (like rendering), including functions in the code so it is accessible with "local" jumps can also boost performance.
The last bit is very relevant for this question, fast string operations. So this post is not all rambling.
And lastly, design your assembly algorithm in the way that requires the fewest jumps for a typical execution.
Also, don't bother optimizing code that is not called that often. Use a profiler, see what code is called most often, and start with that. Anything that is called less than 20 times a second (and completes much faster than 1000 ms / 20) is not really worth optimizing. Look at code that is not synchronized to timers and the like and is executed again immediately after it has completed. On the other hand, if your rendering loop can do 100+ FPS on a modest machine, it does not make sense economically to optimize it, but real coders love to code and do not care about economics: they optimize the AppStart() method into 100% assembly even though it is only called once :) Or use a z rotation matrix to rotate Tetris pieces 90 degrees :P Anyone who does that is awesome!
If anyone has some constructive correction, which is not VERY hurtful, then I would love to hear it. I code almost entirely by myself, so I am not really exposed to any influences. I once paid a nice Canadian game developer to teach me Direct3D, and though I could just as easily have read a book, the interaction with another coder who was somewhat above my level in certain areas was fun.
Thanks for good content generally. I think I will go and answer some of the simpler questions, give a little back.
If your program is x86 only or x64 only, you can easily optimize using inline assembler. The REPE SCASD instruction will scan a buffer until a dword that differs from EAX is found.
Since there is no equivalent standard library function, probably no compiler/optimizer will be able to use these instructions (as confirmed by Sufian's code).
Off the top of my head, something like this would do if your buffer length is a multiple of 4 bytes (MASM syntax):
_asm {
CLD ; search forward
XOR EAX, EAX ; compare each dword against zero
LEA EDI, [buf] ; search in buf (use MOV instead if buf is a pointer variable)
MOV ECX, [buflen] ; buflen is the length in bytes
SHR ECX, 2 ; using dwords so len/=4
REPE SCASD ; scan while dwords equal EAX
JE bufferEmpty ; ZF still set? then every dword was 0
}
Tomas
EDIT: updated with Tony D's fixes