I was curious whether there is any efficiency advantage to using memset() in a situation similar to the one below. Given code along the following lines:
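(A sketch of the kind of code in question; the buffer size and exact types here are assumptions, based on the answers below, which refer to a buffer my_buffer, an intermediate pointer p, and an explicit cast:)

char my_buffer[10];
char *p = (char *) my_buffer;
unsigned int i;

for (i = 0; i < sizeof(my_buffer); i++)
{
    *p++ = 0;
}

versus:

memset((void *) my_buffer, 0, sizeof(my_buffer));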
memset() gives you a standard way to write the code, letting the particular platform/compiler library determine the most efficient mechanism. Depending on the data size, it may, for example, use 32-bit or 64-bit stores as much as possible.
Also remember that this:
for (i = 0; i < sizeof(my_buffer); i++)
{
    p[i] = 0;
}
can also be faster than
for (i = 0; i < sizeof(my_buffer); i++)
{
    *p++ = 0;
}
As already answered, the compiler often has hand-optimized routines for memset(), memcpy(), and other string functions, and we are talking significantly faster. Now, the amount of code, the number of instructions, in a fast memcpy or memset from the compiler is usually much larger than in the loop solution you suggested; fewer lines of code and fewer instructions do not mean faster.
Anyway, my message is: try both. Disassemble the code, see the difference, try to understand it, and ask questions at Stack Overflow if you don't. Then use a timer and time the two solutions: call whichever memcpy/memset function thousands or hundreds of thousands of times and time the whole thing (to eliminate error in the timing). Make sure you do short copies, say 5 or 7 items, and large copies, like hundreds of bytes per memset, and try some prime numbers while you are at it. On some processors and some systems, your loop can be faster for a few items, like 3 or 5 or something like that, but very quickly it gets slow.
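For example, a minimal timing harness along these lines (the buffer size, iteration count, and list of sizes are arbitrary choices of mine, and clock() is just the portable lowest common denominator; compile with optimizations and check the disassembly as described above, since the compiler may transform either routine):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 1000000L

static char buffer[512];
static volatile char sink;  /* keeps the work from being optimized away entirely */

static void loop_clear(char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i] = 0;
}

static void memset_clear(char *p, size_t n)
{
    memset(p, 0, n);
}

static double time_one(void (*fn)(char *, size_t), size_t n)
{
    long i;
    clock_t start = clock();
    for (i = 0; i < ITERATIONS; i++)
        fn(buffer, n);
    sink = buffer[0];
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    /* short sizes, a few primes, and larger blocks, as suggested above */
    static const size_t sizes[] = { 3, 5, 7, 13, 64, 257, 512 };
    size_t s;
    for (s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        size_t n = sizes[s];
        printf("n = %3lu  loop: %.3fs  memset: %.3fs\n",
               (unsigned long)n, time_one(loop_clear, n), time_one(memset_clear, n));
    }
    return 0;
}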
Here is one hint about performance. The DDR memory in your computer is likely 64 bits wide and needs to be written 64 bits at a time; maybe it has ECC, in which case you have to compute across those bits and write 72 bits at a time. It's not always that exact number, but follow the thought here and it will make sense for 32 bits or 64 or 128 or whatever. If you perform a single byte-write instruction to RAM, the hardware is going to need to do one of two things. If there are no caches along the way, the memory system has to perform a 64-bit read, modify the one byte, then write it back. Without some sort of hardware optimization, writing 8 bytes within that one DRAM row costs 16 memory cycles, and DRAM is very, very slow; don't be fooled by the 1333 MHz numbers.
Now if you have a cache, the first byte write is going to require a cache-line read from DRAM, which is one or more of these 64-bit reads. The next 7 or 15 or however many byte writes are probably going to be really fast, as they only go to the cache and not to DDR, but eventually that cache line goes back out to DRAM: slow, and one or two or four, etc., of these 64-bit (or whatever) DDR transfers. So even though you are only doing writes, you still have to read all of that RAM and then write it back, so twice as many cycles as desired. If possible, and it is with some processors and memory systems, the memset, or the write half of a memcpy, can use single instructions that cover a whole cache line or whole DDR location, so that no read is required: instantly doubled speed. This is not how all the optimizations work, but it hopefully gives you an idea of how to think about the problem. With your program being pulled into cache in cache lines, you can double or triple the number of instructions executed if in return you cut the number of DDR cycles by half or a quarter or more, and you win overall.
At a minimum, the compiler's memset and memcpy routines are going to perform a byte operation if the start address is odd, then a 16-bit operation if not aligned on 32 bits, then a 32-bit one if not aligned on 64, and on up until they hit the optimal transfer size for that instruction set/system. On ARM they tend to aim for 128 bits. So the worst case on the front end would be a single byte, then a single halfword, then a few words, before getting into the main set or copy loop (in the case of ARM, 128 bits written per instruction). Then on the back end, if unaligned, the same deal: a few words, one halfword, one byte in the worst case. You will also see the libraries do things like: if the number of bytes is less than X, where X is a small number like 13 or so, go into a loop like yours and just copy some bytes, because the number of instructions and clock cycles to support that loop is smaller/faster. Disassemble, or find the gcc library source for ARM (and probably MIPS and some other good processors), and see what I am talking about.
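To make that head/bulk/tail idea concrete, here is a minimal sketch of such a routine (not any particular library's implementation; real ones are typically hand-tuned assembly, they would also branch to a simple byte loop for very small n as described above, and the pointer cast for the wide stores leans on behavior a library, unlike portable C, is allowed to assume):

#include <stddef.h>
#include <stdint.h>

void *my_memset(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)value;
    uint64_t pattern = (uint64_t)b * 0x0101010101010101ULL; /* byte replicated 8x */

    /* front end: byte stores until the pointer is 64-bit aligned */
    while (n > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = b;
        n--;
    }

    /* main loop: aligned 64-bit stores */
    while (n >= 8) {
        *(uint64_t *)(void *)p = pattern;
        p += 8;
        n -= 8;
    }

    /* back end: remaining bytes */
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}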
It depends on the quality of the compiler and the libraries. In most cases memset is superior.
The advantage of memset is that on many platforms it is actually a compiler intrinsic; that is, the compiler can "understand" the intention to set a large swath of memory to a certain value, and possibly generate better code.
In particular, that could mean using specific hardware operations for setting large regions of memory, like SSE on the x86, AltiVec on the PowerPC, NEON on the ARM, and so on. This can be an enormous performance improvement.
On the other hand, by using a for loop you are telling the compiler to do something more specific, "load this address into a register. Write a number to it. Add one to the address. Write a number to it," and so on. In theory a perfectly intelligent compiler would recognize this loop for what it is and turn it into a memset anyway; but I have never encountered a real compiler that did this.
So, the assumption is that memset was written by smart people to be the very best and fastest possible way to set a whole region of memory, for the specific platform and hardware the compiler supports. That is often, but not always, true.
This applies to both memset() and memcpy():

- Fewer lines of code: as you have already mentioned, it's shorter.
- More readable: shorter usually makes it more readable as well. (memset() is more readable than that loop.)
- It can be faster: sometimes more aggressive compiler optimizations are possible.
- Misalignment: in some cases, when you're dealing with misaligned data on a processor that does not support misaligned accesses, memset() and memcpy() may be the only clean solution.

To expand on the 3rd point, memset() can be heavily optimized by the compiler using SIMD and such. If you write a loop instead, the compiler will first need to "figure out" what it does before it can attempt to optimize it.

The basic idea here is that memset() and similar library functions, in some sense, "tell" the compiler your intent.
As mentioned by @Oli in the comments, there are some downsides. I'll expand on them here:

- You need to make sure that memset() actually does what you want. The standard doesn't say that zeros for the various datatypes are necessarily zero in memory.
- memset() is restricted to only 1 byte of content. So you can't use memset() if you want to set an array of ints to something other than zero (or 0x01010101 or something...); see the sketch just below.
- Although rare, there are cases where it's possible to beat the compiler with your own loop.*
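A small sketch of those first two downsides (the variable names and values here are mine, purely for illustration):

#include <string.h>

double x;
int   *q;
int    counts[4];

memset(&x, 0, sizeof(x));          /* all-bits-zero: 0.0 on IEEE 754 machines, but not guaranteed by the C standard */
memset(&q, 0, sizeof(q));          /* all-bits-zero is likewise not guaranteed to be a null pointer */
memset(counts, 1, sizeof(counts)); /* sets every BYTE to 0x01: each int becomes 0x01010101, not 1 */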
*I'll give one example of this from my experience:

Although memset() and memcpy() are usually compiler intrinsics with special handling by the compiler, they are still generic functions. They say nothing about the datatype, including the alignment of the data.

So in a few (albeit rare) cases, the compiler isn't able to determine the alignment of the memory region, and thus must produce extra code to handle misalignment. Whereas if you, the programmer, are 100% sure of the alignment, using a loop might actually be faster.

A common example is when using SSE/AVX intrinsics (such as copying a 16/32-byte aligned array of floats). If the compiler can't determine the 16/32-byte alignment, it will need to use misaligned loads/stores and/or extra handling code. If you simply write a loop using SSE/AVX aligned load/store intrinsics, you can probably do better.
#include <immintrin.h>  // AVX intrinsics

float *ptrA = ... // some unknown source, guaranteed to be 32-byte aligned
float *ptrB = ... // some unknown source, guaranteed to be 32-byte aligned
int length = ...  // some unknown source, guaranteed to be a multiple of 8

// memcpy() - The compiler can't read comments. It doesn't know the data is
// 32-byte aligned, so it may generate unnecessary misalignment-handling code.
memcpy(ptrA, ptrB, length * sizeof(float));

// This loop could potentially be faster because it "uses" the fact that
// the pointers are aligned. The compiler can also further optimize it.
for (int c = 0; c < length; c += 8){
    _mm256_store_ps(ptrA + c, _mm256_load_ps(ptrB + c));
}
Your variable p is only required for the initialisation loop. The code for the memset should simply be

memset(my_buffer, 0, sizeof(my_buffer));

which is simpler and less error-prone. The point of a void * parameter is exactly that it will accept any pointer type; the explicit cast is unnecessary, and the assignment to a pointer of a different type is pointless.
So one benefit of using memset() in this case is to avoid an unnecessary intermediate variable.

Another benefit is that memset() on any particular platform is likely to be optimised for that platform, whereas the efficiency of your loop depends on the compiler and the compiler settings.
Two advantages:

- The version with memset is easier to read - this is related to, but not the same as, having fewer lines of code. It takes less thinking to know what the memset version does, especially if you write it

memset(my_buffer, 0, sizeof(my_buffer));

instead of with the indirection through p and the unnecessary cast to void * (NOTE: the cast is only unnecessary if you're really coding in C and not C++ - some people are unclear on the difference).

- memset is likely to be able to write 4 or 8 bytes at a time and/or take advantage of special cache-hint instructions; therefore it may well be faster than your byte-at-a-time loop. (NOTE: some compilers are clever enough to recognize a bulk-clearing loop and substitute either wider writes to memory or a call to memset. Your mileage may vary. Always measure performance before attempting to shave cycles.)
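To see whether your own compiler performs that substitution, a quick check along these lines works (the function here is a made-up example; depending on the compiler and version you may need -O3 rather than -O2 for the transformation to kick in):

#include <stddef.h>

/* a bulk-clearing loop of exactly the shape such compilers look for */
void clear(char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        buf[i] = 0;
}

/* compile with: gcc -O2 -S clear.c  and look for a call to memset in clear.s */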