A while ago, I stumbled upon this 2001 DDJ article by Alexandrescu: http://www.ddj.com/cpp/184403799
It\'s about comparing various ways to initialize
Memset/memcpy are mostly written with a basic instruction set in mind, and so can be outperformed by specialized SSE routines, which on the other hand enforce certain alignment constraints.
But to reduce it to a list :
This list only comes into play for things where you need the performance. Too small/or once initialized data-sets are not worth the hassle.
Here is an implementation of memcpy from AMD, I can't find the article which described the concept behind the code.