I am working on a system, written in C++, running on a Xeon on Linux, that needs to run as fast as possible. There is a large data structure (basically an array of structs)
A good (long) article on organizing data structures to take the cache and RAM hierarchy into account, by Ulrich Drepper, GNU libc's maintainer at the time: "What Every Programmer Should Know About Memory", https://lwn.net/Articles/250967/ (full PDF here: http://www.akkadia.org/drepper/cpumemory.pdf)
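
To make the "array of structs" point concrete, here is a minimal sketch of one layout change that kind of article leads to: moving the fields the hot loop reads out of the struct and into their own contiguous arrays (struct-of-arrays). The `Order` record, its fields, and the function names are all made up for illustration, since the question doesn't show the real struct. Whether this helps depends on your access pattern, so measure before and after.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record (not from the question). On typical x86-64 this is
// 64 bytes, so every element occupies a whole cache line even though the
// loop below only needs 12 bytes of it.
struct Order {
    double price;        // hot: read in the inner loop
    std::uint32_t qty;   // hot: read in the inner loop
    char note[52];       // cold: rarely touched, but loaded along with the rest
};

// Array-of-structs traversal: hot and cold data share cache lines.
double total_aos(const std::vector<Order>& orders) {
    double sum = 0.0;
    for (const Order& o : orders) {
        sum += o.price * o.qty;
    }
    return sum;
}

// Struct-of-arrays: the hot fields are stored contiguously, so each cache
// line fetched by the loop holds only data the loop actually uses.
struct OrdersSoA {
    std::vector<double> price;
    std::vector<std::uint32_t> qty;
};

double total_soa(const OrdersSoA& orders) {
    assert(orders.price.size() == orders.qty.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < orders.price.size(); ++i) {
        sum += orders.price[i] * orders.qty[i];
    }
    return sum;
}
```

Both functions compute the same result; only the memory layout differs. If most of your code touches whole records at once, the original array of structs may well be the better layout, so treat this as one option to benchmark rather than a rule.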