Improving memory layout for parallel computing
问题 I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP. And looking for some suggestions to optimize the memory layout, just found out that removing one parameter from the structure into another vector (the blocked vector) gave and increase of about 10%. Anyone got any tips that can further improve this, or something i should take into consideration? Below is the most time consuming function that is executed for each timestep, and the structure used for