Dear stackoverflow community!
Today I found that on a high-end cluster architecture, an elementwise multiplication of 2 cubes with dimensions 1921 x 512 x 512 takes ~ 27
You could use scipy.fftpack.fftn (as suggested by hpaulj too) rather than numpy.fft.fftn, looks like it's doing what you want. It is however slightly less performing:
import numpy as np
import scipy.fftpack
ran = np.random.rand(192, 51, 51) # not much memory on my laptop
a = np.fft.fftn(ran)
b = scipy.fftpack.fftn(ran)
ran.strides
(20808, 408, 8)
a.strides
(16, 3072, 156672)
b.strides
(41616, 816, 16)
timeit -n 100 np.fft.fftn(ran)
100 loops, best of 3: 37.3 ms per loop
timeit -n 100 scipy.fftpack.fftn(ran)
100 loops, best of 3: 41.3 ms per loop