In order to speed up the functions like np.std, np.sum etc along an axis of an n dimensional huge numpy array, it is recommended to apply along the last axis.
When I
To elaborate on larsman's answer, here are some timings:
# normal C (row-major) order array
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a, axis=1)
1000 loops, best of 3: 272 us per loop
# transposing and summing along the first axis makes no real difference
# to performance
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a.T, axis=0)
1000 loops, best of 3: 269 us per loop
# however, converting to Fortran (column-major) order does improve speed...
>>> %%timeit a = np.asfortranarray(np.random.randn(500,400))
>>> np.sum(a, axis=1)
10000 loops, best of 3: 114 us per loop
# ... but only if you don't count the conversion in the timed operations
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(np.asfortranarray(a), axis=1)
1000 loops, best of 3: 599 us per loop
In summary, it might make sense to convert your arrays to Fortran order if you're going to apply a lot of operations over the columns, but the conversion itself is costly and almost certainly not worth it for a single operation.