I have a 1-d list as follows:
data = [1,5,9,13,
2,6,10,14,
3,7,11,15,
4,8,12,16]
I want to obtain a list of the mean values over each block of that data.

Listed in this post are some solution suggestions -
import numpy as np

def grouped_mean(data, M2, N1, N2):
    # Parameters:
    # M2 = number of columns in the input data
    # N1, N2 = blocksize into which data is to be divided and averaged
    # Sum within each N2-wide slice, then across N1 such slices, and normalize.
    # Note: integer division (//) keeps the reshape dimensions integral on Python 3.
    grouped_mean = np.array(data).reshape(-1, N2).sum(1).reshape(-1, N1, M2 // N2).sum(1) / (N1 * N2)
    # Return transposed and flattened version as output (as per OP)
    return grouped_mean.T.ravel()
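To trace what the two reshape/sum stages are doing, here is a step-by-step sketch on the 4x4 sample (the intermediate names stage1/stage2 are mine; M2 // N2 assumes Python 3 integer division):

```python
import numpy as np

data = [1, 5, 9, 13,
        2, 6, 10, 14,
        3, 7, 11, 15,
        4, 8, 12, 16]
M2, N1, N2 = 4, 2, 2

a = np.array(data).reshape(-1, N2)         # (8, 2): each row is one N2-wide slice
stage1 = a.sum(1)                          # (8,): sum within each slice
stage2 = stage1.reshape(-1, N1, M2 // N2)  # (2, 2, 2): group N1 slice-sums per block
gm = stage2.sum(1) / (N1 * N2)             # (2, 2): block means
out = gm.T.ravel()                         # transpose and flatten, as per OP
print(out)  # [ 3.5  5.5 11.5 13.5]
```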
Now, grouped_mean could be calculated with np.einsum instead of np.sum, like so -

stage1_sum = np.einsum('ij->i', np.array(data).reshape(-1, N2))
grouped_mean = np.einsum('ijk->ik', stage1_sum.reshape(-1, N1, M2 // N2)) / (N1 * N2)
Or, one can split the 2D input array into a 4D array, as suggested in @Warren Weckesser's solution, and then use np.einsum, like so -

split_data = np.array(data).reshape(-1, N1, M2 // N2, N2)
grouped_mean = np.einsum('ijkl->ik', split_data) / (N1 * N2)
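As a sanity check (my own snippet, not from the original post), all three routes should produce identical results on arbitrary input:

```python
import numpy as np

M2, N1, N2 = 8, 2, 2
data = np.random.randint(0, 9, 160).tolist()  # 20 rows x 8 cols, flattened
a = np.array(data)

# Approach 1: two-stage reshape + sum
r1 = a.reshape(-1, N2).sum(1).reshape(-1, N1, M2 // N2).sum(1) / (N1 * N2)
# Approach 2: two-stage einsum
r2 = np.einsum('ijk->ik',
               np.einsum('ij->i', a.reshape(-1, N2)).reshape(-1, N1, M2 // N2)) / (N1 * N2)
# Approach 3: single einsum over a 4D split
r3 = np.einsum('ijkl->ik', a.reshape(-1, N1, M2 // N2, N2)) / (N1 * N2)

assert np.allclose(r1, r2) and np.allclose(r1, r3)
```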
Sample run -
In [182]: data = np.array([[1,5,9,13],
...: [2,6,10,14],
...: [3,7,11,15],
...: [4,8,12,16]])
In [183]: grouped_mean(data,4,2,2)
Out[183]: array([ 3.5, 5.5, 11.5, 13.5])
Runtime tests

Calculating grouped_mean seems to be the most computationally intensive part of the code, so here are some runtime tests comparing the three approaches -
In [174]: import numpy as np
...: # Setup parameters and input list
...: M2 = 4000
...: N1 = 2
...: N2 = 2
...: data = np.random.randint(0,9,(16000000)).tolist()
...:
In [175]: %timeit np.array(data).reshape(-1,N2).sum(1).reshape(-1,N1,M2//N2).sum(1)/(N1*N2)
     ...: %timeit np.einsum('ijk->ik',np.einsum('ij->i',np.array(data).reshape(-1,N2)).reshape(-1,N1,M2//N2))/(N1*N2)
     ...: %timeit np.einsum('ijkl->ik',np.array(data).reshape(-1, N1, M2//N2, N2))/(N1*N2)
...:
1 loops, best of 3: 2.2 s per loop
1 loops, best of 3: 2.12 s per loop
1 loops, best of 3: 2.1 s per loop