Normalize DataFrame by group

梦谈多话 2021-02-07 02:04

Let's say that I have some data generated as follows:

N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3

and t

4 Answers
  • 2021-02-07 02:26

    If the data contains many groups (thousands or more), the accepted answer may take a very long time to compute.

    Even though groupby.transform itself is fast, as are the already vectorized calls in the lambda function (.mean(), .std() and the subtraction), the call to the pure Python function for each group creates a considerable overhead.

    This can be avoided by using only vectorized Pandas/Numpy calls, with no Python function invoked per group, as shown in ErnestScribbler's answer.

    We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform:

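    import numpy as np
    import pandas as pd

    def normalize_by_group(df, by):
        groups = df.groupby(by)
        # Compute the group-wise mean/std; transform broadcasts each
        # statistic back to the rows of its group.
        # The string aliases select pandas' built-in implementations
        # (std uses ddof=1), matching x.mean() and x.std() in the lambda.
        mean = groups.transform("mean")
        std = groups.transform("std")
        return (df[mean.columns] - mean) / std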
    

    For benchmarking I changed the data generation from the original question to allow for more groups:

    def gen_data(N, num_groups):
        m = 3
        data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
        indx = np.random.randint(0,num_groups,size=N).astype(np.int32)
    
        df = pd.DataFrame(np.hstack((data, indx[:,None])), 
                          columns=['a%s' % k for k in range(m)] + [ 'indx'])
        return df
    

    With only two groups (thus only two Python function calls), the lambda version is only about 1.8x slower than the vectorized code:

    In: df2g = gen_data(10000, 2)  # 3 cols, 10000 rows, 2 groups
    
    In: %timeit normalize_by_group(df2g, "indx")
    6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    With the number of groups increased to 1000, the runtime issue becomes apparent. The lambda version is about 370x slower than the vectorized code:

    In: df1000g = gen_data(10000, 1000)  # 3 cols, 10000 rows, 1000 groups
    
    In: %timeit normalize_by_group(df1000g, "indx")
    7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
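
Before relying on the faster version, it may be worth checking that it returns the same numbers as the lambda version. The following is a minimal sketch, assuming a recent pandas and NumPy; the seed, the row count and the number of groups are arbitrary choices made for illustration and are not taken from the benchmark above.

```python
import numpy as np
import pandas as pd

# Illustrative data: 1000 rows, 3 value columns, 50 groups (assumed sizes).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a0", "a1", "a2"])
df["indx"] = rng.integers(0, 50, size=len(df))

groups = df.groupby("indx")

# Vectorized: each group statistic is broadcast back to the rows of its group.
mean = groups.transform("mean")
std = groups.transform("std")
vectorized = (df[mean.columns] - mean) / std

# Per-group Python callable, as in the accepted answer.
via_lambda = groups.transform(lambda x: (x - x.mean()) / x.std())

# equal_nan=True tolerates single-row groups, whose sample std is NaN.
same = np.allclose(vectorized.to_numpy(), via_lambda.to_numpy(), equal_nan=True)
print(same)
```

Both paths use the sample standard deviation (ddof=1), so the results should agree up to floating-point rounding.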
    
  • 2021-02-07 02:35

    The accepted answer works and is elegant. Unfortunately, for large datasets I think using .transform() is much slower than the following, less elegant approach (illustrated with a single column 'a0'):

    means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
    df = df.merge(means_stds,on='indx')
    df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']
    

    To do it for multiple columns you'll have to figure out the merge. My suggestion would be to flatten the multiindex columns from aggregation as in this answer and then merge and normalize for each column separately:

    means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
    means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
    df = df.merge(means_stds,on='indx')
    for col in ['a0','a1']:
        df[col+'_normalized'] = ( df[col] - df[col+'|mean'] ) / df[col+'|std']
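
The multi-column merge above can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `normalize_via_merge` is hypothetical, the sample data is made up, and `how="left"` is an addition here so that the rows keep the order they have in `df`.

```python
import numpy as np
import pandas as pd

def normalize_via_merge(df, by, cols):
    """Hypothetical helper: group-wise z-score via agg + merge."""
    # One row per group; flatten the MultiIndex columns to "a0|mean", "a0|std", ...
    stats = df.groupby(by)[cols].agg(["mean", "std"])
    stats.columns = ["%s|%s" % (col, stat) for col, stat in stats.columns]
    # A left merge keeps the rows of df in their original order.
    merged = df.merge(stats.reset_index(), on=by, how="left")
    out = pd.DataFrame(index=df.index)
    for col in cols:
        z = (merged[col] - merged[col + "|mean"]) / merged[col + "|std"]
        # merge resets the index, so assign by position rather than by label.
        out[col] = z.to_numpy()
    return out

# Illustrative data (assumed sizes and seed).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["a0", "a1"])
df["indx"] = rng.integers(0, 5, size=len(df))

normalized = normalize_via_merge(df, "indx", ["a0", "a1"])
```

Returning a separate frame instead of adding `_normalized` columns is a design choice of this sketch; it leaves `df` untouched and sidesteps the column naming.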
    
  • 2021-02-07 02:37
    In [10]: df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    

    should do it.
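
To see what the one-liner does, here is a tiny worked example. The data is made up for illustration: group 0 holds 1, 2, 3 (mean 2, sample std 1) and group 1 holds 10, 20, 30 (mean 20, sample std 10), so both groups should map to -1, 0, 1.

```python
import pandas as pd

# Rows alternate between group 0 and group 1.
df = pd.DataFrame({
    "a0": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
    "indx": [0, 1, 0, 1, 0, 1],
})

# Subtract the group mean and divide by the group sample std (ddof=1).
z = df.groupby("indx").transform(lambda x: (x - x.mean()) / x.std())
print(z["a0"].tolist())  # [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0]
```

The result keeps the original row order and drops the grouping column, which is why 'indx' does not appear in `z`.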

  • 2021-02-07 02:39

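    Although this is not the prettiest solution, you could do something like this (note that, as written, it only subtracts each group's mean and does not divide by the standard deviation):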

    indx = df['indx'].copy()
    for indices in df.groupby('indx').groups.values():
        df.loc[indices] -= df.loc[indices].mean()
    df['indx'] = indx
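
If the division by the standard deviation is also wanted, the same loop can be extended. The following is a sketch rather than the answerer's code: the names `value_cols` and `normalized` and the sample data are assumptions, and the loop is restricted to the value columns so that 'indx' does not need to be saved and restored.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed sizes and seed).
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a0", "a1", "a2"])
df["indx"] = rng.integers(0, 4, size=len(df))

# Only the value columns are normalized; the grouping column is left alone.
value_cols = [c for c in df.columns if c != "indx"]
normalized = df.copy()
for indices in df.groupby("indx").groups.values():
    block = df.loc[indices, value_cols]
    # Center and scale the rows of this group (sample std, ddof=1).
    normalized.loc[indices, value_cols] = (block - block.mean()) / block.std()
```

Like the original loop, this still runs one Python iteration per group, so it is expected to scale poorly with many groups.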
    