Get the mean across multiple Pandas DataFrames

礼貌的吻别 2020-12-04 15:39

I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.

5 answers
  • 2020-12-04 15:57

    Here is a solution: first stack both dataframes so that each becomes a Series with a MultiIndex of (row label, column name). Series addition and division then align automatically on that index. Finally, unstack the result to get back a dataframe. In code:

    averages = (df1.stack()+df2.stack())/2
    averages = averages.unstack()
    

    And you're done.

    Or, more generally, for any number of dataframes:

    dfs = [df1, df2]
    # one column per frame, averaged across columns, then back to 2-D
    averages = (
        pd.concat([each.stack() for each in dfs], axis=1)
        .mean(axis=1)
        .unstack()
    )
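
    The question also asks for the median, which the add-and-divide version cannot provide. Below is a minimal sketch of the same stack/concat idea covering both statistics. It assumes every frame shares the same index and columns; the frames a, b and c are made-up illustration data, not from the question.

```python
import pandas as pd

# Hypothetical example frames with identical shape, index and columns.
a = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
b = pd.DataFrame({"x": [3.0, 4.0], "y": [5.0, 6.0]})
c = pd.DataFrame({"x": [8.0, 0.0], "y": [1.0, 2.0]})
dfs = [a, b, c]

# One column per frame, one row per (row label, column name) pair.
stacked = pd.concat([df.stack() for df in dfs], axis=1)

# Reduce across the frames, then restore the original 2-D layout.
cell_mean = stacked.mean(axis=1).unstack()
cell_median = stacked.median(axis=1).unstack()

print(cell_mean)
print(cell_median)
```

    For the cell at row 0, column x, the three inputs are 1, 3 and 8, so the mean is 4 and the median is 3.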
    
  • 2020-12-04 15:59

    My approach is similar to @ali_m's, but since you want one mean per row-column combination, the final step is different:

    df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
    df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
    df = pd.concat([df1, df2])
    foo = df.groupby(level=0).mean()  # group on the shared row index (level=1 requires a MultiIndex)
    foo.head()
    
              x    y
    0  0.841282  2.5
    1  0.716749  1.0
    2 -0.551903  2.5
    3  1.240736  1.5
    4  1.227109  2.0
    
  • 2020-12-04 16:11

    You can assign a label to each frame (call it group), then concat and groupby on that label. Note that this gives the column means of each frame separately:

    In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
    
    In [58]: df2 = df.copy()
    
    In [59]: dfs = [df, df2]
    
    In [60]: df
    Out[60]:
            a       b       c       d
    0  0.1959  0.1260  0.1464  0.1631
    1  0.9344 -1.8154  1.4529 -0.6334
    2  0.0390  0.4810  1.1779 -1.1799
    3  0.3542  0.3819 -2.0895  0.8877
    4 -2.2898 -1.0585  0.8083 -0.2126
    5  0.3727 -0.6867 -1.3440 -1.4849
    6 -1.1785  0.0885  1.0945 -1.6271
    7 -1.7169  0.3760 -1.4078  0.8994
    8  0.0508  0.4891  0.0274 -0.6369
    9 -0.7019  1.0425 -0.5476 -0.5143
    
    In [61]: for i, d in enumerate(dfs):
       ....:     d['group'] = i
       ....:
    
    In [62]: dfs[0]
    Out[62]:
            a       b       c       d  group
    0  0.1959  0.1260  0.1464  0.1631      0
    1  0.9344 -1.8154  1.4529 -0.6334      0
    2  0.0390  0.4810  1.1779 -1.1799      0
    3  0.3542  0.3819 -2.0895  0.8877      0
    4 -2.2898 -1.0585  0.8083 -0.2126      0
    5  0.3727 -0.6867 -1.3440 -1.4849      0
    6 -1.1785  0.0885  1.0945 -1.6271      0
    7 -1.7169  0.3760 -1.4078  0.8994      0
    8  0.0508  0.4891  0.0274 -0.6369      0
    9 -0.7019  1.0425 -0.5476 -0.5143      0
    
    In [63]: final = pd.concat(dfs, ignore_index=True)
    
    In [64]: final
    Out[64]:
             a       b       c       d  group
    0   0.1959  0.1260  0.1464  0.1631      0
    1   0.9344 -1.8154  1.4529 -0.6334      0
    2   0.0390  0.4810  1.1779 -1.1799      0
    3   0.3542  0.3819 -2.0895  0.8877      0
    4  -2.2898 -1.0585  0.8083 -0.2126      0
    5   0.3727 -0.6867 -1.3440 -1.4849      0
    6  -1.1785  0.0885  1.0945 -1.6271      0
    ..     ...     ...     ...     ...    ...
    13  0.3542  0.3819 -2.0895  0.8877      1
    14 -2.2898 -1.0585  0.8083 -0.2126      1
    15  0.3727 -0.6867 -1.3440 -1.4849      1
    16 -1.1785  0.0885  1.0945 -1.6271      1
    17 -1.7169  0.3760 -1.4078  0.8994      1
    18  0.0508  0.4891  0.0274 -0.6369      1
    19 -0.7019  1.0425 -0.5476 -0.5143      1
    
    [20 rows x 5 columns]
    
    In [65]: final.groupby('group').mean()
    Out[65]:
               a       b       c       d
    group
    0     -0.394 -0.0576 -0.0682 -0.4339
    1     -0.394 -0.0576 -0.0682 -0.4339
    

    Here, each group is the same, but that's only because df == df2.
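
    Writing a group column modifies the input frames in place, which is why df2 = df.copy() is needed above. As a hedged alternative, pd.concat can attach the label itself through its keys argument, leaving the inputs untouched. The frames df_a and df_b and the level names group and row are illustrative only.

```python
import pandas as pd

# Illustrative frames; any frames with the same columns would do.
df_a = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})
df_b = pd.DataFrame({"a": [5.0, 7.0], "b": [50.0, 70.0]})

# keys= adds an outer index level that identifies the source frame.
final = pd.concat([df_a, df_b], keys=[0, 1], names=["group", "row"])

# Column means within each source frame.
per_frame = final.groupby(level="group").mean()
print(per_frame)

# The inputs keep their original columns.
print(list(df_a.columns))
```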

    Alternatively, you can throw the frames into a Panel:

    In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
    
    In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
    
    In [71]: panel = pd.Panel({0: df, 1: df2})
    
    In [72]: panel
    Out[72]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
    Items axis: 0 to 1
    Major_axis axis: 0 to 9
    Minor_axis axis: a to d
    
    In [73]: panel.mean()
    Out[73]:
            0       1
    a  0.3839  0.2956
    b  0.1855 -0.3164
    c -0.1167 -0.0627
    d -0.2338 -0.0450
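
    pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so the session above only runs on older installations. Here is a rough sketch of the same per-frame column means without Panel, assuming a dict of frames keyed like the Panel items. The names frames and per_frame_means are mine, not from the answer, and the random values will differ from the output shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
frames = {
    0: pd.DataFrame(rng.standard_normal((10, 4)), columns=list("abcd")),
    1: pd.DataFrame(rng.standard_normal((10, 4)), columns=list("abcd")),
}

# One column of column-means per frame: rows a..d, columns 0 and 1,
# which is the same layout as the panel.mean() output.
per_frame_means = pd.DataFrame({key: df.mean() for key, df in frames.items()})
print(per_frame_means)
```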
    
  • 2020-12-04 16:17

    Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:

    import numpy as np
    import pandas as pd
    
    # some random data frames
    df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
    df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
    
    # concatenate them
    df_concat = pd.concat((df1, df2))
    
    print(df_concat.mean())
    # x   -0.163044
    # y    2.120000
    # dtype: float64
    
    print(df_concat.median())
    # x   -0.192037
    # y    2.000000
    # dtype: float64
    

    Update

    If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:

    by_row_index = df_concat.groupby(df_concat.index)
    df_means = by_row_index.mean()
    
    print(df_means.head())
    #           x    y
    # 0 -0.850794  1.5
    # 1  0.159038  1.5
    # 2  0.083278  1.0
    # 3 -0.540336  0.5
    # 4  0.390954  3.5
    

    This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
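
    To make the unequal-length case concrete, here is a small sketch with made-up frames (shorter and longer are hypothetical names). Index 2 exists only in the longer frame, so its mean and median are just that single row.

```python
import pandas as pd

# Hypothetical frames: `shorter` has rows 0-1, `longer` has rows 0-2.
shorter = pd.DataFrame({"x": [1.0, 2.0]})
longer = pd.DataFrame({"x": [3.0, 6.0, 9.0]})

df_concat = pd.concat((shorter, longer))
by_row_index = df_concat.groupby(df_concat.index)

df_means = by_row_index.mean()
df_medians = by_row_index.median()

print(df_means)
print(df_medians)
```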

  • 2020-12-04 16:19

    As per Niklas' comment, the solution to the question is panel.mean(axis=0).

    As a more complete example:

    import pandas as pd
    import numpy as np
    
    dfs = {}
    nrows = 4
    ncols = 3
    for i in range(4):
        dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
                              columns=list('abc'))
        print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
    
    panel = pd.Panel(dfs)
    print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
    

    This gives the following output:

    DF0:
       a   b   c
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11
    
    DF1:
        a   b   c
    0   1   2   3
    1   4   5   6
    2   7   8   9
    3  10  11  12
    
    DF2:
        a   b   c
    0   2   3   4
    1   5   6   7
    2   8   9  10
    3  11  12  13
    
    DF3:
        a   b   c
    0   3   4   5
    1   6   7   8
    2   9  10  11
    3  12  13  14
    
    Mean of stacked DFs:
          a     b     c
    0   1.5   2.5   3.5
    1   4.5   5.5   6.5
    2   7.5   8.5   9.5
    3  10.5  11.5  12.5
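
    On pandas versions that no longer ship Panel, one possible stand-in for panel.mean(axis=0) is to stack the underlying arrays with NumPy and reduce over the new leading axis. This is a sketch, not the answer's own method. It assumes all frames share the same index and columns, and it rebuilds the same dfs as above; cube, mean_df and median_df are names chosen for illustration.

```python
import numpy as np
import pandas as pd

nrows, ncols = 4, 3
dfs = {
    i: pd.DataFrame(np.arange(i, nrows * ncols + i).reshape(nrows, ncols),
                    columns=list("abc"))
    for i in range(4)
}

first = dfs[0]
# Shape (n_frames, nrows, ncols); axis 0 runs across the frames.
cube = np.stack([df.to_numpy() for df in dfs.values()])

# Reduce over axis 0 and reattach the labels of the first frame.
mean_df = pd.DataFrame(cube.mean(axis=0), index=first.index, columns=first.columns)
median_df = pd.DataFrame(np.median(cube, axis=0), index=first.index, columns=first.columns)

print(mean_df)
print(median_df)
```

    Because the four frames here differ by a constant step, the element-wise median equals the mean; with real data the two will generally differ.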
    