Pandas: Mean of columns with the same names


I have a dataframe with columns like:

['id', 'name', 'foo1', 'foo1', 'foo1', 'foo2', 'foo2', 'foo3']

I would like to get a new dat


1 Answer

再見小時候

2021-02-05 20:30

The basic idea is to group the columns by name and take the mean of each group.

I read the comments on your question and tried to give you several ways to achieve this. (Solution (3) is the best I found!)

(1) Quick solution. If only a few columns are non-numeric and uniquely named, e.g., the columns id and name, you can do the following.

First, set ['id', 'name'] as the index to preserve them,

df = df.set_index(['id', 'name'])

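then call DataFrame.groupby on the column names with axis=1 (so that columns, not rows, are grouped) and take the mean of each group. Note that the axis argument of groupby is deprecated as of pandas 2.1; on recent versions, transposing and grouping on the index gives the same result.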

df.groupby(by=df.columns, axis=1).mean()

Finally, reset the index to recover the ['id', 'name'] columns:

df = df.reset_index()

Here is a sample session:

In [35]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))

In [36]: df = df.set_index(['id', 'name'])

In [37]: df = df.groupby(by=df.columns, axis=1).mean()

In [38]: df = df.reset_index()

In [39]: df
Out[39]: 
    id name  c1   c2    c3
0  001    a   1   55  1000
1  002    b   2  110  2000
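For readers on a recent pandas release, where groupby no longer accepts axis=1, the same three steps can be sketched with a transpose instead. This is only a minimal sketch under that assumption, not a replacement for the session above; the variable names indexed, averaged and result are mine.

```python
import pandas as pd

df = pd.DataFrame(
    [['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]],
    columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'],
)

# Step 1: move the non-numeric, uniquely named columns into the index.
indexed = df.set_index(['id', 'name'])

# Step 2: transpose so the column names become the row index, group rows
# that share a label, average them, and transpose back.
averaged = indexed.T.groupby(level=0).mean().T

# Step 3: restore 'id' and 'name' as ordinary columns.
result = averaged.reset_index()

print(result)
```

The means come back as floats (c2 is 55.0 and 110.0 here). groupby sorts the group labels by default, so pass sort=False if the original column order matters.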

(2) Complete solution. If you have many columns that are non-numeric and uniquely named, you can do the following.

First, transpose your dataframe,

df2 = df.transpose()

Then group by its index (axis=0), handling each group carefully: for numeric groups, return the mean; for non-numeric groups, return the first row:

df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])

And finally, transpose back:

df = df2.transpose()

Here is a sample session:

In [98]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))

In [99]: df2 = df.transpose()

In [100]: df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])

In [101]: df3 = df2.transpose()

In [102]: df3
Out[102]: 
  c1   c2    c3   id name
A  1   55  1000  001    a
B  2  110  2000  002    b

In [103]: df
Out[103]: 
    id name  c1  c2   c2    c3
A  001    a   1  10  100  1000
B  002    b   2  20  200  2000

Note that you need to import the numbers module (import numbers) for the isinstance check above.
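To make the isinstance test less mysterious, here is a small sketch of what numbers.Number accepts. The helper is_number is a hypothetical name used only for illustration; the point is that Python and NumPy numeric scalars pass the check while strings such as '001' do not.

```python
import numbers

import numpy as np


def is_number(value):
    """Hypothetical helper: True for Python and NumPy numeric scalars."""
    return isinstance(value, numbers.Number)


print(is_number(1))                # Python int
print(is_number(np.int64(1)))      # NumPy integer scalar
print(is_number(np.float64(2.5)))  # NumPy float scalar
print(is_number('001'))            # a string, even one made of digits
```

Only the last call prints False. Keep in mind that the check in the answer looks at a single cell (g.iloc[0,0]), so it assumes every column in a group has the same kind of values.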

More notes:

(3) All in one! This solution is the best I found:

df.groupby(by=df.columns, axis=1).apply(lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0])

Here each group is handled directly on the un-transposed dataframe, that is,

df.groupby(by=df.columns, axis=1).apply(gf)

where

gf = lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0]

My earlier attempts failed because I did not handle the axis carefully. You must set axis=1 for the mean function so that each group is averaged row by row, and return the first column (g.iloc[:,0]) for non-numeric groups.
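If groupby with axis=1 is not available in your pandas version, the same idea can be sketched with a plain loop over the unique column names. This is an alternative I am swapping in, not the one-liner above: the function name mean_of_duplicate_columns is hypothetical, and it decides "numeric" from the column dtypes (pandas.api.types.is_numeric_dtype) rather than from the first cell.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def mean_of_duplicate_columns(df):
    """Average numeric columns sharing a name; keep the first copy otherwise."""
    pieces = {}
    for name in df.columns.unique():
        # Boolean mask selection always yields a DataFrame, even for one column.
        block = df.loc[:, df.columns == name]
        if all(is_numeric_dtype(dtype) for dtype in block.dtypes):
            pieces[name] = block.mean(axis=1)  # row-wise mean across duplicates
        else:
            pieces[name] = block.iloc[:, 0]    # first column of the group
    return pd.DataFrame(pieces, index=df.index)


df = pd.DataFrame(
    [['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]],
    columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'],
    index=list('AB'),
)
result = mean_of_duplicate_columns(df)
print(result)
```

Unlike the groupby version, this keeps the columns in their order of first appearance (id, name, c1, c2, c3) and preserves the original index.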

Thanks!
