Question
I want to calculate the mean of all the values in selected columns in a dataframe. For example, I have a dataframe with columns A, B, C, D and E and I want the mean of all the values in columns A, C and E.
import pandas as pd
df1 = pd.DataFrame( ( {'A': [1,2,3,4,5],
'B': [10,20,30,40,50],
'C': [11,21,31,41,51],
'D': [12,22,32,42,52],
'E': [13,23,33,43,53]} ) )
print( df1 )
print( "Mean of df1:", df1.mean() )
df2 = pd.concat( [df1['A'], df1['C'], df1['E'] ], ignore_index=True )
print( df2 )
print( "Mean of df2:", df2.mean() )
df3 = pd.DataFrame()
df3 = pd.concat( [ df3, df1['A'] ], ignore_index=True )
df3 = pd.concat( [ df3, df1['C'] ], ignore_index=True )
df3 = pd.concat( [ df3, df1['E'] ], ignore_index=True )
print( df3 )
print( "Mean of df3:", df3.mean() )
df2 gets me the right answer, but I need to create a new dataframe to get it.
I thought something like df1[['A', 'C', 'E']].mean()
would work, but it returns the mean of each column, not the combined average. Is there a way to do this without creating a new dataframe? I also need other statistics such as .std(), .min() and .max(), so this isn't just a one-off calculation.
Answer 1:
You have two options that I know of:
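For mean(), min() and max() you can either aggregate each column and then aggregate those results (mean of means, min of mins, max of maxes), or aggregate the underlying NumPy array directly. Both give the mean, min and max of all the elements of A, C and E. The mean of means matches the overall mean only because every column holds the same number of values.
So you can use (with import numpy as np):
For mean():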
df1[['A','C','E']].apply(np.mean).mean()
df1[['A','C','E']].values.mean()
Any one of the above should give you the mean of all the elements of columns A, C, E.
For min():
df1[['A','C','E']].apply(np.min).min()
df1[['A','C','E']].values.min()
For max():
df1[['A','C','E']].apply(np.max).max()
df1[['A','C','E']].values.max()
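For std():
df1[['A','C','E']].apply(np.std).std()  # runs without error, but does not give the value you want
df1[['A','C','E']].values.std()  # the std of all the elements of columns A, C, E
The std of stds is not the std of all the elements. Also note that NumPy's std() defaults to ddof=0 (population std), whereas pandas' std() defaults to ddof=1 (sample std); pass ddof=1 to the NumPy call if you want it to match pandas.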
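To make the std point concrete, here is a minimal sketch using the df1 from the question. The variable names (values, pooled_std, sample_std, std_of_stds) are only illustrative, and it assumes the selected columns contain no missing values.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [11, 21, 31, 41, 51],
    'D': [12, 22, 32, 42, 52],
    'E': [13, 23, 33, 43, 53],
})

# 2D NumPy array of the selected columns, shape (5, 3)
values = df1[['A', 'C', 'E']].to_numpy()

# Std over all 15 elements; NumPy's default is ddof=0 (population std)
pooled_std = values.std()

# Same elements with ddof=1, which is the pandas default for Series.std()
sample_std = values.std(ddof=1)

# Std of the three per-column stds: a different quantity altogether
std_of_stds = df1[['A', 'C', 'E']].std().std()

print("pooled (ddof=0):", pooled_std)
print("pooled (ddof=1):", sample_std)
print("std of stds:    ", std_of_stds)
```

Which ddof is appropriate depends on whether you treat the values as a whole population or as a sample.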
Answer 2:
You can reshape the DataFrame to a Series with a MultiIndex by DataFrame.stack and then use mean:
df2 = df1[['A', 'C', 'E']].stack()
print (df2)
0  A     1
   C    11
   E    13
1  A     2
   C    21
   E    23
2  A     3
   C    31
   E    33
3  A     4
   C    41
   E    43
4  A     5
   C    51
   E    53
dtype: int64
print( "Mean of df2:", df2.mean() )
Mean of df2: 22.333333333333332
Another idea is to convert the values to a 2D NumPy array and then use np.mean (this needs import numpy as np):
df21 = df1[['A', 'C', 'E']]
print( df21 )
   A   C   E
0  1  11  13
1  2  21  23
2  3  31  33
3  4  41  43
4  5  51  53
print(df21.to_numpy())
[[ 1 11 13]
 [ 2 21 23]
 [ 3 31 33]
 [ 4 41 43]
 [ 5 51 53]]
print( "Mean of df2:", np.mean(df21.to_numpy()) )
Mean of df2: 22.333333333333332
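Since the question also asks for .std(), .min() and .max(), one possible way to get them all in a single pass is to flatten the selected columns and call Series.agg. This is only a sketch: the names flat and stats are made up for illustration, and it does create a temporary Series, although no new named DataFrame is needed.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [11, 21, 31, 41, 51],
    'D': [12, 22, 32, 42, 52],
    'E': [13, 23, 33, 43, 53],
})

# Flatten the three selected columns into one 1D Series of 15 values
flat = pd.Series(df1[['A', 'C', 'E']].to_numpy().ravel())

# Compute several statistics at once; std here uses the pandas default ddof=1
stats = flat.agg(['mean', 'std', 'min', 'max'])
print(stats)
```

The result is a Series indexed by statistic name, so individual values can be read as stats['mean'], stats['std'] and so on.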
Answer 3:
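Caveat: this is only correct if every selected column holds the same number of values, i.e. there are no missing values. Otherwise the mean of the row means differs from the overall mean and you get the wrong answer (as the comments pointed out).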
mean = df1[['A', 'C', 'E']].mean(axis=1).mean()
print(mean)
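A small sketch of why the caveat matters. The DataFrame below is hypothetical (it is not the df1 from the question) and has one missing value in column A, so the rows no longer contribute equally.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column A
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan],
    'C': [10.0, 20.0, 30.0],
})

# Row means skip NaN by default: 5.5, 11.0, 30.0, then their mean
mean_of_row_means = df[['A', 'C']].mean(axis=1).mean()

# Mean over all five non-missing elements
overall_mean = np.nanmean(df[['A', 'C']].to_numpy())

print(mean_of_row_means)  # 15.5
print(overall_mean)       # 12.6
```

The third row has only one value, yet it gets the same weight as the other rows in the mean of row means, which is what pulls the result away from the overall mean.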
Source: https://stackoverflow.com/questions/61426161/get-mean-of-multiple-selected-columns-in-a-pandas-dataframe