Pandas Dataframe: Replacing NaN with row average

Question

I am trying to learn pandas, but I have been puzzled by the following. I want to replace the NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing something, or doing something wrong? Is it because it's not implemented (see link here)?

import pandas as pd
import numpy as np

In [44]:
pd.__version__
Out[44]:
'0.15.2'

In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df

Out[45]:
    c1  c2  c3
0   1   4   7
1   2   5   NaN
2   3   6   9

In [46]:  
df.fillna(df.mean(axis=1)) 

Out[46]:
    c1  c2  c3
0   1   4   7
1   2   5   NaN
2   3   6   9

However, something like this appears to work fine:

df.fillna(df.mean(axis=0)) 

Out[47]:
    c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9

Answer 1:

As commented, the axis argument to fillna is not implemented for filling with a Series, so the following does not work:

df.fillna(df.mean(axis=1), axis=1)

Note: the axis argument would be critical here, as you don't want to fill the nth column with the nth row average.
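To make the point above concrete, here is a minimal sketch (the variable names row_means, col_means, by_row and by_col are mine, not from the question) of why the original call silently does nothing. When fillna is given a Series for a DataFrame, it treats the Series index as column labels; the row means are indexed by row labels, which match no column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6], "c3": [7, np.nan, 9]})

row_means = df.mean(axis=1)  # indexed by row labels 0, 1, 2
col_means = df.mean(axis=0)  # indexed by column labels c1, c2, c3

# The Series index is matched against column labels.
# Row labels 0, 1, 2 match no column, so nothing is filled.
by_row = df.fillna(row_means)

# Column labels do match, so the NaN in c3 gets the c3 mean.
by_col = df.fillna(col_means)

print(by_row["c3"].isna().sum())  # 1 (the NaN is still there)
print(by_col.loc[1, "c3"])        # 8.0 (column mean, not row mean)
```

This is also why the axis=0 version in the question appears to work: it fills with the column average, which is not what was asked for.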

For now you'll need to iterate through:

In [11]: m = df.mean(axis=1)
         for i, col in enumerate(df):
             # using i allows for duplicate columns
             # inplace *may* not always work here, so IMO the next line is preferred
             # df.iloc[:, i].fillna(m, inplace=True)
             df.iloc[:, i] = df.iloc[:, i].fillna(m)

In [12]: df
Out[12]:
   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0

An alternative is to fillna the transpose and then transpose back, which may be more efficient:

df.T.fillna(df.mean(axis=1)).T
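As a self-contained sketch of the transpose approach, using the frame from the question (the name filled is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6], "c3": [7, np.nan, 9]})

# After transposing, the original row labels (0, 1, 2) become column labels,
# so the row means line up with the columns that fillna fills by default.
filled = df.T.fillna(df.mean(axis=1)).T

print(filled.loc[1, "c3"])  # 3.5, the mean of 2 and 5
```

One thing to be aware of: transposing a frame with mixed int and float columns upcasts everything to float, so c1 and c2 come back as floats here.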

Answer 2:

As an alternative, you could also use an apply with a lambda expression like this:

df.apply(lambda row: row.fillna(row.mean()), axis=1)

which also yields

    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0

Answer 3:

I'll propose an alternative that involves casting to NumPy arrays. Performance-wise, I think this is more efficient and probably scales better than the other solutions proposed so far.

The idea is to use an indicator matrix (df.isna().values, which is 1 where the element is N/A and 0 otherwise) and broadcast-multiply it by the row averages. This gives a matrix of exactly the same shape as the original df that contains the row average wherever the original element was N/A, and 0 elsewhere.

We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.

# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]

# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df

giving

    c1   c2   c3
0   1.0  4.0  7.0
1   2.0  5.0  3.5
2   3.0  6.0  9.0
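To see the intermediate pieces, here is the same computation split into steps, with a cross-check against the row-wise apply approach. This is only a sketch; the names mask, fill, result and expected are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6], "c3": [7, np.nan, 9]})

mask = df.isna().values                           # True where a value is missing
row_avgs = df.mean(axis=1).values.reshape(-1, 1)  # shape (3, 1); mean skips NaN
fill = mask * row_avgs                            # row average where missing, else 0

result = df.fillna(0) + fill

# Cross-check against the apply-based approach (comparing values only).
expected = df.apply(lambda row: row.fillna(row.mean()), axis=1)
print(fill[1, 2])                                   # 3.5
print(np.allclose(result.values, expected.values))  # True
```

The reshape to a column vector is what makes the broadcast work: each row of the mask is scaled by that row's own average.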

Source: https://stackoverflow.com/questions/33058590/pandas-dataframe-replacing-nan-with-row-average

Tags: python, pandas, dataframe, missing-data