Question
Let's say I have the following dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a',
                          'b', 'b', 'b',
                          'c', 'c', 'c'])
print(df1)
X
a 1.0
a NaN
a NaN
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
I want to keep only the indices which have 2 or more non-NaN entries. In this case, the 'a' entries only have one non-NaN value, so I want to drop it and have my result be:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
What is the best way to do this? Ideally I want something that works with Dask too, although usually if it works with Pandas it also works in Dask.
Answer 1:
Let's try filter:
out = df1.groupby(level=0).filter(lambda x: x['X'].isna().sum() <= 1)
print(out)
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Or we can do it with isin:
df1[df1.index.isin(df1.isna().groupby(level=0).sum().loc[lambda x: x['X'] <= 1].index)]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
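Counting NaNs with <= 1 coincides with the requirement here only because every index has exactly three rows. A minimal sketch (not from the answer) that encodes the stated condition, "2 or more non-NaN entries", directly:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])

# keep only groups whose 'X' column has at least 2 non-NaN values
out = df1.groupby(level=0).filter(lambda x: x['X'].notna().sum() >= 2)
print(out)
```

Unlike counting NaNs, this condition keeps working even if the index groups have different sizes.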
Answer 2:
As another option, let's try filtering via GroupBy.transform and boolean indexing:
df1[df1['X'].isna().groupby(df1.index).transform('sum') <= 1]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Or, in almost the same way:
df1[df1.assign(X=df1['X'].isna()).groupby(level=0)['X'].transform('sum') <= 1]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
You might have a good shot at getting this to work with Dask too.
Answer 3:
I am new to Dask; I looked at some examples and the docs, and the following seems to work:
from dask import dataframe as dd
sd = dd.from_pandas(df1, npartitions=3)
# convert X to booleans with isna(), then group by index and sum the NaN counts
s = sd.X.isna().groupby(sd.index).sum().compute()
# keep the index labels whose NaN count is less than 2, then select them with loc
out_dd = sd.loc[list(s[s < 2].index)]
out_dd.head(6, npartitions=-1)
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Answer 4:
Here is another way:
df1.loc[df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Answer 5:
I am new to Dask; I don't even have it installed on my laptop. But I read through the Dask documentation and found that Dask supports reset_index().
If that's allowed, here's how to approach the problem.
Step 1:
df2 = df1.reset_index()
df2 will give you:
>>> df2
  index    X
0     a  1.0
1     a  NaN
2     a  NaN
3     b  1.0
4     b  1.0
5     b  NaN
6     c  1.0
7     c  1.0
8     c  1.0
Now you have the index and value of X.
Step 2:
To find out which index values have fewer than 2 nulls (i.e., 2 or more non-NaN entries), you can do:
df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2
The result of this will be:
index
a    False
b     True
c     True
Name: X, dtype: bool
Step 3:
You now apply this mask back to the original dataframe df1; the records kept are those with fewer than 2 NaNs:
df1.loc[df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2]
The result of this will be:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
I hope Dask allows you to do this. If it does, this will be a way to get the result.
Answer 6:
You can use loc with a Series of booleans:
df1.loc[df1['X'].notna().groupby(level=0).sum().ge(2)]
In the first step we get the Series for filtering:
mask = df1['X'].notna().groupby(level=0).sum().ge(2)
Result:
a    False
b     True
c     True
Name: X, dtype: bool
In the second step we filter using loc:
df1.loc[mask]
Result:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
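As a quick sanity check (a sketch, not part of any answer), the loc-with-mask pattern and the groupby filter pattern agree on the example data; count() is used here because it counts non-NaN entries per group:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])

# mask approach: count() returns the number of non-NaN entries per index label,
# and .loc aligns the unique-index boolean Series against the duplicated index
via_mask = df1.loc[df1.groupby(level=0)['X'].count().ge(2)]

# filter approach: keep groups with at least 2 non-NaN entries
via_filter = df1.groupby(level=0).filter(lambda x: x['X'].notna().sum() >= 2)

assert via_mask.equals(via_filter)
print(via_mask)
```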
Source: https://stackoverflow.com/questions/65571812/keep-indices-in-pandas-dataframe-with-a-certain-number-of-non-nan-entires