Keep indices in Pandas DataFrame with a certain number of non-NaN entries

Submitted by 北慕城南 on 2021-01-22 05:02:03

Question


Let's say I have the following DataFrame:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data    = [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns = ['X'],
                   index   = ['a', 'a', 'a',
                              'b', 'b', 'b',
                              'c', 'c', 'c'])
print(df1)
     X
a  1.0
a  NaN
a  NaN
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I want to keep only the indices which have 2 or more non-NaN entries. In this case, the 'a' entries only have one non-NaN value, so I want to drop it and have my result be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

What is the best way to do this? Ideally I want something that works with Dask too, although usually if it works with Pandas it also works in Dask.
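For reference, the requirement can be sketched end to end as a minimal self-contained snippet: count the non-NaN values per index label, then keep only the labels with 2 or more.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df1 = pd.DataFrame({"X": [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1]},
                   index=list("aaabbbccc"))

# Count non-NaN values per index label.
counts = df1["X"].notna().groupby(level=0).sum()

# Keep only the labels with 2 or more non-NaN values.
keep = counts[counts >= 2].index
out = df1[df1.index.isin(keep)]
print(out)
```

The answers below give several variants of this same idea.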


Answer 1:


Let us try groupby with filter:

out = df1.groupby(level=0).filter(lambda x: x['X'].isna().sum() <= 1)
print(out)
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

Or we can use isin:

df1[df1.index.isin(df1.isna().groupby(level=0).sum().loc[lambda x: x['X'] <= 1].index)]
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
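As a quick sanity check, both variants select the same rows. A self-contained sketch using the question's df1 (NaN-aware comparison via DataFrame.equals):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"X": [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1]},
                   index=list("aaabbbccc"))

# Variant 1: drop whole groups with groupby(...).filter.
a = df1.groupby(level=0).filter(lambda x: x["X"].isna().sum() <= 1)

# Variant 2: count NaNs per label, keep labels with at most one NaN.
nan_counts = df1["X"].isna().groupby(level=0).sum()
b = df1[df1.index.isin(nan_counts[nan_counts <= 1].index)]

# DataFrame.equals treats matching NaNs as equal, so this passes.
assert a.equals(b)
print(a.index.tolist())
```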



Answer 2:


As another option, let's try filtering via GroupBy.transform and boolean indexing:

df1[df1['X'].isna().groupby(df1.index).transform('sum') <= 1]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

Or, almost the same way,

df1[df1.assign(X=df1['X'].isna()).groupby(level=0)['X'].transform('sum') <= 1]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

You might have a good shot at getting this to work with Dask too.
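One reason the transform route ports well: transform('sum') broadcasts each group's count back to every row, so the boolean mask is row-aligned with the frame and needs no reindexing. A small sketch illustrating that:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"X": [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1]},
                   index=list("aaabbbccc"))

# transform('sum') returns one value per ROW (its group's total),
# so the result has the same length and index as df1.
nan_per_group = df1["X"].isna().groupby(df1.index).transform("sum")
print(nan_per_group.tolist())

# The mask therefore aligns positionally with df1.
out = df1[nan_per_group <= 1]
print(out)
```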




Answer 3:


I am new to Dask, but after looking at some examples and the docs, the following seems to work:

from dask import dataframe as dd

sd = dd.from_pandas(df1, npartitions=3)

# Convert X to boolean with isna(), then group by the index and sum.
s = sd.X.isna().groupby(sd.index).sum().compute()

# Boolean-index the counts to find labels with fewer than 2 NaNs,
# then select those labels with loc.
out_dd = sd.loc[list(s[s < 2].index)]

out_dd.head(6, npartitions=-1)

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
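For readers without Dask installed: this approach deliberately uses only operations that exist in both libraries (a groupby-sum on the index, then label-based loc), so the same logic can be checked in plain pandas:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"X": [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1]},
                   index=list("aaabbbccc"))

# Per-label NaN counts, exactly as in the Dask version above.
s = df1["X"].isna().groupby(df1.index).sum()

# Keep the labels with fewer than 2 NaNs via label-based loc.
out = df1.loc[list(s[s < 2].index)]
print(out)
```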



Answer 4:


Here is another way:

df1.loc[df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)]


     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0



Answer 5:


I am new to Dask and don't even have it installed on my laptop, but I read through the Dask documentation and found that it supports reset_index().

If that's allowed, here's how to approach the problem.

Step 1:

df2 = df1.reset_index()

df2 will give you:

>>> df2
  index    X
0     a  1.0
1     a  NaN
2     a  NaN
3     b  1.0
4     b  1.0
5     b  NaN
6     c  1.0
7     c  1.0
8     c  1.0

Now you have the index labels and the values of X as ordinary columns.

Step 2:

To find out which index values have fewer than 2 nulls, you can do:

df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2

The result of this will be:

index
a    False
b     True
c     True
Name: X, dtype: bool

Step 3:

Now apply this mask back to the original dataframe df1; the filtered records will be those with fewer than 2 NaNs.

df1.loc[df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2]

The result of this will be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I hope Dask allows you to do this. If it does, this will be a way to get the result.
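Putting the three steps together as one runnable snippet (using df2 for the reset frame so the question's df1 is not clobbered):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"X": [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1]},
                   index=list("aaabbbccc"))

# Step 1: move the index into a regular column.
df2 = df1.reset_index()

# Step 2: per-label null counts as a boolean keep/drop Series.
mask = df2["X"].isnull().groupby([df2["index"]]).sum().astype(int) < 2

# Step 3: boolean-select on the original frame; pandas aligns the
# mask (indexed a/b/c) against df1's repeated index labels.
out = df1.loc[mask]
print(out)
```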




Answer 6:


You can use loc with a Series of booleans:

df1.loc[df1['X'].notna().groupby(level=0).sum().ge(2)]

In the first step we get the Series for filtering:

mask = df1['X'].notna().groupby(level=0).sum().ge(2)

Result:

a    False
b     True
c     True
Name: X, dtype: bool

In the second step we filter using loc:

df1.loc[mask]

Result:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0


Source: https://stackoverflow.com/questions/65571812/keep-indices-in-pandas-dataframe-with-a-certain-number-of-non-nan-entires
