Question
Let's say I have the following dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a',
                          'b', 'b', 'b',
                          'c', 'c', 'c'])
print(df1)
X
a 1.0
a NaN
a NaN
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
I want to keep only the indices which have 2 or more non-NaN entries. In this case, the 'a' entries only have one non-NaN value, so I want to drop it and have my result be:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
What is the best way to do this? Ideally I want something that works with Dask too, although usually if it works with Pandas it also works in Dask.
Answer 1:
Let's try filter:
out = df1.groupby(level=0).filter(lambda x: x['X'].isna().sum() <= 1)
print(out)
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Or we can do it with isin:
df1[df1.index.isin(df1.isna().groupby(level=0).sum().loc[lambda x: x['X'] <= 1].index)]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
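Counting NaNs with <= 1 coincides with the requirement here only because every index has exactly three rows. A minimal sketch (not from the answer) that encodes the stated condition, "2 or more non-NaN entries", directly:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])

# keep only groups whose 'X' column has at least 2 non-NaN values
out = df1.groupby(level=0).filter(lambda x: x['X'].notna().sum() >= 2)
print(out)
```

Unlike counting NaNs, this condition keeps working even if the index groups have different sizes.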
Answer 2:
As another option, let's try filtering via GroupBy.transform and boolean indexing:
df1[df1['X'].isna().groupby(df1.index).transform('sum') <= 1]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Or, in almost the same way:
df1[df1.assign(X=df1['X'].isna()).groupby(level=0)['X'].transform('sum') <= 1]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
You might have a good shot at getting this to work with Dask too.
Answer 3:
I am new to Dask; I looked at some examples and the docs, and the following seems to work:
from dask import dataframe as dd
sd = dd.from_pandas(df1, npartitions=3)
# convert X to booleans with isna(), then group by index and sum the NaN counts
s = sd.X.isna().groupby(sd.index).sum().compute()
# keep the index labels whose NaN count is less than 2, then select them with loc
out_dd = sd.loc[list(s[s < 2].index)]
out_dd.head(6, npartitions=-1)
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Answer 4:
Here is another way:
df1.loc[df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Answer 5:
I am new to Dask; I don't even have it installed on my laptop. But I read through the Dask documentation and found that Dask supports reset_index().
If that's allowed, here's how to approach the problem.
Step 1:
df2 = df1.reset_index()
df2 will give you:
>>> df2
  index    X
0     a  1.0
1     a  NaN
2     a  NaN
3     b  1.0
4     b  1.0
5     b  NaN
6     c  1.0
7     c  1.0
8     c  1.0
Now you have the index and value of X.
Step 2:
To find out which index values have fewer than 2 nulls (i.e., 2 or more non-NaN entries), you can do:
df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2
The result of this will be:
index
a    False
b     True
c     True
Name: X, dtype: bool
Step 3:
You now apply this mask back to the original dataframe df1; the records kept are those with fewer than 2 NaNs:
df1.loc[df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2]
The result of this will be:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
I hope Dask allows you to do this. If it does, this will be a way to get the result.
Answer 6:
You can use loc with a Series of booleans:
df1.loc[df1['X'].notna().groupby(level=0).sum().ge(2)]
In the first step we get the Series for filtering:
mask = df1['X'].notna().groupby(level=0).sum().ge(2)
Result:
a    False
b     True
c     True
Name: X, dtype: bool
In the second step we filter using loc:
df1.loc[mask]
Result:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
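As a quick sanity check (a sketch, not part of any answer), the loc-with-mask pattern and the groupby filter pattern agree on the example data; count() is used here because it counts non-NaN entries per group:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])

# mask approach: count() returns the number of non-NaN entries per index label,
# and .loc aligns the unique-index boolean Series against the duplicated index
via_mask = df1.loc[df1.groupby(level=0)['X'].count().ge(2)]

# filter approach: keep groups with at least 2 non-NaN entries
via_filter = df1.groupby(level=0).filter(lambda x: x['X'].notna().sum() >= 2)

assert via_mask.equals(via_filter)
print(via_mask)
```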
Source: https://stackoverflow.com/questions/65571812/keep-indices-in-pandas-dataframe-with-a-certain-number-of-non-nan-entires