I have a dataframe with some columns containing nan. I\'d like to drop those columns with certain number of nan. For example, in the following code, I\'d like to drop any co
There is a thresh
param for dropna, you just need to pass the length of your df - the number of NaN
values you want as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above will drop any column that does not meet the criteria of the length of the df (number of rows) - 2 as the number of non-Na values.
Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:,list((100*(data.isnull().sum()/len(data.index))>70))].columns, 1)
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print s
A 1
B 1
C 3
dtype: int64
for col in dff:
if s[col] >= 2:
del dff[col]
Or
for c in dff:
if sum(dff[c].isnull()) >= 2:
dff.drop(c, axis=1, inplace=True)
I recommend the drop
-method. This is an alternative solution:
dff.drop(dff.loc[:,len(dff) - dff.isnull().sum() <2], axis=1)