I import pandas as pd, run the code below, and get the following result.
Code:
import pandas as pd

traindataset = pd.read_csv('/Users/train.csv')
print(traindataset.dtypes)
print(traindataset.shape)
print(traindataset.iloc[25, 3])
traindataset.dropna(how='any')
print(traindataset.iloc[25, 3])
print(traindataset.shape)
Output:
TripType int64
VisitNumber int64
Weekday object
Upc float64
ScanCount int64
DepartmentDescription object
FinelineNumber float64
dtype: object
(647054, 7)
nan
nan
(647054, 7)
[Finished in 2.2s]
From the result, the dropna line doesn't seem to work: the number of rows doesn't change and there is still a NaN in the DataFrame. Why is that? This is driving me crazy.
You need to read the documentation (emphasis added):
Return object with labels on given axis omitted
dropna returns a new DataFrame. If you want it to modify the existing DataFrame, all you have to do is read further in the documentation:
inplace : boolean, default False
If True, do operation inplace and return None.
So to modify it in place, do traindataset.dropna(how='any', inplace=True).
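The difference between the two calls can be seen in a minimal sketch. The small DataFrame here is a hypothetical stand-in for train.csv; only the column names Upc and ScanCount are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
df = pd.DataFrame({"Upc": [1.0, np.nan, 3.0], "ScanCount": [1, 2, 3]})

# Without inplace=True, dropna returns a new DataFrame and leaves df alone
cleaned = df.dropna(how="any")
shape_before = df.shape
print(shape_before)    # (3, 2): the original is unchanged
print(cleaned.shape)   # (2, 2): the row containing NaN is gone

# With inplace=True, df itself is modified and the call returns None
result = df.dropna(how="any", inplace=True)
print(result)          # None
print(df.shape)        # (2, 2)
```

Because the in-place call returns None, writing df = df.dropna(how='any', inplace=True) would replace df with None, so use one form or the other, not both.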
Alternatively, you can use the notnull() method to select the rows that are not null.
For example, to select the rows with non-null values in the columns country and variety of the DataFrame reviews:
answer = reviews.loc[reviews.country.notnull() & reviews.variety.notnull()]
Note that this only selects the relevant rows; to remove null values from the DataFrame itself, use the dropna() method.
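As a rough illustration, the notnull() mask can be compared with dropna restricted to the same columns through its subset parameter. The rows and the points column are made up for the example; only the names reviews, country and variety come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real reviews DataFrame is not shown in the thread
reviews = pd.DataFrame({
    "country": ["Italy", np.nan, "US", "France"],
    "variety": ["Merlot", "Syrah", np.nan, "Malbec"],
    "points": [87, 90, 85, np.nan],
})

# Boolean mask: keep rows where both columns are non-null
mask = reviews["country"].notnull() & reviews["variety"].notnull()
selected = reviews.loc[mask]

# dropna limited to the same two columns keeps the same rows
dropped = reviews.dropna(subset=["country", "variety"])

print(list(selected.index))
print(list(dropped.index))
```

Both approaches return a new object and keep the row whose points value is NaN, since that column is not part of the check.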
pd.DataFrame.dropna uses inplace=False by default. This is the norm for most pandas operations, though exceptions do exist, e.g. DataFrame.update, which always modifies the DataFrame in place.
Therefore, you must either assign back to your variable, or state explicitly inplace=True:
df = df.dropna(how='any') # assign back
df.dropna(how='any', inplace=True) # set inplace parameter
Stylistically, the former is often preferred because it supports method chaining, and the latter often yields little or no performance benefit.
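A short sketch of what chaining looks like in practice, assuming a hypothetical frame with the Weekday and ScanCount columns from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names borrowed from the question
df = pd.DataFrame({
    "Weekday": ["Friday", "Friday", None, "Monday"],
    "ScanCount": [1, 2, 3, np.nan],
})

# dropna returns a DataFrame, so further methods can be chained onto it
summary = (
    df.dropna(how="any")
      .groupby("Weekday")["ScanCount"]
      .sum()
)
print(summary)
```

The same chain cannot start from df.dropna(how='any', inplace=True), because that call returns None.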
Source: https://stackoverflow.com/questions/33643843/cant-drop-nan-with-dropna-in-pandas