Question
I assign np.nan to the missing values in a column of a DataFrame, then write the DataFrame to a CSV file with to_csv. The resulting CSV file correctly has nothing between the commas for the missing values when I open it in a text editor. But when I read the file back into a DataFrame with read_csv, the missing values become the string 'nan' instead of NaN, so isnull() does not detect them. For example:
In [13]: df
Out[13]:
   index  value date
0    975  25.35  nan
1    976  26.28  nan
2    977  26.24  nan
3    978  25.76  nan
4    979  26.08  nan

In [14]: df.date.isnull()
Out[14]:
0    False
1    False
2    False
3    False
4    False
Am I doing anything wrong? Should I assign some other value instead of np.nan to the missing values so that isnull() can pick them up?
EDIT: Sorry, I forgot to mention that I also set parse_dates = [2] to parse that column. The column contains dates, with some rows missing; I would like the missing rows to be NaN.
EDIT: I just found out that the issue is really due to parse_dates. If the date column contains missing values, read_csv will not parse that column; instead, it reads the dates as strings and assigns the string 'nan' to the empty values.
In [21]: data = pd.read_csv('test.csv', parse_dates=[1])

In [22]: data
Out[22]:
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       nan  d
4      6  2013-3-1  d

In [23]: data.date[3]
Out[23]: 'nan'
pd.to_datetime does not work either:
In [12]: data
Out[12]:
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       nan  d
4      6  2013-3-1  d

In [13]: data.dtypes
Out[13]:
value     int64
date     object
id       object

In [14]: pd.to_datetime(data['date'])
Out[14]:
0    2013-3-1
1    2013-3-1
2    2013-3-1
3         nan
4    2013-3-1
Name: date
Is there a way to have read_csv parse_dates to work with columns that contain missing values? I.e. assign NaN to missing values and still parse the valid dates?
Answer 1:
This is currently a small bug in the parser; see https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in (this will populate the missing values with NaT, the Not-A-Time marker, the datetime equivalent of NaN). This works on 0.10.1:
In [22]: df
Out[22]:
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       NaN  d
4      6  2013-3-1  d

In [23]: df.dtypes
Out[23]:
value     int64
date     object
id       object
dtype: object

In [24]: pd.to_datetime(df['date'])
Out[24]:
0   2013-03-01 00:00:00
1   2013-03-01 00:00:00
2   2013-03-01 00:00:00
3                   NaT
4   2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]
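For reference, this workaround can be sketched end to end on a current pandas: write a frame with a missing date, read it back without parse_dates, then convert the column explicitly. The column values and in-memory buffer are invented for the example:

```python
import io

import numpy as np
import pandas as pd

# A frame with one missing date, round-tripped through CSV in memory.
df = pd.DataFrame({'value': [2, 3, 4],
                   'date': ['2013-03-01', np.nan, '2013-03-01']})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Read back WITHOUT parse_dates, then force-convert the column.
# The empty CSV field comes back as NaN, and to_datetime turns it into NaT.
out = pd.read_csv(buf)
out['date'] = pd.to_datetime(out['date'])

print(out['date'].isnull().tolist())  # the missing row is now detected
```

Because NaT is the datetime analogue of NaN, isnull() picks it up directly after the conversion.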
If the string 'nan' actually appears in your data, you can do this:
In [31]: s = pd.Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

In [32]: s
Out[32]:
0    2013-1-1
1    2013-1-1
2         nan
3    2013-1-1
dtype: object

In [39]: s[s == 'nan'] = np.nan

In [40]: s
Out[40]:
0    2013-1-1
1    2013-1-1
2         NaN
3    2013-1-1
dtype: object

In [41]: pd.to_datetime(s)
Out[41]:
0   2013-01-01 00:00:00
1   2013-01-01 00:00:00
2                   NaT
3   2013-01-01 00:00:00
dtype: datetime64[ns]
Answer 2:
You can pass na_values=["nan"] in your read_csv call. That makes read_csv treat the string 'nan' as a missing value and convert it to a proper np.nan.
See the pandas read_csv documentation for more info.
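A minimal sketch of that call, with inline CSV data and column names invented for the example; once the string is mapped to a real missing value, parse_dates can fill the gap with NaT:

```python
import io

import pandas as pd

csv_data = "value,date,id\n2,2013-03-01,a\n5,nan,d\n"

# Treat the literal string "nan" as missing so the date column still parses.
df = pd.read_csv(io.StringIO(csv_data),
                 na_values=["nan"],
                 parse_dates=["date"])

print(df["date"].isnull().tolist())  # the 'nan' row is now a real NaT
```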
Answer 3:
I got the same problem importing a CSV file with
dataframe1 = pd.read_csv(input_file, parse_dates=['date1', 'date2'])
where date1 contains valid dates while date2 is an empty column. dataframe1['date2'] ends up filled with the string 'nan' in every row.
The point is: after specifying the date columns and importing with read_csv, the empty date column is filled with the string 'nan' instead of NaN. The latter is recognized by numpy and pandas as null, while the former is not.
A simple solution is:
import numpy as np
dataframe.replace('nan', np.nan, inplace=True)
And then you should be good to go!
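As a runnable sketch of that fix (the series values are invented for the example), replacing the string and then re-parsing recovers proper NaT values:

```python
import numpy as np
import pandas as pd

s = pd.Series(['2013-01-01', 'nan', '2013-01-01'])

# Swap the literal string 'nan' for a real missing value, then parse.
s = s.replace('nan', np.nan)
parsed = pd.to_datetime(s)

print(parsed.isnull().tolist())  # the replaced value shows up as NaT
```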
Source: https://stackoverflow.com/questions/16157939/pandas-read-csv-fills-empty-values-with-string-nan-instead-of-parsing-date