Question
I assign np.nan to the missing values in a column of a DataFrame, then write the DataFrame to a CSV file with to_csv. The resulting CSV file correctly has nothing between the commas for the missing values when I open it in a text editor. But when I read the file back into a DataFrame with read_csv, the missing values become the string 'nan' instead of NaN, so isnull() does not detect them. For example:
In [13]: df
Out[13]:
   index  value date
0    975  25.35  nan
1    976  26.28  nan
2    977  26.24  nan
3    978  25.76  nan
4    979  26.08  nan

In [14]: df.date.isnull()
Out[14]:
0    False
1    False
2    False
3    False
4    False
Am I doing anything wrong? Should I assign some other value instead of np.nan to the missing values so that isnull() can pick them up?
EDIT: Sorry, I forgot to mention that I also set parse_dates = [2] to parse that column. The column contains dates, with some rows missing; I would like the missing rows to be NaN.
EDIT: I just found out that the issue is really due to parse_dates. If the date column contains missing values, read_csv will not parse that column; instead, it reads the dates as strings and assigns the string 'nan' to the empty values.
In [21]: data = pd.read_csv('test.csv', parse_dates=[1])

In [22]: data
Out[22]:
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       nan  d
4      6  2013-3-1  d

In [23]: data.date[3]
Out[23]: 'nan'
pd.to_datetime does not work either:
In [12]: data
Out[12]:
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       nan  d
4      6  2013-3-1  d

In [13]: data.dtypes
Out[13]:
value     int64
date     object
id       object

In [14]: pd.to_datetime(data['date'])
Out[14]:
0    2013-3-1
1    2013-3-1
2    2013-3-1
3         nan
4    2013-3-1
Name: date
Is there a way to have read_csv parse_dates to work with columns that contain missing values? I.e. assign NaN to missing values and still parse the valid dates?
Answer 1:
This is currently a small bug in the parser; see https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in (this will populate the missing values with NaT, the Not-A-Time marker, the datetime equivalent of NaN). This works on 0.10.1:
In [22]: df
Out[22]:
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       NaN  d
4      6  2013-3-1  d

In [23]: df.dtypes
Out[23]:
value     int64
date     object
id       object
dtype: object

In [24]: pd.to_datetime(df['date'])
Out[24]:
0   2013-03-01 00:00:00
1   2013-03-01 00:00:00
2   2013-03-01 00:00:00
3                   NaT
4   2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]
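For reference, this workaround can be sketched end to end on a current pandas: write a frame with a missing date, read it back without parse_dates, then convert the column explicitly. The column values and in-memory buffer are invented for the example:

```python
import io

import numpy as np
import pandas as pd

# A frame with one missing date, round-tripped through CSV in memory.
df = pd.DataFrame({'value': [2, 3, 4],
                   'date': ['2013-03-01', np.nan, '2013-03-01']})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Read back WITHOUT parse_dates, then force-convert the column.
# The empty CSV field comes back as NaN, and to_datetime turns it into NaT.
out = pd.read_csv(buf)
out['date'] = pd.to_datetime(out['date'])

print(out['date'].isnull().tolist())  # the missing row is now detected
```

Because NaT is the datetime analogue of NaN, isnull() picks it up directly after the conversion.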
If the string 'nan' actually appears in your data, you can do this:
In [31]: s = pd.Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

In [32]: s
Out[32]:
0    2013-1-1
1    2013-1-1
2         nan
3    2013-1-1
dtype: object

In [39]: s[s == 'nan'] = np.nan

In [40]: s
Out[40]:
0    2013-1-1
1    2013-1-1
2         NaN
3    2013-1-1
dtype: object

In [41]: pd.to_datetime(s)
Out[41]:
0   2013-01-01 00:00:00
1   2013-01-01 00:00:00
2                   NaT
3   2013-01-01 00:00:00
dtype: datetime64[ns]
Answer 2:
You can pass na_values=["nan"] in your read_csv call. That makes read_csv treat the string 'nan' as a missing value and convert it to a proper np.nan.
See the pandas read_csv documentation for more info.
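A minimal sketch of that call, with inline CSV data and column names invented for the example; once the string is mapped to a real missing value, parse_dates can fill the gap with NaT:

```python
import io

import pandas as pd

csv_data = "value,date,id\n2,2013-03-01,a\n5,nan,d\n"

# Treat the literal string "nan" as missing so the date column still parses.
df = pd.read_csv(io.StringIO(csv_data),
                 na_values=["nan"],
                 parse_dates=["date"])

print(df["date"].isnull().tolist())  # the 'nan' row is now a real NaT
```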
Answer 3:
I got the same problem importing a CSV file with
dataframe1 = pd.read_csv(input_file, parse_dates=['date1', 'date2'])
where date1 contains valid dates while date2 is an empty column. dataframe1['date2'] ends up filled with the string 'nan' in every row.
The point is: after specifying the date columns and importing with read_csv, the empty date column is filled with the string 'nan' instead of NaN. The latter is recognized by numpy and pandas as null, while the former is not.
A simple solution is:
import numpy as np
dataframe.replace('nan', np.nan, inplace=True)
And then you should be good to go!
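As a runnable sketch of that fix (the series values are invented for the example), replacing the string and then re-parsing recovers proper NaT values:

```python
import numpy as np
import pandas as pd

s = pd.Series(['2013-01-01', 'nan', '2013-01-01'])

# Swap the literal string 'nan' for a real missing value, then parse.
s = s.replace('nan', np.nan)
parsed = pd.to_datetime(s)

print(parsed.isnull().tolist())  # the replaced value shows up as NaT
```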
Source: https://stackoverflow.com/questions/16157939/pandas-read-csv-fills-empty-values-with-string-nan-instead-of-parsing-date