Pandas read_csv fills empty values with string 'nan', instead of parsing date

Submitted by 霸气de小男生 on 2019-11-30 21:40:46

This is currently a minor bug in the parser; see https://github.com/pydata/pandas/issues/3062. An easy workaround is to convert the column explicitly after you read it in. The conversion replaces the NaN values with NaT, the Not-a-Time marker, which is the datetime equivalent of NaN. This should work on 0.10.1.

In [22]: df
Out[22]: 
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       NaN  d
4      6  2013-3-1  d

In [23]: df.dtypes
Out[23]: 
value     int64
date     object
id       object
dtype: object

In [24]: pd.to_datetime(df['date'])
Out[24]: 
0   2013-03-01 00:00:00
1   2013-03-01 00:00:00
2   2013-03-01 00:00:00
3                   NaT
4   2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]
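For anyone who wants to reproduce the session above end to end, here is a minimal sketch. The CSV text is a hypothetical stand-in for the real file (the original input is not shown), and assigning the result back to the column is my addition; the session above only displays the converted values.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the file on disk.
# The fourth data row has an empty date field.
csv_text = """value,date,id
2,2013-3-1,a
3,2013-3-1,b
4,2013-3-1,c
5,,d
6,2013-3-1,d
"""

# Read without date parsing, so the date column arrives as plain objects.
df = pd.read_csv(io.StringIO(csv_text))

# Convert after the fact; the missing entry becomes NaT.
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
print(df)
```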

If the string 'nan' actually appears in your data, you can do this:

In [31]: s = Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

In [32]: s
Out[32]: 
0    2013-1-1
1    2013-1-1
2         nan
3    2013-1-1
dtype: object

In [39]: s[s=='nan'] = np.nan

In [40]: s
Out[40]: 
0    2013-1-1
1    2013-1-1
2         NaN
3    2013-1-1
dtype: object

In [41]: pandas.to_datetime(s)
Out[41]: 
0   2013-01-01 00:00:00
1   2013-01-01 00:00:00
2                   NaT
3   2013-01-01 00:00:00
dtype: datetime64[ns]

You can pass na_values=["nan"] in your read_csv call. The parser then treats the string nan as a missing value and stores it as np.nan.

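A rough sketch of the na_values approach, assuming a small in-memory CSV in place of the real file (the data and the name csv_text are made up for illustration). Recent pandas versions may already treat the string nan as missing by default, so the explicit parameter mainly matters on versions or settings where it does not.

```python
import io

import pandas as pd

# Hypothetical CSV text; the second data row holds the literal string "nan".
csv_text = """value,date,id
2,2013-3-1,a
3,nan,b
4,2013-3-1,c
"""

# Ask the parser to treat the string "nan" as a missing value.
df = pd.read_csv(io.StringIO(csv_text), na_values=["nan"])

# The marker is now a real missing value, so it converts cleanly to NaT.
df["date"] = pd.to_datetime(df["date"])

print(df)
```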

I had the same problem when importing a CSV file using

dataframe1 = pd.read_csv(input_file, parse_dates=['date1', 'date2'])

where date1 contains valid dates and date2 is an empty column. After the import, dataframe1['date2'] is filled entirely with the string 'nan'.

In other words, when the date columns are named in the read_csv call, an empty date column is filled with the string 'nan' instead of NaN.

NumPy and pandas recognize NaN as a missing value, but they do not recognize the string 'nan' as one.

A simple solution is:

from numpy import nan
# Replace the string 'nan' with a real NaN so pandas treats it as missing
dataframe1.replace('nan', nan, inplace=True)

And then you should be good to go!
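Here is a small self-contained sketch of that replacement, followed by a date conversion. The frame is built by hand to mimic the situation described (the real input_file is not available), and it uses a plain assignment rather than inplace=True, which is only a style choice on my part.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the imported frame: date2 holds only the string 'nan'.
dataframe1 = pd.DataFrame(
    {
        "date1": ["2013-3-1", "2013-3-2"],
        "date2": ["nan", "nan"],
    }
)

# Swap the string 'nan' for a real NaN everywhere in the frame.
cleaned = dataframe1.replace("nan", np.nan)

# With real missing values in place, the column converts to datetimes (all NaT here).
cleaned["date2"] = pd.to_datetime(cleaned["date2"])

print(cleaned)
```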
