pandas cut(): how to convert nans? Or to convert the output to non-categorical?

Question

I am using pandas.cut() on dataframe columns with nans. I need to run groupby on the output of pandas.cut(), so I need to convert nans to something else (in the output, not in the input data), otherwise groupby will stupidly and infuriatingly ignore them.

I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warnings or errors, but doesn't seem to work: the categories are not added and, indeed, fillna fails for this very reason. A minimal example is below.

Any ideas?

Or is there maybe an easy way to convert this categorical object to a non-categorical one? I have tried np.asarray() but with no luck: it becomes an array containing an Interval object.

import pandas as pd
import numpy as np

x=[np.nan,4,6]
intervals =[-np.inf,4,np.inf]
out_nolabels=pd.cut(x,intervals)
out_labels=pd.cut(x,intervals, labels=['<=4','>4'])
out_nolabels.add_categories(['missing'])
out_labels.add_categories(['missing'])

print(out_labels)
print(out_nolabels)

out_labels=out_labels.fillna('missing')
out_nolabels=out_nolabels.fillna('missing')

PS This is yet another question on how Pandas is the worst tool to handle missing data. It's like someone got together and thought: how can we make life harder for those who are stupid enough to analyse data with Python and Pandas? I know, let's remove nans from groupby, without even a warning!

Answer 1:

As the documentation says, out-of-bounds data is treated as NA in the resulting Categorical object. You can't use fillna with an arbitrary constant on categorical data, because the value you are filling with is not one of the categories:

Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Categorical object

You can't use x.fillna('missing') because 'missing' is not among the categories of x, but you can use x.fillna('>4') because '>4' is a category.

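We can use np.where here to overcome that:

# same example dataframe as further down; intervals as in the question
df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})
intervals = [-np.inf,4,np.inf]
x = pd.cut(df['id'],intervals, labels=['<=4','>4'])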

np.where(x.isnull(),'missing',x)
array(['<=4', '<=4', '<=4', '<=4', 'missing', 'missing'], dtype=object)
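As a self-contained sketch of the same np.where idea, applied to the data from the question rather than to df: the result is a plain NumPy array, not a Categorical, which also addresses the "convert to non-categorical" part of the question. The variable name plain is only illustrative.

```python
import numpy as np
import pandas as pd

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]
out = pd.cut(x, intervals, labels=['<=4', '>4'])

# Where the binned value is NA, substitute the string 'missing';
# elsewhere keep the label. The result is a NumPy array, not a Categorical.
plain = np.where(out.isna(), 'missing', out)
print(list(plain))
```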

Or call add_categories on the values and keep the returned object, i.e.:

x = pd.cut(df['id'],intervals, labels=['<=4','>4']).values.add_categories('missing')
x.fillna('missing')

[<=4, <=4, <=4, <=4, missing, missing]
Categories (3, object): [<=4 < >4 < missing]
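This also explains why add_categories() looked like a no-op in the question: it returns a new Categorical instead of modifying the existing one, so the result has to be assigned. A minimal sketch using the question's data (the names fixed and filled are only illustrative):

```python
import numpy as np
import pandas as pd

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]
out = pd.cut(x, intervals, labels=['<=4', '>4'])

# add_categories() returns a new Categorical; this result is discarded,
# so `out` still has only its two original categories.
out.add_categories(['missing'])
print(list(out.categories))

# Keeping the returned object preserves the new category,
# and fillna then accepts 'missing' as a fill value.
fixed = out.add_categories(['missing'])
filled = fixed.fillna('missing')
print(list(filled))
```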

If you want to group NaNs without changing the column itself, one way of doing it is to cast the grouping key to str, i.e. if you have a dataframe:

df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})

df.groupby(df.id.astype(str)).mean()

Output:

     id  value
id             
1.0  1.0    5.0
4.0  4.0    7.0
nan  NaN    4.5
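To tie this back to the original goal (grouping on the output of cut() without losing the NaN rows), here is a sketch that combines the binning, the extra category, and the groupby. It assumes the same example dataframe; observed=True is passed only so that the empty '>4' bin is left out of the result, and the names binned and means are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan],
                   'value': [4, 5, 6, 7, 8, 1]})
intervals = [-np.inf, 4, np.inf]

# Bin the column, then add a 'missing' category and use it for the NaN rows.
binned = pd.cut(df['id'], intervals, labels=['<=4', '>4'])
binned = binned.cat.add_categories('missing').fillna('missing')

# Group on the binned key; the former NaN rows now form their own group.
means = df.groupby(binned, observed=True)['value'].mean()
print(means)
```

With this data the '<=4' group averages the values 4, 5, 6 and 7, and the 'missing' group averages 8 and 1.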

Source: https://stackoverflow.com/questions/47053770/pandas-cut-how-to-convert-nans-or-to-convert-the-output-to-non-categorical

Tags

python

pandas

categorical-data