Pandas Categorical data type not behaving as expected

Question

I have the Pandas (version 0.15.2) dataframe below. I want to make the code column an ordered variable of type Categorical after the df creation as below.

import numpy as np
import pandas as pd
df = pd.DataFrame({'id' : range(1,9),
                    'code' : ['one', 'one', 'two', 'three',
                                'two', 'three', 'one', 'two'],
                    'amount' : np.random.randn(8)},  columns= ['id','code','amount'])

df.code = df.code.astype('category')
>> 0      one
>> 1      one
>> 2      two
>> 3    three
>> 4      two
>> 5    three
>> 6      one
>> 7      two
>> Name: code, dtype: category
>> Categories (3, object): [one < three < two]

So this works, but only partially: I cannot impose the order. Everything below is demonstrated on the documentation webpage, yet each call raises an error for me:

df.code = df.code.astype('category', categories=['one','two','three'], ordered=True)
>> error: astype() got an unexpected keyword argument 'categories'

Or even:

df.code.ordered
>> error: 'Series' object has no attribute 'ordered'
df.code.categories
>> error: 'Series' object has no attribute 'categories'

1) This is annoying. I cannot even get the categories (levels) of my Categorical variable. Am I doing something wrong, or is the web documentation out of date or inconsistent?

2) Also, do you know whether the type Categorical has a distance notion, i.e. does Pandas know that based on the ordering above, one is closer to two than three? I plan to use this for (dis)similarity calculation.

Answer 1:

Here's a short example with an ordered categorical variable and (to me) a surprising result from using rank() (as a sort of distance measure):

df = pd.DataFrame({ 'code':['one','two','three','one'], 'num':[1,2,3,1] }) 
df.code = df.code.astype('category', categories=['one','two','three'], ordered=True)

    code  num
0    one    1
1    two    2
2  three    3
3    one    1

df.sort('code')

    code  num
0    one    1
3    one    1
1    two    2
2  three    3

So sort() works as expected, in the order specified. But rank() doesn't do what I would have guessed: it ranks lexicographically and ignores the ordering of the categorical variable.

df.sort('code').rank()

   code  num
0   1.5  1.5
3   1.5  1.5
1   4.0  3.0
2   3.0  4.0
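In more recent pandas releases the calls used above have changed: astype() no longer accepts the categories= and ordered= keywords, and DataFrame.sort() was replaced by sort_values(). The following is a minimal sketch of the same example, assuming a recent pandas version; the variable names are only illustrative. It uses .cat.codes to read off each value's position in the declared order, which is one way to get an order-aware number without relying on rank().

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'code': ['one', 'two', 'three', 'one'], 'num': [1, 2, 3, 1]})

# Declare the order explicitly through a dtype object.
code_type = CategoricalDtype(categories=['one', 'two', 'three'], ordered=True)
df['code'] = df['code'].astype(code_type)

# sort_values() follows the declared category order, not the alphabet.
sorted_df = df.sort_values('code')

# .cat.codes is the position of each value in the category list
# (one=0, two=1, three=2 for the order declared above).
positions = df['code'].cat.codes
```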

All of which is perhaps a longer way of asking: maybe you just want an integer type? I mean, you could make up some kind of distance function here post-sorting, but ultimately that's going to be a lot more work than what you could do with a standard int or float (and possibly problematic if you look at how rank() handles an ordered categorical).
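If you do want a distance on top of the ordering, one possible sketch is to take the absolute difference of the category positions. This is not something pandas provides for categoricals; it rests on the assumption that adjacent levels are exactly one unit apart, which is a modelling choice rather than anything the dtype knows.

```python
import numpy as np
import pandas as pd

levels = ['one', 'two', 'three']
values = pd.Categorical(['one', 'two', 'three', 'one'],
                        categories=levels, ordered=True)

# Integer position of each value in `levels`; widen from int8 to a regular int.
codes = values.codes.astype(int)

# Assumption: equal spacing between levels, so distance = |position difference|.
# Broadcasting builds the full pairwise matrix.
distance = np.abs(codes[:, None] - codes[None, :])
```

Here distance[i, j] is the number of steps between the levels of rows i and j, so 'one' versus 'three' gives 2 and 'one' versus 'one' gives 0.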

edit to add: Part of the above may not work for pandas 0.15.2, but I believe you can still do this to specify order:

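# reorder_categories() changes the order only; assigning to .cat.categories would rename the labels
df['code'] = df['code'].cat.reorder_categories(['one','two','three'], ordered=True)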

What will happen in 0.15.2 (as I understand it) is that ordered will be True by default (but False in version 0.16.0), and the order will be lexicographical rather than as specified in the constructor. I'm not sure though, and I am working in 0.16.0, so you'll have to observe how your version behaves. Remember that Categorical is still fairly new...
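On the first part of the question: the categories and the ordered flag are not attributes of the Series itself; they are reached through the .cat accessor. The sketch below, assuming a pandas version that has the accessor, also contrasts reordering with renaming, since passing the same list to each does very different things.

```python
import pandas as pd

s = pd.Series(['one', 'one', 'two', 'three']).astype('category')

# Category metadata is reached through .cat, not directly on the Series.
default_categories = list(s.cat.categories)  # inferred categories, sorted lexicographically
default_ordered = s.cat.ordered              # default value depends on the pandas version

# reorder_categories() changes the order and leaves the values untouched.
reordered = s.cat.reorder_categories(['one', 'two', 'three'], ordered=True)

# rename_categories() relabels by position, so the same list silently
# swaps 'two' and 'three' in the data.
renamed = s.cat.rename_categories(['one', 'two', 'three'])
```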

Answer 2:

I don't think you can specify an order. pd.factorize appears to offer that option, but it is not implemented (see here).

Based on what you described, you want to encode the code variable as an ordinal variable rather than a categorical one; the two are slightly different.

If you can assume the difference between 'one' and 'two' is equal to that between 'two' and 'three', I guess you can just code them as ints (0, 1, 2, 3 ...).

If you use patsy, then there is a nice example for ordinal variables.
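A minimal sketch of that integer coding, using a plain dictionary with Series.map. The mapping name and the choice of 0, 1, 2 are illustrative assumptions; any equally spaced integers would encode the same ordering.

```python
import pandas as pd

codes = pd.Series(['one', 'one', 'two', 'three',
                   'two', 'three', 'one', 'two'])

# Hypothetical mapping; equal spacing between levels is an assumption.
level_to_int = {'one': 0, 'two': 1, 'three': 2}

# map() replaces each label with its integer; unmapped labels would become NaN.
code_as_int = codes.map(level_to_int)
```

With plain integers, differences such as code_as_int - 1 or comparisons such as code_as_int > 0 work directly.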

Source: https://stackoverflow.com/questions/29829427/pandas-categorical-data-type-not-behaving-as-expected

Tags

python

pandas

categorical-data

ordinal