How to “explicitly specify the categories order by passing in a categories argument” when using tuples as index keys in pandas?

Question

I've been trying to use these tuples as index keys in pandas, but I'm getting an error.

How can I apply the suggestion from the error message below, using pd.Categorical, to fix this error?

I am aware that I can convert the keys to strings, but I am curious what the suggestion in the error message actually means.

This works perfectly fine when I run it with 0.22.0. I've opened a GitHub issue for this if anyone wants to see the proper output from 0.22.0.

I want to update my pandas and handle this problem appropriately.

Running this with the current pandas 0.23.4:

import sys; sys.version
# '3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'
import pandas as pd; pd.__version__
# '0.23.4'
index = [(('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 1)), 
(('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 8))]
len(index)
# 40
pd.Index(index)
Traceback (most recent call last):
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/algorithms.py", line 635, in factorize
    order = uniques.argsort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 451, in safe_sort
    sorter = values.argsort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 345, in __init__
    codes, categories = factorize(values, sort=True)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/algorithms.py", line 643, in factorize
    assume_unique=True)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 455, in safe_sort
    ordered = sort_mixed(values)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 441, in sort_mixed
    nums = np.sort(values[~str_pos])
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 847, in sort
    a.sort(axis=axis, kind=kind, order=order)
TypeError: '<' not supported between instances of 'NoneType' and 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 449, in __new__
    data, names=name or kwargs.get('names'))
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1330, in from_tuples
    return MultiIndex.from_arrays(arrays, sortorder=sortorder, names=names)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1274, in from_arrays
    labels, levels = _factorize_from_iterables(arrays)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2543, in _factorize_from_iterables
    return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2543, in <listcomp>
    return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 351, in __init__
    raise TypeError("'values' is not ordered, please "
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument

Answer 1:

The closest thing I can find to what you'd like to do is something like: pd.DataFrame(index, dtype='category').set_index([0, 1, 2]).index

Which returns the following:

MultiIndex(levels=[[('criterion', 'entropy'), ('criterion', 'gini')], [('max_features', 'log2'), ('max_features', 'sqrt'), ('max_features', None), ('max_features', 0.382)], [('min_samples_leaf', 1), ('min_samples_leaf', 2), ('min_samples_leaf', 3), ('min_samples_leaf', 5), ('min_samples_leaf', 8)]],
       labels=[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
       names=[0, 1, 2])
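
For context on the error text itself: the `categories` argument it mentions is a parameter of pd.Categorical. The sketch below uses a shortened, hypothetical list of values (not the full index from the question) to show what passing an explicit order looks like. It illustrates the parameter only and is not a drop-in fix for pd.Index(index).

```python
import pandas as pd

# Hypothetical, shortened sample of the max_features values from the question.
values = ["log2", "sqrt", 0.382, "log2", None]

# Spell out the order instead of letting pandas sort the unique values.
# None is left out on purpose: categories cannot contain missing values.
order = ["log2", "sqrt", 0.382]
cat = pd.Categorical(values, categories=order, ordered=True)

# Each value is encoded as its position in `order`;
# anything not listed (here None) is encoded as -1, i.e. missing.
codes = cat.codes.tolist()
print(codes)
```

Because the order is given up front, pandas never has to compare a string with a float (or with None), which is the comparison that fails in the traceback.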

Answer 2:

I wish the error message were a little more informative. Thanks to the answers above I was able to figure out the issue. I ended up doing the following, which works in both versions:

pandas v0.23.4

>>> pd.__version__
'0.23.4'
>>> index_categorical = pd.Index([*map(frozenset, index)], dtype="category")
>>> dict(index_categorical[0])
{'criterion': 'gini', 'max_features': 'log2', 'min_samples_leaf': 1}

pandas v0.22.0

>>> pd.__version__
'0.22.0'
>>> index_categorical = pd.Index([*map(frozenset, index)], dtype="category")
>>> dict(index_categorical[0])
{'min_samples_leaf': 1, 'criterion': 'gini', 'max_features': 'log2'}
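
The two dict outputs above contain the same items in a different order. That is consistent with frozenset being an unordered container: it is hashable, so it can act as a single key, but it does not keep the order of the (name, value) pairs. A minimal, pandas-free sketch using the first key from the question:

```python
# First key from the question's index: a tuple of (name, value) pairs.
key = (("criterion", "gini"), ("max_features", "log2"), ("min_samples_leaf", 1))

# A frozenset is hashable, so it can serve as one opaque key,
# but it does not remember the order of the pairs.
as_set = frozenset(key)
same_set = frozenset(reversed(key))

# Turning the pairs back into a dict recovers the parameters.
params = dict(as_set)
print(as_set == same_set, params["criterion"])
```

If the order of the parameters matters for later processing, this approach would not preserve it.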

Answer 3:

I'm probably missing the point of exactly what you're trying to do, but you seem to have a nested tuple where the first part of each tuple is the column header. So I think the more obvious approach is to use (a,b,c) as the multi-index values and (x,y,z) as the multi-index names rather than ((x,a),(y,b),(z,c)) as the simple index values.

Generally speaking, pandas is somewhat likely to get confused if you put a complex data type (tuple, nested tuple, array, etc.) into a single column (whether an index column or a regular column) rather than a simple data type (float, int, string, etc.). So 99.9% of the time (or maybe more!), you're better off not putting something like a nested tuple into a single index column. In any event, I'd do something like this for your specific example:

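# Level names: the first element of each (name, value) pair in the first key.
names = [index[0][j][0] for j in range(3)]
# Level values: the second element of each pair; the DataFrame index is passed positionally.
pd.DataFrame({'x': range(40)},
    pd.MultiIndex.from_tuples([(i[0][1], i[1][1], i[2][1]) for i in index],
                              names=names))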

First 10 lines of the DataFrame (as you can see, it has a 3-level MultiIndex rather than a simple index of tuples or strings):

                                          x
criterion max_features min_samples_leaf    
gini      log2         1                  0
                       2                  1
                       3                  2
                       5                  3
                       8                  4
          sqrt         1                  5
                       2                  6
                       3                  7
                       5                  8
                       8                  9
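
One practical benefit of the three-level layout is label-based selection by level. The sketch below uses a reduced, hypothetical grid (two values per level rather than the 40 rows above) and only standard calls, `DataFrame.loc` and `DataFrame.xs`:

```python
import itertools

import pandas as pd

names = ["criterion", "max_features", "min_samples_leaf"]

# Hypothetical reduced grid: 2 x 2 x 2 = 8 parameter combinations.
grid = list(itertools.product(["entropy", "gini"], ["log2", "sqrt"], [1, 2]))
df = pd.DataFrame({"x": range(len(grid))},
                  index=pd.MultiIndex.from_tuples(grid, names=names))

# Full key: one row, one column, so a single value comes back.
one_value = df.loc[("gini", "sqrt", 2), "x"]

# Cross-section: every row whose max_features level equals "sqrt".
sqrt_rows = df.xs("sqrt", level="max_features")
print(one_value, len(sqrt_rows))
```

With the nested tuples stored as single keys, the same selections would need manual filtering over the keys.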

FWIW, I get the same error as you if I try to use the whole tuple, instead of just the 2nd piece of each pair...

pd.DataFrame({'x':range(40)},  
    pd.MultiIndex.from_tuples( [ (i[0], i[1], i[2])  for i in index ],
                               names = names ) )

I suppose pd.Index() automatically uses from_tuples() if the inputs are tuples (?). FWIW, I only did it that way because I'm used to doing it that way, not that I think it is a better way.
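
That guess agrees with the traceback in the question, where pd.Index hands off to MultiIndex.from_tuples. pd.Index also accepts a `tupleize_cols` parameter (default True) that controls this conversion. The sketch below is an assumption-laden illustration on two keys in the same shape as the question's index; it was not run against 0.22.0 or 0.23.4, so treat it as something to try rather than a confirmed fix.

```python
import pandas as pd

# Two keys in the same nested (name, value) shape as the question's index.
keys = [
    (("criterion", "gini"), ("max_features", None), ("min_samples_leaf", 1)),
    (("criterion", "gini"), ("max_features", "log2"), ("min_samples_leaf", 1)),
]

# tupleize_cols=False asks pd.Index not to build a MultiIndex from tuples,
# so each nested tuple is kept as one opaque object in a flat Index.
flat = pd.Index(keys, tupleize_cols=False)
print(type(flat).__name__, len(flat))
```

Since no MultiIndex levels are built, nothing has to be sorted, and the mixed None/str values inside the tuples are never compared.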

Source: https://stackoverflow.com/questions/52504709/how-to-explicitly-specify-the-categories-order-by-passing-in-a-categories-argum

Tags

python

pandas

indexing

tuples

multi-index