Question
I've been trying to figure out how to use these tuples as index keys in pandas, but I'm getting an error.
How can I apply the suggestion from the error message below, using pd.Categorical, to fix it?
I am aware that I can convert the keys to strings, but I am curious what the suggestion in the error message actually means.
This works perfectly fine when I run it with pandas 0.22.0. I've opened a GitHub issue for this if anyone wants to see the proper output from 0.22.0.
I want to update my pandas and handle this problem appropriately.
Running this with the current pandas 0.23.4:
import sys; sys.version
# '3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'
import pandas as pd; pd.__version__
# '0.23.4'
index = [(('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 1)), 
(('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 8))]
len(index)
# 40
pd.Index(index)
Traceback (most recent call last):
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/algorithms.py", line 635, in factorize
order = uniques.argsort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 451, in safe_sort
sorter = values.argsort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 345, in __init__
codes, categories = factorize(values, sort=True)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/util/_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/algorithms.py", line 643, in factorize
assume_unique=True)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 455, in safe_sort
ordered = sort_mixed(values)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 441, in sort_mixed
nums = np.sort(values[~str_pos])
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 847, in sort
a.sort(axis=axis, kind=kind, order=order)
TypeError: '<' not supported between instances of 'NoneType' and 'str'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 449, in __new__
data, names=name or kwargs.get('names'))
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1330, in from_tuples
return MultiIndex.from_arrays(arrays, sortorder=sortorder, names=names)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1274, in from_arrays
labels, levels = _factorize_from_iterables(arrays)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2543, in _factorize_from_iterables
return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2543, in <listcomp>
return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
cat = Categorical(values, ordered=True)
File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 351, in __init__
raise TypeError("'values' is not ordered, please "
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument
Answer 1:
The closest thing I can find to what you'd like to do is something like:
pd.DataFrame(index, dtype='category').set_index([0, 1, 2]).index
Which returns the following:
MultiIndex(levels=[[('criterion', 'entropy'), ('criterion', 'gini')], [('max_features', 'log2'), ('max_features', 'sqrt'), ('max_features', None), ('max_features', 0.382)], [('min_samples_leaf', 1), ('min_samples_leaf', 2), ('min_samples_leaf', 3), ('min_samples_leaf', 5), ('min_samples_leaf', 8)]],
labels=[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=[0, 1, 2])
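For reference, the literal form of what the error message asks for can be sketched as follows: build the categorical yourself and pass `categories`, so pandas does not have to sort the values to infer an order. This is a minimal sketch on a shortened, hypothetical sample of one level (`pairs`), not the full index from the question. The first-seen order used for `categories` is an assumption; any explicit order would do.

```python
import pandas as pd

# Shortened, hypothetical sample of one index level from the question.
# ('max_features', None) and ('max_features', 'log2') cannot be compared
# with '<', so pandas cannot infer a category order by sorting them.
pairs = [
    ('max_features', 'log2'),
    ('max_features', None),
    ('max_features', 0.382),
    ('max_features', 'log2'),
]

# Keep each tuple as a single element of a flat, object-dtype Index.
values = pd.Index(pairs, tupleize_cols=False)

# Explicit category order (assumption: first-seen order), also kept flat.
categories = pd.Index(list(dict.fromkeys(pairs)), tupleize_cols=False)

# With categories given explicitly, no sorting of the values is needed.
cat = pd.Categorical(values, categories=categories, ordered=True)
print(cat.codes)
```

Each code in `cat.codes` is the position of the corresponding value in `categories`, so the order is whatever you supplied rather than a sorted one.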
Answer 2:
I wish the error message were a little more informative. Thanks to the above answers I was able to figure out the issue. I ended up doing this, which is compatible with both versions:
pandas v0.23.4
>>> pd.__version__
'0.23.4'
>>> index_categorical = pd.Index([*map(frozenset, index)], dtype="category")
>>> dict(index_categorical[0])
{'criterion': 'gini', 'max_features': 'log2', 'min_samples_leaf': 1}
pandas v0.22.0
>>> pd.__version__
'0.22.0'
>>> index_categorical = pd.Index([*map(frozenset, index)], dtype="category")
>>> dict(index_categorical[0])
{'min_samples_leaf': 1, 'criterion': 'gini', 'max_features': 'log2'}
Answer 3:
I'm probably missing the point of exactly what you're trying to do, but you seem to have a nested tuple where the first part of each inner tuple is the column header. So I think the more obvious approach is to use (a,b,c) as the multi-index values and (x,y,z) as the multi-index names, rather than ((x,a),(y,b),(z,c)) as the simple index values.
And generally speaking, pandas is somewhat likely to get confused if you put a complex data type (tuple, nested tuple, array, etc.) into a single column (whether an index column or a regular column) rather than a simple data type (float, int, string, etc.). So 99.9% of the time (or maybe more!), you're better off not putting a nested tuple into a single index column. In any event, I'd do something like this for your specific example:
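# Level names are the first element of each (name, value) pair
names = [index[0][j][0] for j in range(3)]
# Level values are the second element of each pair
pd.DataFrame({'x': range(40)},
             pd.MultiIndex.from_tuples([(i[0][1], i[1][1], i[2][1]) for i in index],
                                       names=names))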
First 10 lines of the dataframe (as you can see, it has a 3-level MultiIndex rather than a simple index of tuples or strings):
                                         x
criterion max_features min_samples_leaf
gini      log2         1                 0
                       2                 1
                       3                 2
                       5                 3
                       8                 4
          sqrt         1                 5
                       2                 6
                       3                 7
                       5                 8
                       8                 9
FWIW, I get the same error as you if I try to use the whole tuple, instead of just the 2nd piece of each pair...
pd.DataFrame({'x': range(40)},
             pd.MultiIndex.from_tuples([(i[0], i[1], i[2]) for i in index],
                                       names=names))
I suppose pd.Index() automatically uses from_tuples() if the inputs are tuples (?). FWIW, I only did it that way because I'm used to doing it that way, not because I think it is a better way.
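The guess about pd.Index() is consistent with the traceback in the question, where Index.__new__ calls MultiIndex.from_tuples. Here is a minimal sketch of that behaviour, using small hypothetical samples (`pairs`, `nested`) rather than the full index. `tupleize_cols=False` is the pd.Index parameter that keeps tuples as plain elements instead of splitting them into levels:

```python
import pandas as pd

# Small hypothetical sample, not the full index from the question.
pairs = [('gini', 1), ('gini', 2), ('entropy', 1)]

# A list of tuples is turned into a MultiIndex by default ...
as_multi = pd.Index(pairs)
# ... unless tupleize_cols=False asks for a flat index of tuple objects.
as_flat = pd.Index(pairs, tupleize_cols=False)

print(type(as_multi).__name__, as_multi.nlevels)
print(type(as_flat).__name__, as_flat.nlevels)

# The flat form also accepts nested tuples whose values cannot be sorted
# against each other (None vs. str), because each tuple stays one element.
nested = [
    (('criterion', 'gini'), ('max_features', None)),
    (('criterion', 'gini'), ('max_features', 'log2')),
]
flat_nested = pd.Index(nested, tupleize_cols=False)
print(flat_nested[0])
```

The trade-off with the flat form is that every key is one opaque tuple, so you lose selection by level.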
Source: https://stackoverflow.com/questions/52504709/how-to-explicitly-specify-the-categories-order-by-passing-in-a-categories-argum