Convert array of string (category) to array of int from a pandas dataframe

孤城傲影 2020-12-14 11:25

I am trying to do something very similar to that previous question, but I get an error. I have a pandas DataFrame containing features and a label, and I need to do some conversion to se

4 answers
  • 2020-12-14 12:06

    The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.

    For a Series:

    In [1]: import pandas as pd
    In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
                           'touching', 'single', 'nuclei'])
    In [3]: s_enc = pd.factorize(s)
    In [4]: s_enc[0]
    Out[4]: array([0, 1, 2, 3, 1, 0, 2])
    In [5]: s_enc[1]
    Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')
    

    For a DataFrame:

    In [1]: import pandas as pd
    In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei', 
                           'dusts', 'touching', 'single', 'nuclei']})
    In [3]: catenc = pd.factorize(df['labels'])
    In [4]: catenc
    Out[4]: (array([0, 1, 2, 3, 1, 0, 2]), 
            Index([u'single', u'touching', u'nuclei', u'dusts'],
            dtype='object'))
    In [5]: df['labels_enc'] = catenc[0]
    In [6]: df
    Out[6]:
             labels  labels_enc
        0    single           0
        1  touching           1
        2    nuclei           2
        3     dusts           3
        4  touching           1
        5    single           0
        6    nuclei           2
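
Since the question mentions a DataFrame with several feature columns plus a label, the same idea can be repeated per column. The following is only a sketch: the `size` column and the `mappings` dictionary are made up for illustration and are not part of the original question. Keeping the `uniques` returned by `pd.factorize` makes it possible to decode the integers back to strings later.

```python
import pandas as pd

df = pd.DataFrame({
    "labels": ["single", "touching", "nuclei", "dusts",
               "touching", "single", "nuclei"],
    # Hypothetical second categorical column, added for illustration only.
    "size": ["small", "large", "small", "small",
             "large", "small", "large"],
})

# Encode each string column and remember its uniques for decoding.
mappings = {}
for col in ["labels", "size"]:
    codes, uniques = pd.factorize(df[col])
    df[col + "_enc"] = codes
    mappings[col] = uniques

# Decode: look up each integer code in the stored uniques.
decoded = mappings["labels"].take(df["labels_enc"].to_numpy())
print(df)
print(list(decoded))
```

Note that `pd.factorize` numbers the values in order of first appearance, so the codes are not sorted alphabetically.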
    
  • 2020-12-14 12:14

    If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):

    In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
    
    In [2]: s
    Out[2]: 
    0    single
    1    touching
    2    nuclei
    3    dusts
    4    touching
    5    single
    6    nuclei
    Name: None, Length: 7
    
    In [4]: Factor(s)
    Out[4]: 
    Factor:
    array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
    Levels (4): [dusts nuclei single touching]
    

    The factor has attributes labels and levels:

    In [7]: f = Factor(s)
    
    In [8]: f.labels
    Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
    
    In [9]: f.levels
    Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
    

    This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.
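
`Factor` is not available in current pandas releases. As far as I can tell, the closest equivalent today is `pd.Categorical`, whose `codes` and `categories` attributes play the roles of `labels` and `levels`. A minimal sketch under that assumption:

```python
import pandas as pd

s = pd.Series(["single", "touching", "nuclei", "dusts",
               "touching", "single", "nuclei"])

# Categories are inferred from the data and sorted.
cat = pd.Categorical(s)
codes = cat.codes            # integer code per element (was: labels)
categories = cat.categories  # Index of distinct values (was: levels)

# The same codes, obtained directly on the Series.
codes_series = s.astype("category").cat.codes

print(list(codes))
print(list(categories))
```

Unlike `pd.factorize`, the codes here follow the sorted order of the categories rather than the order of first appearance.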

    BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.

  • 2020-12-14 12:16

    I am answering the question for Pandas 0.10.1. Factor.from_array seems to do the trick.

    >>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
    >>> s
    0    a
    1    b
    2    a
    3    c
    4    a
    5    b
    6    a
    >>> f = pandas.Factor.from_array(s)
    >>> f
    Categorical: 
    array([a, b, a, c, a, b, a], dtype=object)
    Levels (3): Index([a, b, c], dtype=object)
    >>> f.labels
    array([0, 1, 0, 2, 0, 1, 0])
    >>> f.levels
    Index([a, b, c], dtype=object)
    
  • 2020-12-14 12:26

    Because none of these work for more than one dimension, I wrote some code that works for a NumPy array of any dimensionality:

    import numpy as np

    def encode_categorical(array):
        # np.unique returns the sorted distinct values; map each to its index.
        u = np.unique(array)
        d = {key: value for (key, value) in zip(u, np.arange(len(u)))}
        shape = array.shape
        array = array.ravel()
        new_array = np.zeros(array.shape, dtype=int)
        for i in range(len(array)):
            new_array[i] = d[array[i]]
        return new_array.reshape(shape)
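
The loop above can probably be avoided with `np.unique(..., return_inverse=True)`, which returns the index of each element within the sorted unique values. This is a sketch, not a drop-in replacement: the name `encode_categorical_vectorized` is hypothetical, and the explicit reshape is there because the shape of the returned inverse differs between NumPy versions.

```python
import numpy as np

def encode_categorical_vectorized(array):
    """Return (codes, uniques) for an array of any dimensionality."""
    array = np.asarray(array)
    # inverse[i] is the position of the i-th element in the sorted uniques.
    uniques, inverse = np.unique(array, return_inverse=True)
    # Reshape explicitly so the result matches the input shape
    # regardless of how the installed NumPy shapes the inverse.
    return inverse.reshape(array.shape), uniques

labels = np.array([["single", "touching"],
                   ["nuclei", "single"]])
codes, uniques = encode_categorical_vectorized(labels)
print(codes)
print(uniques)
```

Indexing `uniques[codes]` gives back the original strings, so the mapping can be reversed without a separate dictionary.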
    