How to do discretization of continuous attributes in sklearn?

萌比男神i 2021-01-02 08:41

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (Consider it as data collected by a su

5 answers
  • 2021-01-02 08:57

    The answer is no. There is no binning in scikit-learn. As eickenberg said, you might want to use np.histogram. Features in scikit-learn are assumed to be continuous, not discrete. The main reason there is no binning is probably that most of sklearn was developed on text, image features, or datasets from the scientific community. In these settings, binning is rarely helpful. Do you know of a freely available dataset where binning is really beneficial?

  • 2021-01-02 09:06

    You may also consider rendering the categorical variables numerical, e.g. via indicator variables, a procedure also known as one-hot encoding.

    Try

    from sklearn.preprocessing import OneHotEncoder
    

    and fit it to your categorical data, followed by a numerical estimation method such as linear regression. As long as there aren't too many categories (city may be a little too much), this can work well.
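A minimal sketch of that idea, assuming a made-up single categorical column (`colors`) and a made-up numeric target (`y`); neither comes from the question's data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and a numeric target
colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
y = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 3.0])

# One indicator column per category; unseen categories are encoded as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(colors).toarray()  # dense array of shape (6, 3)

# Fit a numerical estimator on the encoded features
model = LinearRegression().fit(X, y)
pred = model.predict(encoder.transform([["green"]]).toarray())
print(encoder.categories_[0], pred)
```

With one indicator column per category, the model effectively learns one level per category, which is why a very large number of categories (such as city) can become unwieldy.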

    As for discretization of continuous variables, you may consider binning using an adapted bin size, or, equivalently, uniform binning after histogram normalization. numpy.histogram may be helpful here. Also, while Fayyad-Irani discretization isn't implemented in sklearn, feel free to check out sklearn.cluster for adaptive discretizations of your data (even if it is only 1D), e.g. via KMeans.
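The binning options mentioned here could look roughly like the sketch below. The sample values in `x` are invented, and relabelling the KMeans clusters by ascending centre is my own addition so that the bin indices come out ordered:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D feature with three visible groups
x = np.array([1.0, 1.2, 1.1, 5.0, 5.3, 4.9, 9.8, 10.0, 10.1])

# Equal-width bins: take the edges from np.histogram, then map values to bin indices
_, edges = np.histogram(x, bins=3)
uniform_bins = np.digitize(x, edges[1:-1])  # interior edges only -> indices 0..2

# Equal-frequency bins: use quantiles as the interior edges instead
q_edges = np.quantile(x, [1 / 3, 2 / 3])
quantile_bins = np.digitize(x, q_edges)

# Adaptive bins: cluster the 1-D values, then relabel clusters by ascending centre
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x.reshape(-1, 1))
order = np.argsort(km.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(order.size)
kmeans_bins = rank[km.labels_]

print(uniform_bins, quantile_bins, kmeans_bins)
```

On well-separated data like this all three approaches agree; they start to differ once the distribution is skewed.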

  • 2021-01-02 09:11

    Update (Sep 2018): As of version 0.20.0, there is a transformer class, sklearn.preprocessing.KBinsDiscretizer, which provides discretization of continuous features using a few different strategies:

    • Uniformly-sized bins
    • Bins with "equal" numbers of samples inside (as much as possible)
    • Bins based on K-means clustering
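As a rough illustration of how the three strategies differ, here is a small sketch on an invented, skewed feature (the values in `X` are hypothetical; recent scikit-learn releases may print a FutureWarning about changing defaults for the quantile strategy):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed 1-D feature, shaped (n_samples, 1) as sklearn expects
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 55.0, 60.0, 100.0]).reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    enc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy=strategy)
    # encode="ordinal" returns the bin index of each sample as a float column
    results[strategy] = enc.fit_transform(X).ravel().astype(int)
    print(strategy, results[strategy], enc.bin_edges_[0])
```

Here uniform and kmeans both put the five small values in one bin, while quantile splits the samples four and four.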

    Unfortunately, at the moment, the class does not accept custom intervals (which is a bummer for me, as that is what I wanted and the reason I ended up here). If you want to achieve the same, you can use the pandas function cut:

    import numpy as np
    import pandas as pd
    n_samples = 10
    a = np.random.randint(0, 10, n_samples)
    
    # say you want to split at 1 and 3
    boundaries = [1, 3]
    # add min and max values of your data
    boundaries = sorted({a.min(), a.max() + 1} | set(boundaries))
    
    a_discretized_1 = pd.cut(a, bins=boundaries, right=False)
    a_discretized_2 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False)
    a_discretized_3 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False).astype(float)
    print(a, '\n')
    print(a_discretized_1, '\n', a_discretized_1.dtype, '\n')
    print(a_discretized_2, '\n', a_discretized_2.dtype, '\n')
    print(a_discretized_3, '\n', a_discretized_3.dtype, '\n')
    

    which produces:

    [2 2 9 7 2 9 3 0 4 0]
    
    [[1, 3), [1, 3), [3, 10), [3, 10), [1, 3), [3, 10), [3, 10), [0, 1), [3, 10), [0, 1)]
    Categories (3, interval[int64]): [[0, 1) < [1, 3) < [3, 10)]
     category
    
    [1, 1, 2, 2, 1, 2, 2, 0, 2, 0]
    Categories (3, int64): [0 < 1 < 2]
     category
    
    [1. 1. 2. 2. 1. 2. 2. 0. 2. 0.]
     float64
    

    Note that, by default, pd.cut returns a categorical result (a pd.Categorical for array input, as here, or a pd.Series of dtype category for Series input) whose categories are of type interval[int64]. If you specify your own labels, the dtype of the output will still be category, but the categories will be of type int64. If you want a plain numeric dtype, you can use .astype(np.int64) or, as in the third call above, .astype(float).

    My example uses integer data, but it should work just as well with floats.
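For floats, a variation I would consider (an assumption on my part, not part of the example above) is to use infinite outer edges instead of the data's min and max, and to ask for integer codes directly with labels=False:

```python
import numpy as np
import pandas as pd

# Hypothetical float data
x = np.array([0.2, 0.9, 1.0, 2.5, 3.0, 7.4])

# Custom split points at 1 and 3; infinite outer edges so no value falls outside
edges = [-np.inf, 1.0, 3.0, np.inf]

# labels=False returns integer bin codes; right=False makes bins left-closed: [a, b)
codes = pd.cut(x, bins=edges, labels=False, right=False)

# np.digitize with only the interior split points gives the same left-closed codes
codes_np = np.digitize(x, [1.0, 3.0])

print(codes, codes_np)
```

Values equal to a split point (1.0 and 3.0 here) land in the bin that starts at that point because of right=False.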

  • 2021-01-02 09:16

    Thanks to the ideas above.

    To discretize continuous values, you can use:

    1. the pandas cut or qcut functions (the input array must be 1-dimensional)

    or

    2. sklearn's KBinsDiscretizer class (with parameter encode set to 'ordinal')

      • strategy='uniform' gives equal-width bins, in much the same manner as pd.cut with an integer number of bins
      • strategy='quantile' gives equal-frequency bins, in much the same manner as pd.qcut

    Since examples for cut/qcut are provided in previous answers, here is a clean example of KBinsDiscretizer:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    
    A = np.array([[24,0.2],[35,0.3],[74,0.4], [96,0.5],[2,0.6],[39,0.8]])
    print(A)
    # [[24.   0.2]
    #  [35.   0.3]
    #  [74.   0.4]
    #  [96.   0.5]
    #  [ 2.   0.6]
    #  [39.   0.8]]
    
    
    enc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    enc.fit(A)
    print(enc.transform(A))
    # [[0. 0.]
    #  [1. 0.]
    #  [2. 1.]
    #  [2. 1.]
    #  [0. 2.]
    #  [1. 2.]]
    

    As shown in the output, each feature has been discretized into 3 bins. Hope this helped :)
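To see where the bin boundaries actually ended up, the fitted transformer exposes a bin_edges_ attribute. A small sketch on invented data (the values in `X` are hypothetical and chosen so that the middle bin stays empty):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical single feature with a gap in the middle
X = np.array([[0.0], [1.0], [2.0], [7.0], [8.0], [9.0]])
enc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit(X)

edges = enc.bin_edges_[0]        # n_bins + 1 edges for the first feature
codes = enc.transform(X).ravel()  # bin index per sample

# Values outside the fitted range are assigned to the first or last bin
outside = enc.transform(np.array([[-5.0], [100.0]])).ravel()

print(edges, codes, outside)
```

Uniform bins are spaced evenly between the feature's min and max, so a bin can end up with no samples at all.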


    Final notes:

    • To compare cut versus qcut, see this post
    • KBinsDiscretizer(encode='onehot') one-hot encodes the bins of a continuous feature; for features that are already categorical, use OneHotEncoder instead
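For a frame that mixes both kinds of columns, one way to combine the two encoders is a ColumnTransformer. This is only a sketch: the column names `age` and `city` and their values are hypothetical, not taken from the question's CSV.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Hypothetical mixed-type frame; column names are illustrative only
df = pd.DataFrame({
    "age": [22.0, 35.0, 47.0, 58.0, 63.0, 29.0],
    "city": ["A", "B", "A", "C", "B", "A"],
})

pre = ColumnTransformer(
    [
        # continuous column: bin into 3 equal-width bins, then one-hot encode the bins
        ("bins", KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform"), ["age"]),
        # categorical column: plain one-hot encoding
        ("cats", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xt = pre.fit_transform(df)
print(Xt)
```

The result has three indicator columns for the age bins followed by three for the cities, so every row contains exactly two ones.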
  • 2021-01-02 09:21

    You could use the pandas.cut function, like this:

    bins = [0, 4, 10, 30, 45, 99999]
    labels = ['Very_Low_Fare', 'Low_Fare', 'Med_Fare', 'High_Fare','Very_High_Fare']
    train_orig.Fare[:10]
    Out[0]: 
    0     7.2500
    1    71.2833
    2     7.9250
    3    53.1000
    4     8.0500
    5     8.4583
    6    51.8625
    7    21.0750
    8    11.1333
    9    30.0708
    Name: Fare, dtype: float64
    
    pd.cut(train_orig.Fare, bins=bins, labels=labels)[:10]
    Out[50]: 
    0          Low_Fare
    1    Very_High_Fare
    2          Low_Fare
    3    Very_High_Fare
    4          Low_Fare
    5          Low_Fare
    6    Very_High_Fare
    7          Med_Fare
    8          Med_Fare
    9         High_Fare
    Name: Fare, dtype: category
    Categories (5, object): [High_Fare < Low_Fare < Med_Fare < Very_High_Fare < Very_Low_Fare]
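Since train_orig is not defined in the snippet, here is a self-contained version that reuses the ten Fare values printed above (the Series name and variable names are my own):

```python
import pandas as pd

# The ten Fare values shown in the output above
fares = pd.Series(
    [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    name="Fare",
)

bins = [0, 4, 10, 30, 45, 99999]
labels = ["Very_Low_Fare", "Low_Fare", "Med_Fare", "High_Fare", "Very_High_Fare"]

# Default right=True: each bin is (left, right], e.g. (4, 10] -> Low_Fare
fare_band = pd.cut(fares, bins=bins, labels=labels)
print(fare_band)

# A fare of exactly 0 is outside (0, 4] and becomes NaN unless include_lowest=True
zero_default = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)
zero_included = pd.cut(pd.Series([0.0]), bins=bins, labels=labels, include_lowest=True)
```

In the pandas versions I have used, the resulting categories are ordered as given in labels; the alphabetical ordering in the output above may come from an older release.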
    