Impute categorical missing values in scikit-learn

清歌不尽 2020-11-30 16:55

I've got pandas data with some columns of text type. There are some NaN values along with these text columns. What I'm trying to do is to impute those NaN's by skle

10 answers
  • 2020-11-30 17:26

    To use mean values for numeric columns and the most frequent value for non-numeric columns you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.

    import pandas as pd
    import numpy as np
    
    from sklearn.base import TransformerMixin
    
    class DataFrameImputer(TransformerMixin):
    
        def __init__(self):
            """Impute missing values.
    
            Columns of dtype object are imputed with the most frequent value 
            in column.
    
            Columns of other types are imputed with mean of column.
    
            """
        def fit(self, X, y=None):
    
            self.fill = pd.Series([X[c].value_counts().index[0]
                if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
                index=X.columns)
    
            return self
    
        def transform(self, X, y=None):
            return X.fillna(self.fill)
    
    data = [
        ['a', 1, 2],
        ['b', 1, 1],
        ['b', 2, 2],
        [np.nan, np.nan, np.nan]
    ]
    
    X = pd.DataFrame(data)
    xt = DataFrameImputer().fit_transform(X)
    
    print('before...')
    print(X)
    print('after...')
    print(xt)
    

    which prints,

    before...
         0   1   2
    0    a   1   2
    1    b   1   1
    2    b   2   2
    3  NaN NaN NaN
    after...
       0         1         2
    0  a  1.000000  2.000000
    1  b  1.000000  1.000000
    2  b  2.000000  2.000000
    3  b  1.333333  1.666667
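
    The idea of treating some numeric columns differently (for example, median for integer-like columns) could be sketched roughly as below. This is an illustration only, not part of the original answer: the class name `TypedDataFrameImputer`, the `median_columns` parameter, and the column names are all hypothetical. Columns are named explicitly because a plain integer column that contains NaN is stored by pandas as float, so the dtype alone cannot tell the two cases apart.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin


class TypedDataFrameImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: mode for non-numeric columns, median for the
    columns listed in `median_columns`, mean for all other numeric columns."""

    def __init__(self, median_columns=()):
        self.median_columns = median_columns

    def fit(self, X, y=None):
        fill = {}
        for c in X.columns:
            if not is_numeric_dtype(X[c]):
                # most frequent value (NaN is ignored by value_counts)
                fill[c] = X[c].value_counts().index[0]
            elif c in self.median_columns:
                fill[c] = X[c].median()
            else:
                fill[c] = X[c].mean()
        self.fill_ = fill
        return self

    def transform(self, X):
        # fillna with a dict fills each column with its own value
        return X.fillna(self.fill_)


X = pd.DataFrame({
    "city": ["a", "b", "b", np.nan],
    "rooms": [1, 1, 2, np.nan],       # integer-like, stored as float
    "price": [2.0, 1.0, 2.0, np.nan],
})
out = TypedDataFrameImputer(median_columns=["rooms"]).fit_transform(X)
print(out)
```

    Here the last row would receive the mode of `city`, the median of `rooms`, and the mean of `price`.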
    
  • 2020-11-30 17:27

    Inspired by the answers here, and for want of a go-to imputer for all use cases, I ended up writing this. It supports four imputation strategies (mean, mode, median, fill) and works on both pd.DataFrame and pd.Series.

    mean and median work only for numeric data; mode and fill work for both numeric and categorical data.

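    import pandas as pd
    from pandas.api.types import is_numeric_dtype
    from sklearn.base import BaseEstimator, TransformerMixin

    class CustomImputer(BaseEstimator, TransformerMixin):
        def __init__(self, strategy='mean', filler='NA'):
            self.strategy = strategy
            self.filler = filler

        def fit(self, X, y=None):
            # Default fill value; overwritten below by the chosen strategy
            self.fill = self.filler
            # Works on both pd.DataFrame and pd.Series
            dtypes = X.dtypes if isinstance(X, pd.DataFrame) else [X.dtype]
            if self.strategy in ['mean', 'median']:
                # mean and median are only defined for numeric data
                if not all(is_numeric_dtype(t) for t in dtypes):
                    raise ValueError('dtypes mismatch: numeric dtype is '
                                     'required for ' + self.strategy)
            if self.strategy == 'mean':
                self.fill = X.mean()
            elif self.strategy == 'median':
                self.fill = X.median()
            elif self.strategy == 'mode':
                self.fill = X.mode().iloc[0]
            elif self.strategy == 'fill':
                # A list of fillers is matched to the DataFrame columns in order
                if isinstance(self.fill, list) and isinstance(X, pd.DataFrame):
                    self.fill = dict(zip(X.columns, self.fill))
            return self

        def transform(self, X, y=None):
            return X.fillna(self.fill)
    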
    

    Usage:

    >> df   
        MasVnrArea  FireplaceQu
    Id  
    1   196.0   NaN
    974 196.0   NaN
    21  380.0   Gd
    5   350.0   TA
    651 NaN     Gd
    
    
    >> CustomImputer(strategy='mode').fit_transform(df)
    MasVnrArea  FireplaceQu
    Id      
    1   196.0   Gd
    974 196.0   Gd
    21  380.0   Gd
    5   350.0   TA
    651 196.0   Gd
    
    >> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
    MasVnrArea  FireplaceQu
    Id      
    1   196.0   NA
    974 196.0   NA
    21  380.0   Gd
    5   350.0   TA
    651 0.0     Gd 
    
  • 2020-11-30 17:31

    Using sklearn.impute.SimpleImputer instead of Imputer resolves this easily, because it can handle categorical variables.

    As per the Sklearn documentation: If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

    https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

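    from sklearn.impute import SimpleImputer

    impute_size = SimpleImputer(strategy="most_frequent")
    # fit_transform learns the most frequent value, then fills; ravel() flattens the (n, 1) result
    data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()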
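
    As a self-contained illustration of the "most_frequent" strategy on string columns, here is a minimal sketch. The DataFrame contents and the `Outlet_Type` column are made up for the example; only `Outlet_Size` appears in the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: two string columns, each with one missing value
df = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", np.nan, "Medium"],
    "Outlet_Type": ["Grocery", np.nan, "Grocery", "Supermarket"],
}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(imputer.fit_transform(df),
                      columns=df.columns, index=df.index)
print(filled)
```

    After fitting, `imputer.statistics_` holds the value used for each column, which is handy for checking what was actually imputed.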
    
  • 2020-11-30 17:34

    Copying and modifying sveitser's answer, I made an imputer for a pandas.Series object

    import numpy
    import pandas 
    
    from sklearn.base import TransformerMixin
    
    class SeriesImputer(TransformerMixin):
    
        def __init__(self):
            """Impute missing values.
    
            If the Series is of dtype Object, then impute with the most frequent object.
            If the Series is not of dtype Object, then impute with the mean.  
    
            """
        def fit(self, X, y=None):
            if X.dtype == numpy.dtype('O'):
                self.fill = X.value_counts().index[0]
            else:
                self.fill = X.mean()
            return self
    
        def transform(self, X, y=None):
            return X.fillna(self.fill)
    

    To use it you would do:

    # Make a series
    s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])
    
    
    a  = SeriesImputer()   # Initialize the imputer
    a.fit(s1)              # Fit the imputer
    s2 = a.transform(s1)   # Get a new series
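
    For a one-off fill where a reusable transformer is not needed, the same result for an object Series can probably be had with pandas alone. This is an alternative sketch, not the answer's class; it uses `Series.mode()` in place of `value_counts().index[0]`.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['k', 'i', 't', 't', 'e', np.nan], dtype=object)

# mode() ignores NaN by default; take the first mode if there are ties
most_frequent = s1.mode().iloc[0]
s2 = s1.fillna(most_frequent)
print(s2.tolist())
```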
    
  • 2020-11-30 17:39

    You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

    First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and the full_pipeline.fit_transform() takes a pandas DataFrame):

    from sklearn.base import BaseEstimator, TransformerMixin

    class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X[self.attribute_names].values
    

    You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:

    from sklearn.pipeline import FeatureUnion

    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline)
    ])
    

    Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer() (deprecated in scikit-learn 0.20 and removed in 0.22 in favor of sklearn.impute.SimpleImputer), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

    Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas
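
    With a current scikit-learn, the same numeric/categorical split can likely be sketched without a custom selector or sklearn_pandas, using ColumnTransformer and SimpleImputer. This is a swapped-in approach, not the one from the book or from this answer; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Illustrative data: one numeric and one string column with missing values
df = pd.DataFrame({
    "MasVnrArea": [196.0, 196.0, 380.0, 350.0, np.nan],
    "FireplaceQu": pd.Series([np.nan, np.nan, "Gd", "TA", "Gd"], dtype=object),
})

num_cols = ["MasVnrArea"]
cat_cols = ["FireplaceQu"]

# Each transformer only sees the columns listed for it
full_pipeline = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# The output is an array with columns in transformer order
result = pd.DataFrame(full_pipeline.fit_transform(df),
                      columns=num_cols + cat_cols, index=df.index)
print(result)
```

    Because the numeric and string blocks are stacked into one array, the resulting columns have object dtype and may need casting back before further numeric work.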

  • 2020-11-30 17:41

    The sklearn-pandas package has an option for imputing categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

    >>> import numpy as np
    >>> from sklearn_pandas import CategoricalImputer
    >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
    >>> imputer = CategoricalImputer()
    >>> imputer.fit_transform(data)
    array(['a', 'b', 'b', 'b'], dtype=object)
    