Impute categorical missing values in scikit-learn

清歌不尽 2020-11-30 16:55

I've got pandas data with some columns of text type. There are some NaN values in these text columns. What I'm trying to do is to impute those NaNs by skle

10 answers
  • 2020-11-30 17:43

    Similar. Modify Imputer for strategy='most_frequent':

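    import pandas as pd
    from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; needs an older release

    class GeneralImputer(Imputer):
        def __init__(self, **kwargs):
            Imputer.__init__(self, **kwargs)

        def fit(self, X, y=None):
            if self.strategy == 'most_frequent':
                # mode() can return several rows on ties; keep the first row
                self.fills = pd.DataFrame(X).mode(axis=0).iloc[0]
                self.statistics_ = self.fills.values
                return self
            else:
                return Imputer.fit(self, X, y=y)

        def transform(self, X):
            if hasattr(self, 'fills'):
                # returns a string array, so numeric columns come back as text too
                return pd.DataFrame(X).fillna(self.fills).values.astype(str)
            else:
                return Imputer.transform(self, X)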
    

    where pandas.DataFrame.mode() finds the most frequent value for each column and then pandas.DataFrame.fillna() fills missing values with these. Other strategy values are still handled the same way by Imputer.
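
On scikit-learn 0.20 and later, `SimpleImputer` in `sklearn.impute` accepts text columns with `strategy='most_frequent'`, so the subclass may not be needed there. A minimal sketch, assuming a small made-up frame (the `color` and `size` columns are illustrative, not from the question):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical text-only data with missing entries, kept as plain object dtype
df = pd.DataFrame({
    'color': ['red', np.nan, 'blue', 'red'],
    'size': ['S', 'M', np.nan, 'M'],
}, dtype=object)

# most_frequent works on strings as well as numbers
imputer = SimpleImputer(strategy='most_frequent')

# fit_transform returns a NumPy array, so rebuild the frame around it
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(filled)
```

The learned fill values are available afterwards in `imputer.statistics_`.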

  • 2020-11-30 17:45

    This code fills in a series with the most frequent category:

    import pandas as pd
    import numpy as np
    
    # create fake data 
    m = pd.Series(list('abca'))
    m.iloc[1] = np.nan #artificially introduce nan
    
    print('m = ')
    print(m)
    
    #make dummy variables, count and sort descending:
    most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 
    
    def replace_most_common(x):
        if pd.isnull(x):
            return most_common
        else:
            return x
    
    new_m = m.map(replace_most_common) #apply function to original data
    
    print('new_m = ')
    print(new_m)
    

    Outputs:

    m =
    0      a
    1    NaN
    2      c
    3      a
    dtype: object
    
    new_m =
    0    a
    1    a
    2    c
    3    a
    dtype: object
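
The same result can probably be reached more directly with `Series.mode()`, which ignores NaN by default. A short sketch on the same toy series (on a tie, `mode()` returns several values and this simply takes the first):

```python
import numpy as np
import pandas as pd

m = pd.Series(['a', np.nan, 'c', 'a'])

# mode() skips NaN and returns the most frequent value(s) as a Series
most_common = m.mode().iloc[0]

# fill the gaps with that value
new_m = m.fillna(most_common)
print(new_m)
```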
    
  • 2020-11-30 17:45

    You could try the following:

    # idxmax() returns the most frequent label; argmax() returns a position in recent pandas
    replace = df['<yourcolumn>'].value_counts().idxmax()

    df['<yourcolumn>'] = df['<yourcolumn>'].fillna(replace)
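
One thing to watch with this approach: in pandas 1.0 and later, `Series.argmax()` returns an integer position, while `Series.idxmax()` returns the index label, which here is the category itself. A small sketch with a made-up `fruit` column to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
df = pd.DataFrame({'fruit': ['apple', 'pear', np.nan, 'pear', np.nan]})

counts = df['fruit'].value_counts()  # NaN is excluded, sorted by count descending

label = counts.idxmax()     # the category with the highest count
position = counts.argmax()  # its integer position in counts, not the category

# fill with the label, assigning back rather than using inplace=True
df['fruit'] = df['fruit'].fillna(label)
print(label, position)
```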
    
    
  • 2020-11-30 17:46
    • With sklearn.preprocessing.Imputer, strategy='most_frequent' works only on quantitative features, not on qualitative ones. The custom imputer below handles both. The scikit-learn imputer can be applied either to the whole data frame (if all features are quantitative) or in a for loop over a list of columns of the same type (see the example below), whereas the custom imputer can be used with any combination of columns.

          from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22

          impute = Imputer(strategy='mean')
          # xx is the data frame; both columns listed here are quantitative
          for col in ['quantitative_column', 'quant']:
              xx[col] = impute.fit_transform(xx[[col]]).ravel()
      
    • Custom Imputer :

         import numpy as np
         from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22
         from sklearn.base import TransformerMixin

         class CustomImputer(TransformerMixin):
             def __init__(self, cols=None, strategy='mean'):
                 self.cols = cols
                 self.strategy = strategy

             def transform(self, df):
                 X = df.copy()
                 impute = Imputer(strategy=self.strategy)
                 # default to every column when none were given
                 cols = list(X.columns) if self.cols is None else self.cols
                 for col in cols:
                     if X[col].dtype == np.dtype('O'):
                         # text column: fill with the most frequent value
                         X[col] = X[col].fillna(X[col].value_counts().index[0])
                     else:
                         # numeric column: fill with the chosen statistic
                         X[col] = impute.fit_transform(X[[col]]).ravel()
                 return X

             def fit(self, *_):
                 return self
      
    • Dataframe:

            import numpy as np
            import pandas as pd

            X = pd.DataFrame({
                'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
                'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
                'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                                   'somewhat like', 'dislike'],
                'quantitative_column': [1, 11, -.5, 10, np.nan, 20]})
      
      
                  city              boolean   ordinal_column  quantitative_column
              0   tokyo             yes       somewhat like   1.0
              1   NaN               no        like            11.0
              2   london            NaN       somewhat like   -0.5
              3   seattle           no        like            10.0
              4   san francisco     no        somewhat like   NaN
              5   tokyo             yes       dislike         20.0
      
    • 1) It can be used with a list of features of the same type.

       # strategy defaults to 'mean', but text columns are filled with their most frequent value
       cci = CustomImputer(cols=['city', 'boolean'])
       cci.fit_transform(X)
      
    • 2) It can be used with strategy='median'.

       sd = CustomImputer(['quantitative_column'], strategy = 'median')
       sd.fit_transform(X)
      
    • 3) It can be used with the whole data frame. Quantitative features are filled with the default mean (or with the median, if strategy='median' is passed), and qualitative features are filled with their most frequent value.

       call = CustomImputer()
       call.fit_transform(X)   
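
On current scikit-learn releases, where `Imputer` is no longer available, a similar per-type imputation can probably be sketched with `ColumnTransformer` and `SimpleImputer`. This is not the CustomImputer above, only a rough equivalent under stated assumptions: the `text_cols` and `num_cols` lists are illustrative names, and the frame is a trimmed copy of the example data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

X = pd.DataFrame({
    'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
    'quantitative_column': [1, 11, -.5, 10, np.nan, 20],
})

text_cols = ['city', 'boolean']       # qualitative features (assumed split)
num_cols = ['quantitative_column']    # quantitative features (assumed split)

# keep text columns as plain object dtype so missing entries stay np.nan
X = X.astype({col: object for col in text_cols})

imputer = ColumnTransformer([
    ('text', SimpleImputer(strategy='most_frequent'), text_cols),
    ('num', SimpleImputer(strategy='mean'), num_cols),
])

# output columns follow the transformer order: text columns first, then numeric
filled = pd.DataFrame(imputer.fit_transform(X),
                      columns=text_cols + num_cols, index=X.index)
print(filled)
```

Unlike the custom class, this learns the fill values in `fit`, so the same values can be reused on new data through `imputer.transform`.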
      