Retain feature names after Scikit Feature Selection

感情败类 2021-02-07 13:16

After running a Variance Threshold from Scikit-Learn on a set of data, it removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain th

5 Answers
  • 2021-02-07 13:51

    Would something like this help? If you pass it a pandas dataframe, it will get the columns and use get_support like you mentioned to iterate over the columns list by their indices to pull out only the column headers that met the variance threshold.

    >>> df
       Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
    0         0       3    1   22      1      0         0
    1         1       1    2   38      1      0         0
    2         1       3    2   26      0      0         0
    
    >>> from sklearn.feature_selection import VarianceThreshold
    >>> def variance_threshold_selector(data, threshold=0.5):
    ...     selector = VarianceThreshold(threshold)
    ...     selector.fit(data)
    ...     # get_support(indices=True) returns the integer positions of the kept columns
    ...     return data[data.columns[selector.get_support(indices=True)]]
    ...
    >>> variance_threshold_selector(df, 0.5)
       Pclass  Age
    0       3   22
    1       1   38
    2       3   26
    >>> variance_threshold_selector(df, 0.9)
       Age
    0   22
    1   38
    2   26
    >>> variance_threshold_selector(df, 0.1)
       Survived  Pclass  Sex  Age  SibSp
    0         0       3    1   22      1
    1         1       1    2   38      1
    2         1       3    2   26      0
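
    As a minimal sketch of what get_support returns, the snippet below rebuilds the three rows shown above and compares the boolean mask with the integer indices. The variable names (mask, indices, kept_by_mask, kept_by_index) are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Same three rows as the df printed above
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [1, 2, 2],
    "Age": [22, 38, 26],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Nonsense": [0, 0, 0],
})

selector = VarianceThreshold(threshold=0.5).fit(df)

mask = selector.get_support()                 # boolean array, one entry per input column
indices = selector.get_support(indices=True)  # integer positions of the kept columns

# Both forms select the same column names
kept_by_mask = list(df.columns[mask])
kept_by_index = list(df.columns[indices])
print(kept_by_mask, kept_by_index)
```

    Either form can be used to index df.columns, so the choice is mostly a matter of taste.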
    
  • 2021-02-07 13:52

    I had some problems with Jarad's function, so I combined it with pteehan's solution, which I found more reliable. I also added NA replacement by default, since VarianceThreshold does not like NA values.

    from sklearn.feature_selection import VarianceThreshold

    def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
        selector = VarianceThreshold(thresh)
        # Fit on an NA-filled copy; fillna returns a new frame, so df itself is untouched
        selector.fit(df.fillna(na_replacement))
        # Keep only the columns whose variance exceeds the threshold (original values, NAs included)
        return df.loc[:, selector.get_support(indices=False)]
    
  • 2021-02-07 13:54

    You can use pandas for thresholding too:

    data_new = data.loc[:, data.std(axis=0) > 0.75]
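
    Note that the line above thresholds the standard deviation, and pandas computes the sample statistic (ddof=1) by default, whereas VarianceThreshold compares the population variance against its threshold. A rough sketch of a pandas-only filter that matches VarianceThreshold, using a small made-up frame (the data and the 0.9 threshold are assumptions for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data for illustration only
data = pd.DataFrame({
    "a": [0, 1, 1],
    "b": [3, 1, 3],
    "c": [22, 38, 26],
})
threshold = 0.9

# pandas default (ddof=1) gives the sample variance, which is larger for small samples
default_kept = data.loc[:, data.var(axis=0) > threshold]

# ddof=0 gives the population variance, the quantity VarianceThreshold uses
pandas_kept = data.loc[:, data.var(axis=0, ddof=0) > threshold]

# Reference result from scikit-learn
selector = VarianceThreshold(threshold).fit(data)
sklearn_kept = data.loc[:, selector.get_support()]

print(list(default_kept.columns), list(pandas_kept.columns), list(sklearn_kept.columns))
```

    With only three rows the two conventions disagree on column b, which is why ddof matters when the goal is to reproduce the scikit-learn result.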
    
  • 2021-02-07 13:59

    There are probably better ways to do this, but for those interested, here's how I did it:

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    def VarianceThreshold_selector(data):

        # Select model
        selector = VarianceThreshold(0)  # Defaults to 0.0, i.e. only remove features with the same value in all samples

        # Fit the model
        selector.fit(data)
        kept = selector.get_support(indices=True)  # Integer positions of the retained features
        features = data.columns[kept]  # Names of the retained features

        # Format and return
        return pd.DataFrame(selector.transform(data), columns=features)
    
  • 2021-02-07 14:04

    I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.

    However, you can subset the data a bit more cleanly like this:

    data_transformed = data.loc[:, selector.get_support()]
    