Retain feature names after Scikit Feature Selection

后端未结

关注

 5  1404

After running a Variance Threshold from Scikit-Learn on a set of data, it removes a couple of features. I feel I\'m doing something simple yet stupid, but I\'d like to retain th

相关标签:

5条回答

情书的邮戳

2021-02-07 13:51

Would something like this help? If you pass it a pandas dataframe, it will get the columns and use get_support like you mentioned to iterate over the columns list by their indices to pull out only the column headers that met the variance threshold.

>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0

>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0

0 讨论(0)

小蘑菇

2021-02-07 13:52

As I had some problems with the function by Jarad, I have mixed it up with the solution by pteehan, which I found is more reliable. I also added NA replacement as a standard as VarianceThreshold does not like NA values.

def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True) # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement)) # Fill NA values as VarianceThreshold cannot deal with those
    df2 = df.loc[:,selector.get_support(indices=False)] # Get new dataframe with columns deleted that have NA values

    return df2

0 讨论(0)

死守一世寂寞

2021-02-07 13:54
You can use Pandas for thresholding too
```
data_new = data.loc[:, data.std(axis=0) > 0.75]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

没有蜡笔的小新

2021-02-07 13:59

There's probably better ways to do this, but for those interested here's how I did:

def VarianceThreshold_selector(data):

    #Select Model
    selector = VarianceThreshold(0) #Defaults to 0.0, e.g. only remove features with the same value in all samples

    #Fit the Model
    selector.fit(data)
    features = selector.get_support(indices = True) #returns an array of integers corresponding to nonremoved features
    features = [column for column in data[features]] #Array of all nonremoved features' names

    #Format and Return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector

0 讨论(0)

你的背包

2021-02-07 14:04
I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.

However, you can subset the data a bit more cleanly like this:
```
data_transformed = data.loc[:, selector.get_support()]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...