Sklearn Chi2 For Feature Selection

岁酱吖の 提交于 2019-12-06 04:02:59

问题


I'm learning about chi2 for feature selection and came across code like this

However, my understanding of chi2 was that higher scores mean that the feature is more independent (and therefore less useful to the model) and so we would be interested in features with the lowest scores. However, using scikit learns SelectKBest, the selector returns the values with the highest chi2 scores. Is my understanding of using the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than a chi2 statistic?

See code below for what I mean (mostly copied from above link except for the end)

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# you can see that the kbest returned from SelectKBest 
#+ were the two features with the _highest_ score
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest

回答1:


Your understanding is reversed.

The null hypothesis for chi2 test is that "two categorical variables are independent". So a higher value of chi2 statistic means "two categorical variables are dependent" and MORE USEFUL for classification.

SelectKBest gives you the best two (k=2) features based on higher chi2 values. Thus you need to get those features that it gives, rather that getting the "other features" on the chi2 selector.

You are correct to get the chi2 statistic from chi2_selector.scores_ and the best features from chi2_selector.get_support(). It will give you 'petal length (cm)' and 'petal width (cm)' as top 2 features based on chi2 test of independence test. Hope it clarifies this algorithm.



来源:https://stackoverflow.com/questions/51695769/sklearn-chi2-for-feature-selection

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!