I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using Logistic Regression. If I test the model on held-out data, the accuracy is just a
sklearn's GridSearchCV can search over the parameters of every step in a pipeline, including the feature-extraction step, and tell you which combination scores best. For example, consider the following code:
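from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# the step names ('vect', 'clf') become the prefixes of the grid keys below
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30)
}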
Here the parameters dict holds all of the parameter values that I want to try. Notice the key vect__max_df. max_df is an actual parameter of my vectorizer, which is the step that extracts (and prunes) my features. So,
'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
actually specifies that I want to try out the above 5 values of max_df for my vectorizer. Similarly for the others. Notice how I have tied my vectorizer to the key 'vect' and my classifier to the key 'clf': each grid key is the step name, two underscores, then the parameter name. Moving on:
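If the naming pattern is not obvious yet, here is a minimal sketch (no data needed) that pokes at the same pipeline. It assumes a reasonably recent scikit-learn; the variable names valid_keys, all_keys_valid and n_candidates are just mine for illustration. It sets two parameters through their prefixed names, checks that every grid key is a real pipeline parameter, and counts how many combinations the grid contains.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

# '<step name>__<parameter>' addresses a parameter of a single pipeline step.
pipeline.set_params(vect__max_df=0.5, clf__C=10)
vect_max_df = pipeline.named_steps['vect'].max_df
clf_C = pipeline.named_steps['clf'].C
print(vect_max_df, clf_C)

parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30),
}

# Every usable grid key appears among the pipeline's own parameter names.
valid_keys = set(pipeline.get_params().keys())
all_keys_valid = set(parameters) <= valid_keys

# Number of combinations the grid contains: 5 * 6 * 2 * 5.
n_candidates = len(ParameterGrid(parameters))
print(all_keys_valid, n_candidates)
```

Keep the candidate count in mind: each combination is fitted once per cross-validation fold, so a grid this size on a large dataset can take a while.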
import re
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
# .as_matrix() was removed from pandas; .to_numpy() is its replacement
X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
# cross-validate every combination in parameters on the training split
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('best score: %0.3f' % grid_search.best_score_)
print('best parameters set:')
bestParameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t %s: %r' % (param_name, bestParameters[param_name]))
# evaluate the refitted best pipeline on the held-out split
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))
Note that the bestParameters dict gives me the best-scoring combination of parameters out of all the options that I specified in parameters.
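To see the difference between the searched parameters and the full parameter dict, here is a small sketch on a made-up toy corpus (the documents, labels and the reduced grid are invented purely for illustration, not taken from the dataset above). It shows that grid_search.best_params_ contains only the keys from the grid, while best_estimator_.get_params() returns every parameter of the pipeline with the winning values filled in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 4 documents per class, purely for illustration.
docs = [
    'basil tomato mozzarella', 'tomato pasta olive',
    'pasta basil parmesan', 'olive mozzarella parmesan',
    'soy ginger rice', 'rice sesame tofu',
    'ginger tofu scallion', 'soy sesame scallion',
]
labels = ['italian'] * 4 + ['asian'] * 4

pipeline = Pipeline([('vect', TfidfVectorizer()), ('clf', LogisticRegression())])
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__C': [0.1, 10]}

# cv=2 only because the toy corpus is tiny.
grid_search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
grid_search.fit(docs, labels)

# Only the searched keys, with their winning values.
best_params = grid_search.best_params_
# Every parameter of the refitted pipeline, searched or not.
all_params = grid_search.best_estimator_.get_params()

for name in sorted(best_params):
    print(name, best_params[name], all_params[name])
```

If all you want is the winning combination, best_params_ saves you the loop over sorted(parameters.keys()).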
Hope this helps.
Edit: To get a list of the features selected:
Once you have your best set of parameters, create the vectorizer (and classifier) with those parameter values:
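vect_params = {k.split('__', 1)[1]: v for k, v in grid_search.best_params_.items() if k.startswith('vect__')}  # strip the 'vect__' prefix
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True, **vect_params)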
Then you fit this vectorizer again. In doing so, the vectorizer learns its vocabulary from your training set, dropping stop words and any terms above max_df; what remains is your set of features.
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
termDocMatrix = vect.fit_transform(X_train, y_train)
Now termDocMatrix has one column per feature that the vectorizer kept, and you can ask the vectorizer for the feature names. Let's say you want to get the top 100 features and your metric for comparison is the chi-square score:
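from sklearn.feature_selection import SelectKBest, chi2
getKbest = SelectKBest(chi2, k=100).fit(termDocMatrix, y_train)  # must be fitted before get_support() works
Now just
print(vect.get_feature_names_out()[getKbest.get_support()])  # get_feature_names() before scikit-learn 1.0
should give you the names of the 100 features with the highest chi-square scores (listed in vocabulary order, not ranked). Try this.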