Feature selection using logistic regression

隐瞒了意图╮ 2021-01-16 03:00

I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using logistic regression. If I test the model on held-out data, the accuracy is just a

1 Answer
  • 2021-01-16 03:38

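    sklearn's GridSearchCV is a pretty neat way to find the best-performing feature set: it tries every combination of the settings you give it, cross-validates each one, and keeps the best. For example, consider the following code: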

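    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # 'vect' and 'clf' are the step names used as prefixes in the grid below
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
        ('clf', LogisticRegression())
    ])

    # candidate values, keyed as '<step name>__<parameter name>'
    parameters = {
        'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
        'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
        'vect__use_idf': (True, False),
        'clf__C': (0.1, 1, 10, 20, 30)
    }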
    

    Here the parameters dict holds all of the different values that I want to try. Notice the key vect__max_df: max_df is an actual parameter of my vectorizer, which acts as my feature selector, and the vect__ prefix (the step name plus two underscores) tells the pipeline which step the parameter belongs to. So,

    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    

    actually specifies that I want to try out the above 5 values for my vectorizer's max_df. The same goes for the other keys. Notice how I have tied my vectorizer to the key 'vect' and my classifier to the key 'clf'. Can you see the pattern? Moving on:

    import re
    import pandas as pd
    from nltk.stem import WordNetLemmatizer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    traindf = pd.read_json('../../data/train.json')

    traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]

    # strip non-letters from each ingredient, lemmatize it, and join the ingredients into one string per recipe
    lemmatizer = WordNetLemmatizer()
    traindf['ingredients_string'] = [' '.join([lemmatizer.lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]

    # as_matrix() was removed from pandas; to_numpy() is the replacement
    X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    print('best score: %0.3f' % grid_search.best_score_)
    print('best parameters set:')

    bestParameters = grid_search.best_estimator_.get_params()

    for param_name in sorted(parameters.keys()):
        print('\t %s: %r' % (param_name, bestParameters[param_name]))

    # evaluate the refitted best pipeline on the held-out 30%
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Confusion Matrix:', confusion_matrix(y_test, predictions))
    print('Classification Report:', classification_report(y_test, predictions))
    

    Note that the bestParameters dict gives me the best-scoring combination out of all the options that I listed in parameters. (grid_search.best_params_ holds only the keys that were searched.)
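    The '<step>__<parameter>' naming described above can be checked without fitting anything. The following is only a sketch that reuses the pipeline and grid from this answer; the variable names unknown_keys, n_candidates and max_df_after are made up for the illustration. It confirms that every grid key is a parameter the pipeline exposes, and counts how many combinations the search has to fit per cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30),
}

# Every grid key should be a parameter name the pipeline actually exposes;
# a typo in a key would show up in this list.
valid_names = pipeline.get_params()
unknown_keys = [key for key in parameters if key not in valid_names]

# Number of parameter combinations in the grid: 5 * 6 * 2 * 5.
n_candidates = len(ParameterGrid(parameters))

# set_params uses the same '<step>__<parameter>' addressing as the grid.
pipeline.set_params(vect__max_df=0.5, clf__C=10)
max_df_after = pipeline.named_steps['vect'].max_df

print(unknown_keys, n_candidates, max_df_after)
```

    With 300 combinations, each fitted once per fold, the search can take a while on a large dataset, so it may be worth trimming the grid before running it.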

    Hope this helps.

    Edit: To get a list of features selected

    So once you have your best set of parameters, create the vectorizer and classifier with those parameter values:

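    vect = TfidfVectorizer(stop_words='english', sublinear_tf=True, max_df=bestParameters['vect__max_df'],
                           ngram_range=bestParameters['vect__ngram_range'], use_idf=bestParameters['vect__use_idf'])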
    

    Then you basically fit this vectorizer again. In doing so, the vectorizer builds its vocabulary from your training set, and settings such as max_df and ngram_range determine which terms end up in it as features.

    traindf = pd.read_json('../../data/train.json')

    traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]

    traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]

    # as_matrix() was removed from pandas; to_numpy() is the replacement
    X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

    # fit_transform learns the vocabulary from the training split and returns its TF-IDF matrix
    termDocMatrix = vect.fit_transform(X_train, y_train)
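    As a side illustration of how fitting the vectorizer decides which terms become features, here is a small sketch on a made-up four-document corpus (the documents and the names keep_all and drop_common are hypothetical, not from the data above). It fits the same vectorizer class with two max_df values and compares the resulting vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; 'salt' appears in all four documents,
# every other word appears in exactly one.
docs = [
    'salt pepper olive oil',
    'salt sugar butter flour',
    'salt soy sauce ginger',
    'salt cumin coriander turmeric',
]

# max_df=1.0 keeps terms no matter how many documents contain them.
keep_all = TfidfVectorizer(max_df=1.0).fit(docs)

# max_df=0.5 drops terms found in more than half of the documents.
drop_common = TfidfVectorizer(max_df=0.5).fit(docs)

vocab_all = set(keep_all.vocabulary_)
vocab_pruned = set(drop_common.vocabulary_)

# Terms that the stricter max_df removed from the feature set.
print(sorted(vocab_all - vocab_pruned))
```

    Lowering max_df only removes very common terms; it is the grid search that decides, by cross-validated accuracy, which threshold is worth keeping.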
    

    Now termDocMatrix has one column for each feature the vectorizer kept, and you can use the vectorizer to get the feature names. Let's say you want the top 100 features, and your metric for comparison is the chi-square score:

    from sklearn.feature_selection import SelectKBest, chi2
    getKbest = SelectKBest(chi2, k=100).fit(termDocMatrix, y_train)
    

    After that,

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    print(vect.get_feature_names_out()[getKbest.get_support()])
    

    should give you the top 100 features. Try this.
