NLP in Python: Obtain word names from SelectKBest after vectorizing

野的像风 2021-01-14 11:16

I can't seem to find an answer to my exact problem. Can anyone help?

A simplified description of my dataframe ("df"): It has 2 columns: one is a bunch of text ("

2 Answers
  • 2021-01-14 12:03

    I had a similar problem recently, but I was not restricted to using the 20 most relevant words. Instead, I could select the words whose chi score was higher than a set threshold. I will give you the method I used for this second task. The reason this is preferable to just taking the first n words according to their chi score is that those 20 words may have an extremely low score and thus contribute next to nothing to the classification task.

    Here is how I have done it for a binary classification task:

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import chi2

        THRESHOLD_CHI = 5  # or whatever you like. You may try with
        # for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
        # and measure the f1 scores

        X = df['text']
        y = df['labels']

        cv = CountVectorizer()
        cv_sparse_matrix = cv.fit_transform(X)

        # chi2 accepts sparse input, so the matrix does not need to be densified
        chi2_stat, pval = chi2(cv_sparse_matrix, y)

        # boolean mask with one entry per vocabulary term
        which_ones_to_keep = chi2_stat > THRESHOLD_CHI
        # keep only the columns whose chi score exceeds the threshold
        X_selected = cv_sparse_matrix[:, np.flatnonzero(which_ones_to_keep)]


    The result is a boolean mask, which_ones_to_keep, that is True where a term has a chi score higher than the threshold and False where it does not, plus a reduced matrix, X_selected, that contains only the columns of the kept terms. The same mask can be applied to the columns of a tfidf matrix built on the same vocabulary, and the reduced matrix can then be passed to the fit method of a classifier.

    The entries of which_ones_to_keep line up with the columns of the CountVectorizer output, so you can determine which terms were relevant for the given labels by indexing cv.get_feature_names_out() (scikit-learn 1.0 or later; older versions use .get_feature_names()) with the mask, or you can just forget about it and pass X_selected directly to a classifier.

  • 2021-01-14 12:09

    After figuring out what I really wanted to do (thanks Daniel) and doing more research, I found a couple of other ways to meet my objective.

    Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html

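        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import chi2

        vectorizer = CountVectorizer(lowercase=True, stop_words='english')
        X = vectorizer.fit_transform(df["Notes"])

        # chi2 returns (scores, p-values); keep only the scores
        chi2score = chi2(X, df['AboveAverage'])[0]

        # pair each word with its score, sort ascending, keep the 20 highest
        # (get_feature_names_out() needs scikit-learn >= 1.0;
        #  older versions use get_feature_names())
        wscores = zip(vectorizer.get_feature_names_out(), chi2score)
        wchi2 = sorted(wscores, key=lambda x: x[1])
        topchi2 = zip(*wchi2[-20:])
        show = list(topchi2)
        show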
    

    Way 2 - This is the way I used because it was the easiest for me to understand and produced a nice output listing the word, chi2 score, and p-value. Another thread on here: Sklearn Chi2 For Feature Selection

        import pandas as pd
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2

        vectorizer = CountVectorizer(lowercase=True, stop_words='english')
        X = vectorizer.fit_transform(df["Notes"])

        y = df['AboveAverage']

        # Select 10 features with highest chi-squared statistics
        chi2_selector = SelectKBest(chi2, k=10)
        chi2_selector.fit(X, y)

        # Look at scores returned from the selector for each feature
        # (get_feature_names_out() needs scikit-learn >= 1.0;
        #  older versions use get_feature_names())
        chi2_scores = pd.DataFrame(
            list(zip(vectorizer.get_feature_names_out(),
                     chi2_selector.scores_, chi2_selector.pvalues_)),
            columns=['ftr', 'score', 'pval'])
        chi2_scores
    