Find top n terms with highest TF-IDF score per class

南笙 2021-01-23 05:59

Suppose I have a pandas dataframe with two columns that resembles the following:

    text                                label
0         


        
3 Answers
  • 2021-01-23 06:13
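    import pandas as pd

    # term_doc_mat: DataFrame of TF-IDF scores, one row per document and one column per term
    top_terms = pd.DataFrame(columns=range(1, 6))

    for i in term_doc_mat.index:
        # sort the document's scores in descending order and keep the names of the first 5 terms
        top_terms.loc[len(top_terms)] = term_doc_mat.loc[i].sort_values(ascending=False).iloc[:5].index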

    This will give you the top 5 terms for each document. Adjust as needed.

  • 2021-01-23 06:19

    Below is a piece of code I wrote more than three years ago for a similar purpose. I'm not sure it is the most efficient way of doing what you're trying to do, but as far as I remember, it worked for me.

    import operator

    import numpy as np

    # X: tf-idf matrix of the data points (sparse, as returned by the vectorizer)
    # y: targets (the data points' labels)
    # vectorizer: fitted TFIDF vectorizer created by sklearn
    # n: number of features that we want to list for each class
    # target_list: the list of all unique labels (for example, in my case I have two labels: 1 and -1 and target_list = [1, -1])
    # --------------------------------------------
    # mapping each column index to its term once, instead of scanning the vocabulary for every feature
    index_to_term = {col: term for term, col in vectorizer.vocabulary_.items()}

    # splitting X vectors based on target classes
    for label in target_list:
        # finding the indices of the rows (data points) for the current class
        indices = [i for i in range(X.shape[0]) if y[i] == label]

        # getting the rows of the current class from the tf-idf matrix and calculating the mean of each feature
        vectors = np.asarray(X[indices, :].mean(axis=0)).ravel()

        # creating a dictionary of feature (column) indices with their corresponding mean values
        current_dict = {i: vectors[i] for i in range(X.shape[1])}

        # sorting the dictionary based on values
        sorted_dict = sorted(current_dict.items(), key=operator.itemgetter(1), reverse=True)

        # printing the textual and numeric values of the top n features
        for index, (col, value) in enumerate(sorted_dict[:n], start=1):
            print(str(index) + "\t" + index_to_term[col] + "\t" + str(value))
    
  • 2021-01-23 06:31

    The following code will do the work (thanks to Mariia Havrylovych).

    Assume we have an input dataframe, df, aligned with your structure.

    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd
    
    # override scikit's tfidf-vectorizer in order to return dataframe with feature names as columns
    class DenseTfIdf(TfidfVectorizer):
    
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            for k, v in kwargs.items():
                setattr(self, k, v)
    
        def transform(self, x, y=None) -> pd.DataFrame:
            res = super().transform(x)
            # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
            df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out())
            return df
    
        def fit_transform(self, x, y=None) -> pd.DataFrame:
            # run sklearn's fit_transform
            res = super().fit_transform(x, y=y)
            # convert the returned sparse documents-terms matrix into a dataframe for further manipulation
            df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
            return df
    

    Usage:

    # assume texts are stored in column 'text' within a dataframe
    texts = df['text']
    df_docs_terms_corpus = DenseTfIdf(sublinear_tf=True,
                     max_df=0.5,
                     min_df=2,
                     encoding='ascii',
                     ngram_range=(1, 2),
                     lowercase=True,
                     max_features=1000,
                     stop_words='english'
                    ).fit_transform(texts)
    
    
    # Need to keep the indexes of the original dataframe and the resulting documents-terms dataframe aligned
    df_class = df[df["label"] == "Class XX"]
    # select by index label (.loc): fit_transform reuses the index of texts, so .iloc is only equivalent for a default 0..n-1 index
    df_docs_terms_class = df_docs_terms_corpus.loc[df_class.index]
    # sum by columns and get the top n keywords
    df_docs_terms_class.sum(axis=0).nlargest(n=50)
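
The alignment comment above matters once the dataframe's index is no longer the default 0..n-1, for example after filtering or shuffling: label-based `.loc` and position-based `.iloc` then select different rows, or `.iloc` fails outright. A small sketch with made-up scores (the frames `df` and `scores` below are hypothetical stand-ins, not real TF-IDF output):

```python
import pandas as pd

# Toy stand-ins with a non-default index, invented for illustration only.
df = pd.DataFrame({"label": ["a", "b", "a"]}, index=[10, 20, 30])
scores = pd.DataFrame(
    {"apple": [0.9, 0.0, 0.4], "dog": [0.0, 0.8, 0.1]},
    index=df.index,
)

class_index = df[df["label"] == "a"].index  # index labels 10 and 30

# .loc selects by index label, so it picks exactly the rows of class "a".
by_label = scores.loc[class_index]
top = by_label.sum(axis=0).nlargest(n=1)

# .iloc selects by position; 10 and 30 are not valid positions in a 3-row frame.
try:
    scores.iloc[class_index]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(top.index.tolist(), iloc_failed)
```

With a default RangeIndex the two accessors happen to coincide, which is why the difference is easy to miss.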
    