How to find ngram frequency of a column in a pandas dataframe?

一生所求 2020-12-28 10:24

Below is the input pandas dataframe I have.

I want to find the frequency of unigrams and bigrams. A sample of the output I am expecting is shown below.


1 answer
  • 2020-12-28 11:02

    If your data looks like this:

    import pandas as pd
    df = pd.DataFrame([
        'must watch. Good acting',
        'average movie. Bad acting',
        'good movie. Good acting',
        'pathetic. Avoid',
        'avoid'], columns=['description'])
    

    You can use CountVectorizer from scikit-learn (the sklearn package):

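    from sklearn.feature_extraction.text import CountVectorizer

    # ngram_range=(1, 2) extracts both unigrams and bigrams
    word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(df['description'])
    # Add up the per-document rows to get one vector of corpus-wide counts
    frequencies = sum(sparse_matrix).toarray()[0]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])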
    

    Which gives you:

                    frequency
    acting          3
    average         1
    average movie   1
    avoid           2
    bad             1
    bad acting      1
    good            3
    good acting     2
    good movie      1
    movie           2
    movie bad       1
    movie good      1
    must            1
    must watch      1
    pathetic        1
    pathetic avoid  1
    watch           1
    watch good      1
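
To cross-check counts like these without scikit-learn, here is a minimal sketch using only the standard library. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters); the helper name `ngram_counts` and the regular expression are illustrative, not part of any library.

```python
import re
from collections import Counter

docs = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

# Assumption: approximate CountVectorizer's default tokenizer
# (lowercase text, keep runs of 2+ word characters, drop punctuation)
TOKEN_RE = re.compile(r'\b\w\w+\b')

def ngram_counts(documents, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max, one document at a time."""
    counts = Counter()
    for doc in documents:
        tokens = TOKEN_RE.findall(doc.lower())
        for n in range(n_min, n_max + 1):
            # Slide a window of n tokens; n-grams never span two documents
            for i in range(len(tokens) - n + 1):
                counts[' '.join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(docs)
for ngram, freq in sorted(counts.items()):
    print(f'{ngram:<16}{freq}')
```

Because punctuation is dropped before the window slides, a bigram such as "watch good" spans the sentence break in the first review, which is also what the vectorizer does by default.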
    

    EDIT

    fit only "trains" your vectorizer: it splits the documents of your corpus into words and builds a vocabulary from them. transform can then take a new document and produce a vector of counts based on that vocabulary.
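
The difference between fit and transform can be sketched as follows. The two-document corpus and the new review are made-up examples, and the feature names are recovered from the `vocabulary_` attribute so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['good movie', 'bad acting']  # hypothetical training corpus
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(corpus)  # builds the vocabulary only, returns no counts

# vocabulary_ maps each n-gram to its column index
vocabulary = vectorizer.vocabulary_
names = sorted(vocabulary, key=vocabulary.get)

# transform counts only n-grams that are already in the vocabulary;
# words never seen during fit ("great", "cast") are silently ignored
new_doc = ['good acting, good movie, great cast']
vector = vectorizer.transform(new_doc).toarray()[0]
counts = dict(zip(names, (int(v) for v in vector)))
print(counts)
```

Note that "good acting" is not counted either: it appears in the new document, but the bigram was never seen during fit.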

    Here the corpus you train on is also the corpus you want counts for, so you can do both steps at once with fit_transform. Because you have 5 documents, the result is a matrix with 5 rows, one count vector per document. You want a single vector for the whole corpus, so you have to sum the rows.
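
To make the shape of that matrix concrete, here is a small sketch on the same five reviews. It converts to a dense array purely for inspection, which is fine at this size but is an assumption that would not hold for a large corpus.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'], columns=['description'])

matrix = CountVectorizer(ngram_range=(1, 2)).fit_transform(df['description'])
dense = matrix.toarray()  # dense copy, for inspection only

n_docs, n_features = dense.shape   # one row per document, one column per n-gram
per_document = dense.sum(axis=1)   # number of n-grams counted in each document
global_counts = dense.sum(axis=0)  # rows collapsed into one corpus-wide vector

print(n_docs, n_features)
print(per_document)
print(global_counts)
```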

    EDIT 2

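    For big dataframes, you can speed up the frequency computation by reading the stored values of the summed sparse matrix directly instead of converting it to a dense array. This relies on every vocabulary term occurring at least once, so that no count is zero: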

    frequencies = sum(sparse_matrix).data
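
Another option, offered here as a sketch rather than as the method above, is to let SciPy sum the columns with `sum(axis=0)` instead of adding rows one by one with Python's built-in `sum`. The variable names are illustrative, and the feature names are taken from `vocabulary_` to stay independent of the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

descriptions = pd.Series([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'])

vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(descriptions)

# Column sums are computed inside SciPy; ravel() flattens the result to 1-D
frequencies = np.asarray(matrix.sum(axis=0)).ravel()

# vocabulary_ maps n-gram -> column index, so sorting by index gives the
# feature names in the same order as the matrix columns
names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

result = (pd.Series(frequencies, index=names, name='frequency')
          .sort_values(ascending=False))
print(result.head())
```

Sorting by frequency at the end is optional; it is included because the most common n-grams are usually what one wants to look at first.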
    