Count the 100 most frequent words from sentences in a Pandas DataFrame

清酒与你 2020-12-05 03:33

I have text reviews in one column of a Pandas DataFrame, and I want to count the N most frequent words together with their frequency counts (across the whole column, not within a single cell).

3 Answers
  • 2020-12-05 03:51
    from collections import Counter
    Counter(" ".join(df["text"]).split()).most_common(100)
    

    I'm pretty sure this would give you what you want (you might have to remove some non-word tokens from the counter result before calling most_common).
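    A minimal sketch of that cleanup step, assuming a hypothetical sample DataFrame in place of the asker's `df["text"]`: lowercase the joined text and keep only word-like tokens with a regex before handing them to `Counter`.

    ```python
    import re
    from collections import Counter

    import pandas as pd

    # Hypothetical stand-in for the asker's review column
    df = pd.DataFrame({"text": ["Great movie, great acting!", "The plot was great."]})

    # Lowercase, keep only alphabetic tokens (drops punctuation like "," and "!"),
    # then count and take the top N
    words = re.findall(r"[a-z']+", " ".join(df["text"]).lower())
    top = Counter(words).most_common(3)
    print(top)  # [('great', 3), ('movie', 1), ('acting', 1)]
    ```

    The regex does the "remove some non-words" step up front, so `most_common` already returns clean tokens.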

  • 2020-12-05 03:55

    I'm going to have to disagree with @Zero.

    For 91,000 strings (email addresses), I found collections.Counter(..).most_common(n) to be faster. However, series.value_counts may still be faster if there are over 500k words.

    %%timeit
    [i[0] for i in Counter(data_requester['requester'].values).most_common(5)]
    # 13 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    data_requester['requester'].value_counts().index[:5]
    # 22.2 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
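    The two approaches in the benchmark are interchangeable in output, not just in timing. A small sketch with a toy Series (standing in for the 91k email addresses, which are assumptions here) shows both produce the same top-n values:

    ```python
    from collections import Counter

    import pandas as pd

    # Toy stand-in for the 91k-row email-address series from the benchmark
    s = pd.Series(["a@x.com", "b@x.com", "a@x.com",
                   "c@x.com", "a@x.com", "b@x.com"])

    # Counter-based top 2, as in the first %%timeit cell
    top_counter = [v for v, _ in Counter(s.values).most_common(2)]

    # value_counts-based top 2, as in the second %%timeit cell
    top_value_counts = list(s.value_counts().index[:2])

    print(top_counter, top_value_counts)  # same values either way
    ```

    So the choice between them is purely a performance trade-off that depends on data size, as the timings above suggest.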
    
  • 2020-12-05 04:11

    Along with @Joran's solution, you could also use series.value_counts for large amounts of text/rows:

     pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
    

    From the benchmarks below, series.value_counts comes out roughly twice (2x) as fast as the Counter method, measured on a movie-reviews dataset of 3,000 rows totaling 400K characters and 70K words:

    In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
    10 loops, best of 3: 44.2 ms per loop
    
    In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
    10 loops, best of 3: 27.1 ms per loop
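    A more pandas-idiomatic variant of the same idea (a sketch, using a hypothetical two-row DataFrame): instead of joining everything into one big string, split each cell, flatten with `explode`, and count. This avoids building a single giant string when the column is large.

    ```python
    import pandas as pd

    # Hypothetical sample data in place of the real review column
    df = pd.DataFrame({"text": ["To be or not to be", "Be yourself"]})

    # Split each review into words, flatten to one word per row, then count;
    # slice [:100] on a real dataset to get the top 100
    counts = (
        df["text"]
        .str.lower()
        .str.split()
        .explode()
        .value_counts()
    )
    print(counts.head(2))  # "be" appears 3 times, "to" twice
    ```

    `Series.explode` requires pandas 0.25 or newer; on older versions the join-and-split approach shown above is the way to go.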
    