Python text processing: NLTK and pandas


I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.

I have some text data with a few other attributes.

1 Answer

    The benefit of using a pandas DataFrame here is that you can apply the nltk functionality to each row, like so:

    import numpy as np
    import pandas as pd
    import nltk
    from nltk.tokenize import word_tokenize  # needs nltk's punkt data: nltk.download('punkt')

    # build a toy corpus: 50 documents of 1000 random dictionary words each
    word_file = "/usr/share/dict/words"
    words = open(word_file).read().splitlines()[10:50]
    random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]

    df = pd.DataFrame(random_word_list, columns=['text'])
    df.head()
    
                                                    text
    0  Aaru Aaronic abandonable abandonedly abaction ...
    1  abampere abampere abacus aback abalone abactor...
    2  abaisance abalienate abandonedly abaff abacina...
    3  Ababdeh abalone abac abaiser abandonable abact...
    4  abandonable abandon aba abaiser abaft Abama ab...
    
    len(df)
    
    50
    
    txt = df.text.apply(word_tokenize)
    txt.head()
    
    0    [Aaru, Aaronic, abandonable, abandonedly, abac...
    1    [abampere, abampere, abacus, aback, abalone, a...
    2    [abaisance, abalienate, abandonedly, abaff, ab...
    3    [Ababdeh, abalone, abac, abaiser, abandonable,...
    4    [abandonable, abandon, aba, abaiser, abaft, Ab...
    
    txt.apply(len)
    
    0     1000
    1     1000
    2     1000
    3     1000
    4     1000
    ....
    44    1000
    45    1000
    46    1000
    47    1000
    48    1000
    49    1000
    Name: text, dtype: int64
    

    You can then get a .count() for each row's entry, for example the number of occurrences of 'abac':

    txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
    txt.head()
    
    0    27
    1    24
    2    17
    3    25
    4    32
    

    You can then sum the result using:

    txt.sum()
    
    1239
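
    If what you actually need is a full term-document matrix that can sit next to your other columns (rather than counts for a single term), a minimal sketch, assuming your extra attributes live in other columns of df, is to expand each row's token counts with collections.Counter and join the result back onto df (tokens, tdm and df_with_tdm are just illustrative names):

    from collections import Counter

    tokens = df.text.apply(word_tokenize)              # re-tokenize, since txt now holds counts
    tdm = pd.DataFrame([Counter(t) for t in tokens])   # one row per document, one column per term
    tdm = tdm.fillna(0).astype(int)                    # terms absent from a document become 0

    # joining on the index keeps the matrix aligned with the other columns in df
    df_with_tdm = df.join(tdm)

    tdm['abac'].sum()                                  # should match txt.sum() above

    This assumes no term collides with an existing column name in df; with the toy data above the only original column is 'text'.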
    