word_tokenize TypeError: expected string or buffer [closed]

Submitted by 孤街醉人 on 2019-12-24 12:42:35

Question


When calling word_tokenize I get the following error:

File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322,
    in _slices_from_text for match in
    self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
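The error is easy to reproduce outside NLTK: the traceback ends with punkt calling a compiled regex's finditer() on whatever you passed in, and Python's re module rejects anything that isn't a string or bytes-like object. A minimal sketch of the failure, using io.StringIO as a stand-in for the open file handle (no NLTK required):

```python
import io
import re

# punkt ultimately runs period_context_re().finditer(text); re's finditer()
# accepts only str/bytes, so a file-like object triggers the same TypeError.
pattern = re.compile(r"\S+")
file_like = io.StringIO("some raw text")  # stands in for the open file object

try:
    pattern.finditer(file_like)  # a file object instead of a string
except TypeError as e:
    print(e)  # e.g. "expected string or bytes-like object" on Python 3
```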

I have a large text file (1500.txt) from which I want to remove stop words. My code is as follows:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)

Answer 1:


The input for word_tokenize is a string, ideally one sentence at a time, e.g. 'this is sentence 1.' or "that's sentence 2!".

File_1500 is a file object, not a string, which is why the call fails.

To get sentence strings, first read the file into a single string with fin.read(), then use sent_tokenize to split it into sentences (assuming your input file is raw text, not already sentence-tokenized).

The more idiomatic way to tokenize a file with NLTK is:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)
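One caveat worth noting (an addition, not part of the original answer): NLTK's English stopword list is all lowercase, so the membership test above is case-sensitive and capitalized words like "The" slip through. A small self-contained sketch, with a hand-rolled stopword set and word list standing in for set(stopwords.words("english")) and word_tokenize(sent):

```python
# Tiny stand-in for set(stopwords.words("english")) -- the real list is
# also all lowercase, which is what makes the naive filter case-sensitive.
stop_words = {"the", "is", "a", "of"}

words = ["The", "cat", "is", "a", "mammal"]  # stand-in for word_tokenize(sent)

naive = [w for w in words if w not in stop_words]            # "The" survives
lowered = [w for w in words if w.lower() not in stop_words]  # "The" filtered

print(naive)    # ['The', 'cat', 'mammal']
print(lowered)  # ['cat', 'mammal']
```

Whether lowercasing before the comparison is right for you depends on whether you need to preserve case in the output; filtering on w.lower() while keeping the original token is a common compromise.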


Source: https://stackoverflow.com/questions/33773157/word-tokenize-typeerror-expected-string-or-buffer
