NLP pitfalls: doc2bow expects an array of unicode tokens on input, not a single string

Original code:

        title = response.meta['title']
        #print title
        content = response.meta['content']
        #print content
        raw_documents = []
        raw_documents.append(title)
        raw_documents.append(content)
        #print raw_documents
        print raw_documents[0]
        print raw_documents[1]
        corpora_documents = []
        # Tokenize each document
        for item_text in raw_documents:
            item_seg = list(jieba.cut(item_text))
            #print item_seg

            '''Build the stop-word list'''
            #stopwords = {}.fromkeys(['。', '：', '，',' ','《','》','、',' ','（','）','“','”','；','\n'])
            buff = []
            with codecs.open('stop.txt') as fp:
                for ln in fp:
                    el = ln[:-2]
                    buff.append(el)
            stopwords = buff
            for word in item_seg:
                if word not in stopwords and len(word)>1:
                    print word
                    corpora_documents.append(word)
            print corpora_documents
        # Build the dictionary and the vector corpus
        dictionary = corpora.Dictionary(corpora_documents)

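Error:

        TypeError: doc2bow expects an array of unicode tokens on input, not a single string

The problem is clearly in the last line. The message says that each document handed to doc2bow must be a list of unicode tokens. Here corpora_documents is a flat list of words, so Dictionary treats every single word as if it were a whole document and passes a bare string to doc2bow.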

So we change the last line to:

        dictionary = corpora.Dictionary([corpora_documents])

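In other words, corpora.Dictionary needs a list of documents, each of which is itself a list of unicode tokens, not a flat list of strings. Wrapping corpora_documents in [ ] turns it into a list containing one token list, which is the shape Dictionary expects. Note that this treats the title and the content together as a single document.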
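If the title and the content should stay separate documents, one option is to collect one token list per input text. The sketch below is written for Python 3 and is only an illustration: the helper names tokens_per_document and build_dictionary are made up here, and whitespace splitting stands in for jieba.cut so that the first part runs without extra packages.

```python
def tokens_per_document(raw_documents, stopwords, tokenize):
    """Return one token list per document (a list of lists)."""
    token_lists = []
    for text in raw_documents:
        # Keep words that are not stop words and are longer than one character,
        # mirroring the filter in the original loop.
        tokens = [w for w in tokenize(text) if w not in stopwords and len(w) > 1]
        token_lists.append(tokens)
    return token_lists


def build_dictionary(token_lists):
    # Imported lazily so the helper above can be tried without gensim installed.
    from gensim import corpora
    # Dictionary receives a list of token lists, never a flat list of strings.
    return corpora.Dictionary(token_lists)


# Small demonstration; str.split is a stand-in (assumption) for jieba.cut.
docs = ["gensim doc2bow error", "a single string is not a document"]
token_lists = tokens_per_document(docs, {"is", "not"}, str.split)
print(token_lists)
```

In the original code this would correspond to passing jieba.cut as tokenize and calling build_dictionary on the result, so that the title and the content each become their own entry in the corpus.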

Run it again and it works!
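A side note on the stop-word loading in the original code: it reads stop.txt again for every document and trims each line with ln[:-2], which only gives the right word when every line ends in exactly two extra characters (for example "\r\n"). A more forgiving approach, sketched here in Python 3 under the assumption that stop.txt is UTF-8 with one word per line (the post does not say), is to read the file once and strip whitespace. The function name load_stopwords is hypothetical.

```python
def load_stopwords(path, encoding="utf-8"):
    """Read one stop word per line into a set, ignoring blank lines."""
    with open(path, encoding=encoding) as fp:
        # strip() removes "\n" and "\r\n" alike, so the line-ending style does not matter.
        return {line.strip() for line in fp if line.strip()}
```

A set also makes the "word not in stopwords" check faster than scanning a list, which can matter for long texts.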
