Original code:
# -*- coding: utf-8 -*-
import codecs
import jieba
from gensim import corpora

title = response.meta['title']
content = response.meta['content']
raw_documents = [title, content]
print raw_documents[0]
print raw_documents[1]

# Load the stop-word list once, stripping the trailing newline
# from each line (the original sliced ln[:-2], which breaks on
# files without \r\n line endings).
stopwords = []
with codecs.open('stop.txt', 'r', 'utf-8') as fp:
    for ln in fp:
        stopwords.append(ln.strip())

# Segment each document with jieba, then drop stop words and
# single-character tokens.
corpora_documents = []
for item_text in raw_documents:
    item_seg = list(jieba.cut(item_text))
    for word in item_seg:
        if word not in stopwords and len(word) > 1:
            print word
            corpora_documents.append(word)
print corpora_documents

# Build the dictionary and vector corpus
dictionary = corpora.Dictionary(corpora_documents)
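The segmentation-and-filtering step can be sketched without the jieba or Scrapy dependencies. This is a minimal stand-in, not the original code: a plain whitespace split replaces `jieba.cut`, and the sample texts and stop-word set are made up for illustration.

```python
def segment_and_filter(texts, stopwords):
    """Tokenize each text, dropping stop words and single-character tokens."""
    tokens = []
    for text in texts:
        for word in text.split():  # jieba.cut(text) in the real code
            if word not in stopwords and len(word) > 1:
                tokens.append(word)
    return tokens

docs = ['the quick brown fox', 'a quick test']
stops = {'the', 'a'}
print(segment_and_filter(docs, stops))
# → ['quick', 'brown', 'fox', 'quick', 'test']
```

Note that, exactly as in the original loop, this returns one flat list of tokens for all input texts combined, which is what later trips up `corpora.Dictionary`.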
The error:

Running this raises a TypeError from gensim, and the traceback clearly points at the last line. Translated into plain terms, the message tells us that `corpora.Dictionary` expects an iterable of documents, where each document is itself a list of tokens, not a flat list of strings.
So we change the last line to:
dictionary = corpora.Dictionary([corpora_documents])
In other words, `Dictionary` iterates over each document and then over each token inside it. Passed a flat list of strings, it treats every word as a "document" and splits it into single characters; wrapping `corpora_documents` in `[]` turns the flat token list into a one-document corpus, so whole words survive.
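The effect of the brackets can be seen with a small sketch that mimics the iteration `corpora.Dictionary` performs (this is a simplified model of its behavior, not gensim's actual implementation):

```python
def build_vocab(documents):
    """Mimic corpora.Dictionary: iterate over each document,
    then over each token inside it."""
    vocab = set()
    for doc in documents:
        for token in doc:
            vocab.add(token)
    return vocab

tokens = ['machine', 'learning']  # a flat list of tokens

# Passed directly, each string is treated as a "document"
# and iterated character by character:
print(sorted(build_vocab(tokens)))
# → ['a', 'c', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'r']

# Wrapped in a list, the token list becomes a one-document
# corpus and whole words survive:
print(sorted(build_vocab([tokens])))
# → ['learning', 'machine']
```

This is exactly why `corpora.Dictionary([corpora_documents])` works while `corpora.Dictionary(corpora_documents)` fails.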
Run it again and everything works!