I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieves only the text. After searching:
In gensim/utils.py you will find the function
def save_as_line_sentence(corpus, filename):
    with smart_open(filename, mode='wb', encoding='utf8') as fout:
        for sentence in corpus:
            line = any2unicode(' '.join(sentence) + '\n')
            fout.write(line)
that you can use to write the corpus into a text file. You can override it, or take it as an example and write your own version of it (maybe you want to break the lines at each punctuation mark), like
def save_sentence_each_line(corpus, filename):
    with utils.smart_open(filename, mode='wb', encoding='utf8') as fout:
        for sentence in corpus:
            line = utils.any2unicode(' '.join(sentence) + '\n')
            line = line.replace('. ', '\n').replace('!', '\n').replace('?', '\n')  # <- !!
            fout.write(line)
you can call it like
save_sentence_each_line(wiki.get_texts(), out_f)
but you also need to override PAT_ALPHABETIC from utils, because that's where the punctuation gets deleted:
PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)
You may then need to override utils.tokenize and utils.simple_tokenize in case you want to make further changes to the code.
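Putting it together, a minimal sketch could look like the following (assuming a gensim version where utils.PAT_ALPHABETIC and WikiCorpus exist under these names; the monkey-patching of the pattern and the file names are my own illustration, not an official gensim API):

import re
from gensim import utils
from gensim.corpora.wikicorpus import WikiCorpus

# Patch the module-level pattern so utils.simple_tokenize keeps . ! ?
# (utils.tokenize looks the pattern up at call time, so this takes effect).
utils.PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)

# 'enwiki-latest-pages-articles.xml.bz2' is a placeholder dump path.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})

# save_sentence_each_line() as defined above
save_sentence_each_line(wiki.get_texts(), 'wiki_sentences.txt')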
The problem lies in the tokenize function you defined
def tokenize(content):
    return [token.encode('utf8') for token in utils.tokenize(content,
            lower=True, errors='ignore') if len(token) <= 15 and not
            token.startswith('_')]
The function utils.tokenize(content, lower=True, errors='ignore') simply tokenizes the article into a list of tokens. However, the implementation of this function in .../site-packages/gensim/utils.py ignores the punctuation.
For example, calling utils.tokenize("I love eating banana, apple") returns ["I", "love", "eating", "banana", "apple"]
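You can see the difference with a quick check (a sketch; the exact behaviour may vary slightly between gensim versions):

from gensim import utils

text = "I love eating banana, apple"
print(list(utils.tokenize(text)))  # ['I', 'love', 'eating', 'banana', 'apple'] -- comma dropped
print(text.split())                # ['I', 'love', 'eating', 'banana,', 'apple'] -- comma kept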
Anyway, you can define your own tokenize function as follows to retain punctuation.
def tokenize(content):
    # override the original tokenize() in wikicorpus.py
    return [token.encode('utf8') for token in content.split()
            if len(token) <= 15 and not token.startswith('_')]
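To make WikiCorpus actually use this function, one option is to monkey-patch the module-level tokenize in wikicorpus.py before building the corpus. This is a sketch rather than an official API: whether it takes effect depends on your gensim version (newer releases also accept a tokenizer_func argument on WikiCorpus, with a different signature), and the dump path is a placeholder.

from gensim.corpora import wikicorpus
from gensim.corpora.wikicorpus import WikiCorpus

def tokenize(content):
    # keep whitespace-separated tokens, so punctuation attached to words survives
    return [token.encode('utf8') for token in content.split()
            if len(token) <= 15 and not token.startswith('_')]

# process_article() in wikicorpus.py calls the module-level tokenize(),
# so replacing it makes WikiCorpus use the version above.
wikicorpus.tokenize = tokenize

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})  # placeholder path
for tokens in wiki.get_texts():
    print(tokens[:10])  # tokens now keep punctuation attached to words
    break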