Keep punctuation and casing in gensim wikicorpus text
Question

I have a wiki dump as an xml.bz2 file and want to convert it to plain text for later processing with BERT. The goal is to have each sentence on its own line and an empty line between articles (a requirement of BERT training).

I tried to follow the post "How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?" and did a lot of research of my own. This is what I have so far:

from __future__ import print_function
import sys
from gensim.corpora import WikiCorpus
from