How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

半阙折子戏 2021-01-19 00:39

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, wikicorpus retrieves only the text. After search

2 Answers
  • 2021-01-19 01:03

    In gensim/utils.py you will find the method

    def save_as_line_sentence(corpus, filename):
        with smart_open(filename, mode='wb', encoding='utf8') as fout:
            for sentence in corpus:
                line = any2unicode(' '.join(sentence) + '\n')
                fout.write(line)
    

    that you can use to write the corpus to a text file. You can override it, or take it as an example and write your own version (for instance, breaking the line at each sentence-ending punctuation mark), like

    def save_sentence_each_line(corpus, filename):
        with utils.smart_open(filename, mode='wb', encoding='utf8') as fout:
            for sentence in corpus:
                line = utils.any2unicode(' '.join(sentence) + '\n')
                # break the text into one sentence per line
                # (note that this replace also drops the punctuation character itself)
                line = line.replace('. ', '\n').replace('!', '\n').replace('?', '\n') # <- !!
                fout.write(line)
    

    You can call it like this (how wiki and out_f might be set up is sketched below):

    save_sentence_each_line(wiki.get_texts(), out_f)
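
    Here wiki is assumed to be a WikiCorpus built from a Wikipedia dump and out_f an output path. A minimal sketch of that setup (the file names are placeholders):

    from gensim.corpora import WikiCorpus

    # placeholder dump path; dictionary={} skips the (slow) vocabulary scan,
    # which is not needed if you only want the text stream
    wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
    out_f = 'wiki_sentences.txt'   # placeholder output file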
    

    but you also need to override PAT_ALPHABETIC from utils, because that's where the punctuation gets stripped out:

    PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)
    

    You may then need to override utils.tokenize and utils.simple_tokenize in case you want to make further changes to the code.
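
    A minimal sketch of that monkey-patching approach, assuming utils.simple_tokenize still looks PAT_ALPHABETIC up at call time (the regex is the modified one from above):

    import re
    from gensim import utils

    # patched pattern: also accept '.', '!' and '?' inside tokens so they
    # survive tokenization
    utils.PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)

    def simple_tokenize_keep_punct(text):
        # same structure as gensim's utils.simple_tokenize, but using the
        # patched pattern above
        for match in utils.PAT_ALPHABETIC.finditer(text):
            yield match.group()

    # utils.tokenize delegates to simple_tokenize, so the patch propagates
    utils.simple_tokenize = simple_tokenize_keep_punct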

  • 2021-01-19 01:26

    The problem lies in the tokenize function you defined:

    def tokenize(content):
        return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
                if len(token) <= 15 and not token.startswith('_')]
    

    The function utils.tokenize(content, lower=True, errors='ignore') simply tokenizes the article into tokens. However, the implementation of this function in .../site-packages/gensim/utils.py ignores the punctuation.

    For example, when you call utils.tokenize("I love eating banana, apple") it yields ["I", "love", "eating", "banana", "apple"], so the comma is gone.
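
    You can see the difference with a quick check:

    from gensim import utils

    list(utils.tokenize("I love eating banana, apple"))
    # -> ['I', 'love', 'eating', 'banana', 'apple']    (the comma is gone)

    "I love eating banana, apple".split()
    # -> ['I', 'love', 'eating', 'banana,', 'apple']   ('banana,' keeps its comma)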

    Anyway, you can define your own tokenize function as follows to retain the punctuation.

    def tokenize(content):
        # override the original tokenize in wikicorpus.py: split on whitespace
        # so punctuation stays attached to its token
        return [token.encode('utf8') for token in content.split()
                if len(token) <= 15 and not token.startswith('_')]
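
    Depending on your gensim version you may not even need to edit wikicorpus.py: recent releases let WikiCorpus take a tokenizer_func callable, which is called as tokenizer_func(text, token_min_len, token_max_len, lower). A sketch under that assumption (the dump path is a placeholder; check the parameter against your installed version):

    from gensim.corpora import WikiCorpus

    def tokenize_keep_punct(content, token_min_len=2, token_max_len=15, lower=True):
        # split on whitespace so punctuation stays attached to its token,
        # keeping the same max-length and underscore filters as above
        if lower:
            content = content.lower()
        return [token for token in content.split()
                if token_min_len <= len(token) <= token_max_len and not token.startswith('_')]

    # placeholder dump path; dictionary={} skips the vocabulary scan
    wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                      dictionary={}, tokenizer_func=tokenize_keep_punct)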
    