How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

半阙折子戏 2021-01-19 00:39

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, wikicorpus retrieves only the text. After search

2 Answers
  • 2021-01-19 01:03

    In gensim/utils.py you will find the method

    def save_as_line_sentence(corpus, filename):
        with smart_open(filename, mode='wb', encoding='utf8') as fout:
            for sentence in corpus:
                line = any2unicode(' '.join(sentence) + '\n')
                fout.write(line)
    

    that you can use to write the corpus to a text file. You can override it, or take it as an example and write your own version (for instance, breaking the line at each sentence-ending punctuation mark), like

    def save_sentence_each_line(corpus, filename):
        with utils.smart_open(filename, mode='wb', encoding='utf8') as fout:
            for sentence in corpus:
                line = utils.any2unicode(' '.join(sentence) + '\n')
                # break the text into one sentence per line
                # (note that this replace also drops the punctuation character itself)
                line = line.replace('. ', '\n').replace('!', '\n').replace('?', '\n') # <- !!
                fout.write(line)
    

    You can call it like this (how wiki and out_f might be set up is sketched below):

    save_sentence_each_line(wiki.get_texts(), out_f)
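
    Here wiki is assumed to be a WikiCorpus built from a Wikipedia dump and out_f an output path. A minimal sketch of that setup (the file names are placeholders):

    from gensim.corpora import WikiCorpus

    # placeholder dump path; dictionary={} skips the (slow) vocabulary scan,
    # which is not needed if you only want the text stream
    wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
    out_f = 'wiki_sentences.txt'   # placeholder output file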
    

    but you also need to override PAT_ALPHABETIC from utils, because that's where the punctuation gets stripped out:

    PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)
    

    You may then need to override utils.tokenize and utils.simple_tokenize in case you want to make further changes to the code.
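
    A minimal sketch of that monkey-patching approach, assuming utils.simple_tokenize still looks PAT_ALPHABETIC up at call time (the regex is the modified one from above):

    import re
    from gensim import utils

    # patched pattern: also accept '.', '!' and '?' inside tokens so they
    # survive tokenization
    utils.PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)

    def simple_tokenize_keep_punct(text):
        # same structure as gensim's utils.simple_tokenize, but using the
        # patched pattern above
        for match in utils.PAT_ALPHABETIC.finditer(text):
            yield match.group()

    # utils.tokenize delegates to simple_tokenize, so the patch propagates
    utils.simple_tokenize = simple_tokenize_keep_punct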

  • 2021-01-19 01:26

    The problem lies in the tokenize function you defined:

    def tokenize(content):
        return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
                if len(token) <= 15 and not token.startswith('_')]
    

    The function utils.tokenize(content, lower=True, errors='ignore') simply tokenizes the article into tokens. However, the implementation of this function in .../site-packages/gensim/utils.py ignores the punctuation.

    For example, when you call utils.tokenize("I love eating banana, apple") it yields ["I", "love", "eating", "banana", "apple"], so the comma is gone.
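
    You can see the difference with a quick check:

    from gensim import utils

    list(utils.tokenize("I love eating banana, apple"))
    # -> ['I', 'love', 'eating', 'banana', 'apple']    (the comma is gone)

    "I love eating banana, apple".split()
    # -> ['I', 'love', 'eating', 'banana,', 'apple']   ('banana,' keeps its comma)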

    Anyway, you can define your own tokenize function as follows to retain the punctuation.

    def tokenize(content):
        # override the original tokenize in wikicorpus.py: split on whitespace
        # so punctuation stays attached to its token
        return [token.encode('utf8') for token in content.split()
                if len(token) <= 15 and not token.startswith('_')]
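
    Depending on your gensim version you may not even need to edit wikicorpus.py: recent releases let WikiCorpus take a tokenizer_func callable, which is called as tokenizer_func(text, token_min_len, token_max_len, lower). A sketch under that assumption (the dump path is a placeholder; check the parameter against your installed version):

    from gensim.corpora import WikiCorpus

    def tokenize_keep_punct(content, token_min_len=2, token_max_len=15, lower=True):
        # split on whitespace so punctuation stays attached to its token,
        # keeping the same max-length and underscore filters as above
        if lower:
            content = content.lower()
        return [token for token in content.split()
                if token_min_len <= len(token) <= token_max_len and not token.startswith('_')]

    # placeholder dump path; dictionary={} skips the vocabulary scan
    wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                      dictionary={}, tokenizer_func=tokenize_keep_punct)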
    