Question
I have the following main.py.
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import nltk
import string
import sys

# Tokenize everything read from stdin.
for token in nltk.word_tokenize(sys.stdin.read()):
    # Keep multi-character tokens and single characters that are not punctuation.
    if (len(token) == 1 and token not in string.punctuation) or len(token) > 1:
        print(token)
The output is the following.
./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
EGR1
-/-
mouse
embryonic
fibroblasts
I want to slightly change the tokenizer so that it recognizes EGR1(-/-) as one token (without any other changes). Does anybody know of a way to modify the tokenizer like this? Thanks.
Answer 1:
The default word_tokenize() function in NLTK uses the TreebankWordTokenizer, which is based on a sequence of regex substitutions.
More specifically, to add spaces around parentheses and brackets, the TreebankWordTokenizer uses these regex substitutions:
PARENS_BRACKETS = [
    (re.compile(r'[\]\[\(\)\{\}\<\>]'), r' \g<0> '),
    (re.compile(r'--'), r' -- '),
]

for regexp, substitution in self.PARENS_BRACKETS:
    text = regexp.sub(substitution, text)
For example:
import re

text = 'EGR1(-/-) mouse embryonic fibroblasts'

PARENS_BRACKETS = [
    (re.compile(r'[\]\[\(\)\{\}\<\>]'), r' \g<0> '),
    (re.compile(r'--'), r' -- '),
]

# Apply each substitution in turn, padding brackets and double dashes with spaces.
for regexp, substitution in PARENS_BRACKETS:
    text = regexp.sub(substitution, text)

print(text)
[out]:
EGR1 ( -/- )  mouse embryonic fibroblasts
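To connect this padded string with the output shown in the question, here is a minimal sketch. It assumes that the only remaining steps are a whitespace split and the question's punctuation filter; the real tokenizer applies further rules, and the helper name pad_and_split is hypothetical.

```python
import re
import string

# Same substitution rules as in the excerpt above.
PARENS_BRACKETS = [
    (re.compile(r'[\]\[\(\)\{\}\<\>]'), r' \g<0> '),
    (re.compile(r'--'), r' -- '),
]


def pad_and_split(text, rules=PARENS_BRACKETS):
    """Hypothetical helper: pad brackets with spaces, then split on whitespace."""
    for regexp, substitution in rules:
        text = regexp.sub(substitution, text)
    return text.split()


tokens = pad_and_split('EGR1(-/-) mouse embryonic fibroblasts')
print(tokens)
# ['EGR1', '(', '-/-', ')', 'mouse', 'embryonic', 'fibroblasts']

# The filter from main.py drops the single-character punctuation tokens.
kept = [t for t in tokens if len(t) > 1 or t not in string.punctuation]
print(kept)
# ['EGR1', '-/-', 'mouse', 'embryonic', 'fibroblasts']
```

Once the parentheses are padded, they become tokens of their own, and the script's filter then silently removes them, which is why only EGR1 and -/- remain.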
So, going back to "hacking" the NLTK word_tokenize() function, you can try something like this to cancel the effects of the PARENS_BRACKETS substitutions by emptying that list on a tokenizer instance:
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.PARENS_BRACKETS = []
>>> text = 'EGR1(-/-) mouse embryonic fibroblasts'
>>> tokenizer.tokenize(text)
['EGR1(-/-)', 'mouse', 'embryonic', 'fibroblasts']
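Emptying PARENS_BRACKETS turns off padding for every bracket, not only for EGR1(-/-). If other parentheses should still be split off, one possible approach is to protect only the genotype-style suffix. The following is a standalone sketch of that idea using just the re module; it is not wired into NLTK, and the pattern and the function name pad_brackets_except_genotypes are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a word immediately followed by a genotype marker such
# as (-/-), (+/-) or (+/+). The alternation tries this branch first, so the
# brackets inside such a marker never reach the plain-bracket branch.
PROTECT_OR_BRACKET = re.compile(r'(\w+\([-+]/[-+]\))|[\]\[(){}<>]')


def pad_brackets_except_genotypes(text):
    """Pad brackets with spaces, leaving tokens like EGR1(-/-) untouched."""
    def repl(match):
        if match.group(1):
            # Protected token: return it unchanged.
            return match.group(0)
        # Ordinary bracket: pad it so a later split() isolates it.
        return ' ' + match.group(0) + ' '
    return PROTECT_OR_BRACKET.sub(repl, text)


padded = pad_brackets_except_genotypes('EGR1(-/-) mouse (embryonic) fibroblasts')
print(padded.split())
# ['EGR1(-/-)', 'mouse', '(', 'embryonic', ')', 'fibroblasts']
```

The trade-off is that the protected pattern has to be listed explicitly, so notations other than the three markers above would still be split.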
Source: https://stackoverflow.com/questions/37108656/modify-nltk-word-tokenize-to-prevent-tokenization-of-parenthesis