Question
Hey! I'm trying to add tokenizer exceptions for some tokens using spaCy 2.0.2. I know about .tokenizer.add_special_case(), which I'm already using for some cases, but a token like US$100 gets split into two tokens:
('US$', 'SYM'), ('100', 'NUM')
But I want it split into three tokens, and instead of adding a special case for each possible number after "US$", I'd like a single exception for every token of the form US$NUMBER:
('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM')
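At the moment I get that result one literal string at a time with add_special_case, roughly like this (a simplified sketch with a made-up example string):

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')  # stand-in for whatever pipeline is actually loaded
# One special case per exact string; this is what I'd like to avoid:
nlp.tokenizer.add_special_case('US$100', [{ORTH: 'US'}, {ORTH: '$'}, {ORTH: '100'}])
print([t.text for t in nlp('US$100')])   # ['US', '$', '100']

But that only matches the exact string 'US$100', so I'd need one call per possible amount.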
I was reading about TOKENIZER_EXCEPTIONS in the spaCy documentation, but I can't figure out how to do this.
I was trying to use from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS, and also spacy.util, which has an update_exc() method.
Can someone post a full code example on how to do it?
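For what it's worth, this is roughly as far as I got with that approach (a sketch, and I'm not sure it's even the right direction, since the exception keys seem to be exact strings rather than patterns):

from spacy.symbols import ORTH
from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.util import update_exc

# Merge an extra case into the built-in exceptions; the keys are literal
# strings, so this still can't express "US$ followed by any number":
exc = update_exc(TOKENIZER_EXCEPTIONS, {'US$100': [{ORTH: 'US'}, {ORTH: '$'}, {ORTH: '100'}]})

And then I don't see how to plug the merged dict back into the tokenizer.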
Oh, another thing: I know the tokenizer_exceptions file in lang.en already contains exceptions such as splitting "i'm" into "i" and "'m". I tried commenting that part out, but it didn't work. I don't want the tokenizer to split "i'm"; how can I do that as well?
Thanks
Answer 1:
The solution is here:
import spacy
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy import util

def custom_en_tokenizer(en_vocab):
    prefixes = list(English.Defaults.prefixes)
    prefixes.remove(r'US\$')             # remove the built-in 'US$' currency prefix
    prefixes.append(r'(?:US)(?=\$\d+)')  # new rule: split off 'US' when '$<digits>' follows
    prefix_re = util.compile_prefix_regex(tuple(prefixes))
    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
    infix_re = util.compile_infix_regex(English.Defaults.infixes)
    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,
                     suffix_re.search,
                     infix_re.finditer,
                     token_match=None)
tokenizer = custom_en_tokenizer(spacy.blank('en').vocab)
for token in tokenizer('US$100'):
    print(token, end=' ')
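With these defaults the loop should print the three tokens US $ 100; note that the part-of-speech tags shown in the question come from the tagger, not from the tokenizer itself.

The answer above only covers the US$ case. For the second part of the question, keeping "i'm" as a single token, one option (a sketch, not part of the original answer) is to override the built-in exception with add_special_case instead of editing the tokenizer_exceptions file:

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')  # a plain pipeline for illustration; the custom Tokenizer above has add_special_case too
# The English defaults define exceptions for both "i'm" and "I'm",
# so override each variant you care about with a single-token case:
nlp.tokenizer.add_special_case("i'm", [{ORTH: "i'm"}])
nlp.tokenizer.add_special_case("I'm", [{ORTH: "I'm"}])
print([t.text for t in nlp("I'm sure i'm fine")])   # ["I'm", 'sure', "i'm", 'fine']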
Source: https://stackoverflow.com/questions/47313240/spacy-tokenizer-add-tokenizer-exception