Question
Hey! I'm trying to add tokenizer exceptions for some tokens using spaCy 2.0.2. I know about .tokenizer.add_special_case(), which I'm already using for some cases, but a token like US$100 gets split into two tokens:
('US$', 'SYM'), ('100', 'NUM')
But I want it split into three tokens, and instead of adding a special case for each possible number after "US$", I'd like a single exception for every token of the form US$NUMBER:
('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM')
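At the moment I get that result one literal string at a time with add_special_case, roughly like this (a simplified sketch with a made-up example string):

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')  # stand-in for whatever pipeline is actually loaded
# One special case per exact string; this is what I'd like to avoid:
nlp.tokenizer.add_special_case('US$100', [{ORTH: 'US'}, {ORTH: '$'}, {ORTH: '100'}])
print([t.text for t in nlp('US$100')])   # ['US', '$', '100']

But that only matches the exact string 'US$100', so I'd need one call per possible amount.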
I was reading about TOKENIZER_EXCEPTIONS in the spaCy documentation, but I can't figure out how to do this.
I was trying to use from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS, and also spacy.util, which has an update_exc() method.
Can someone post a full code example on how to do it?
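For what it's worth, this is roughly as far as I got with that approach (a sketch, and I'm not sure it's even the right direction, since the exception keys seem to be exact strings rather than patterns):

from spacy.symbols import ORTH
from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.util import update_exc

# Merge an extra case into the built-in exceptions; the keys are literal
# strings, so this still can't express "US$ followed by any number":
exc = update_exc(TOKENIZER_EXCEPTIONS, {'US$100': [{ORTH: 'US'}, {ORTH: '$'}, {ORTH: '100'}]})

And then I don't see how to plug the merged dict back into the tokenizer.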
Oh, another thing: I know the tokenizer_exceptions file in lang.en already contains exceptions such as splitting "i'm" into "i" and "'m". I tried commenting that part out, but it didn't work. I don't want the tokenizer to split "i'm"; how can I do that as well?
Thanks
Answer 1:
The solution is here:
import spacy
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy import util

def custom_en_tokenizer(en_vocab):
    prefixes = list(English.Defaults.prefixes)
    prefixes.remove(r'US\$')             # remove the built-in 'US$' currency prefix
    prefixes.append(r'(?:US)(?=\$\d+)')  # new rule: split off 'US' when '$<digits>' follows
    prefix_re = util.compile_prefix_regex(tuple(prefixes))
    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
    infix_re = util.compile_infix_regex(English.Defaults.infixes)
    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,
                     suffix_re.search,
                     infix_re.finditer,
                     token_match=None)
tokenizer = custom_en_tokenizer(spacy.blank('en').vocab)
for token in tokenizer('US$100'):
    print(token, end=' ')
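With these defaults the loop should print the three tokens US $ 100; note that the part-of-speech tags shown in the question come from the tagger, not from the tokenizer itself.

The answer above only covers the US$ case. For the second part of the question, keeping "i'm" as a single token, one option (a sketch, not part of the original answer) is to override the built-in exception with add_special_case instead of editing the tokenizer_exceptions file:

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')  # a plain pipeline for illustration; the custom Tokenizer above has add_special_case too
# The English defaults define exceptions for both "i'm" and "I'm",
# so override each variant you care about with a single-token case:
nlp.tokenizer.add_special_case("i'm", [{ORTH: "i'm"}])
nlp.tokenizer.add_special_case("I'm", [{ORTH: "I'm"}])
print([t.text for t in nlp("I'm sure i'm fine")])   # ["I'm", 'sure', "i'm", 'fine']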
Source: https://stackoverflow.com/questions/47313240/spacy-tokenizer-add-tokenizer-exception