Spacy tokenizer, add tokenizer exception

Submitted by 早过忘川 on 2021-01-28 09:55:04

Question


Hey! I am trying to add an exception to how some tokens are tokenized using spaCy 2.02. I know that .tokenizer.add_special_case() exists, and I am using it for some cases, but a token like US$100, for example, is split by spaCy into two tokens:

('US$', 'SYM'), ('100', 'NUM')

Instead of adding a special case for each number after US$, I want an exception for every token of the form US$NUMBER, so that it is split into three tokens like this:

('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM')

I was reading about TOKENIZER_EXCEPTIONS in the spaCy documentation, but I can't figure out how to do this.

I was trying to use from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS, and also spacy.util, which has a method update_exc().

Can someone post a full code example on how to do it?

Oh, another thing: I know that the tokenizer_exceptions file in lang.en already contains some exceptions, such as splitting "i'm" into "i" and "'m". I commented that part out, but it didn't work. I don't want the tokenizer to split "i'm". How can I do that as well?

Thanks


Answer 1:


The solution is to build a custom tokenizer whose prefix rules split off US before the dollar sign:

import spacy
from spacy import util
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

def custom_en_tokenizer(en_vocab):
    prefixes = list(English.Defaults.prefixes)
    prefixes.remove(r'US\$')  # Remove the built-in rule that treats US$ as a single prefix
    prefixes.append(r'(?:US)(?=\$\d+)')  # Split off US only when it is followed by $ and digits

    prefix_re = util.compile_prefix_regex(tuple(prefixes))
    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
    infix_re = util.compile_infix_regex(English.Defaults.infixes)

    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,
                     suffix_re.search,
                     infix_re.finditer,
                     token_match=None)

# Build the tokenizer on a blank English vocab and print the tokens of US$100
tokenizer = custom_en_tokenizer(spacy.blank('en').vocab)
for token in tokenizer('US$100'):
    print(token, end=' ')
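
To see why the appended prefix rule splits off only US, it may help to try the pattern on its own with the standard re module. The sketch below is an illustration and not part of the original answer: split_prefix is a hypothetical helper, and the leading ^ is added by hand on the assumption that it mirrors how util.compile_prefix_regex anchors each prefix at the start of the string.

```python
import re

# The pattern from the answer, anchored by hand with '^' (assumption: this
# mirrors what spaCy does when it compiles the prefix list).
us_prefix = re.compile(r'^(?:US)(?=\$\d+)')

def split_prefix(text):
    """Hypothetical helper: return (prefix, rest), or (None, text) if no match."""
    match = us_prefix.search(text)
    if match is None:
        return None, text
    return match.group(), text[match.end():]

print(split_prefix('US$100'))  # ('US', '$100')
print(split_prefix('US$'))     # (None, 'US$')
print(split_prefix('USA'))     # (None, 'USA')
```

The lookahead (?=\$\d+) checks that a dollar sign and at least one digit follow, but it does not consume them, so only US is split off as a prefix. The remaining $100 is then left for the tokenizer's other prefix rules to handle.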


Source: https://stackoverflow.com/questions/47313240/spacy-tokenizer-add-tokenizer-exception
