SpaCy Version: 2.0.11
Python Version: 3.6.5
OS: Ubuntu 16.04
My Sentence Samples:
Marketing-Representative- won't die in car accident.
Although this is not documented on the spaCy usage site, it looks like we just need to add a regex for the kind of *fix we are working with, in this case infix.
Also, it appears we can extend nlp.Defaults.prefixes with custom regexes:
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
This will give you the desired result. There is no need to set the prefix and suffix defaults, since we are not working with those.
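To see why the hyphenated words survive, it can help to look at the three add-on patterns in isolation. The sketch below uses only Python's `re` module and assumes that `compile_infix_regex` joins its patterns with `|` (that is my understanding of spaCy's behaviour, not something verified here). The helper name `find_addon_infixes` is made up for illustration, and the default prefix patterns are deliberately left out.

```python
import re

# The three add-on patterns from the answer, without nlp.Defaults.prefixes.
ADDON_PATTERNS = (r"[./]", r"[-]~", r"(.'.)")

# Assumption: spaCy combines infix patterns into a single alternation.
ADDON_RE = re.compile("|".join(ADDON_PATTERNS))


def find_addon_infixes(text):
    """Return the substrings that the add-on patterns would flag as infixes."""
    return [m.group(0) for m in ADDON_RE.finditer(text)]


for word in ["Marketing-Representative-", "won't", "accident.", "Out-of-box"]:
    print(word, find_addon_infixes(word))
```

Note that `[-]~` only matches a hyphen directly followed by `~`, so a plain hyphen is never reported as a split point. `(.'.)` picks up the `n't` in `won't`, and `[./]` picks up the trailing period.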
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en')

# Build the infix patterns from the default prefixes plus the custom regexes.
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    # Only infix handling is customized; prefix and suffix rules are not set.
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"
for s in s1, s2:
    doc = nlp(s)
    print([token.text for token in doc])
Result:
$ python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']
You may want to refine the add-on regexes to make them more robust for other kinds of tokens that are close to the applied patterns.
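As a rough illustration of that caveat, the add-on patterns also fire on tokens you might prefer to keep whole. This is a sketch with plain `re`, assuming the patterns end up joined with `|` when compiled. The function `affected_tokens` and the sample tokens are hypothetical and not part of spaCy.

```python
import re

# Add-on patterns only; the default prefix patterns are not included here.
ADDON_RE = re.compile("|".join((r"[./]", r"[-]~", r"(.'.)")))


def affected_tokens(tokens):
    """Map each token containing at least one add-on match to its matches."""
    hits = {}
    for token in tokens:
        found = [m.group(0) for m in ADDON_RE.finditer(token)]
        if found:
            hits[token] = found
    return hits


samples = ["3.14", "and/or", "O'Neil", "well-known"]
print(affected_tokens(samples))
```

Decimal numbers, slashes and names with apostrophes are all flagged here, while `well-known` is not, so these are the kinds of tokens worth checking before relying on the patterns as written.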