Why does spaCy not preserve intra-word hyphens during tokenization like Stanford CoreNLP does?

眼角桃花 2020-12-21 21:41

SpaCy Version: 2.0.11

Python Version: 3.6.5

OS: Ubuntu 16.04

My Sentence Samples:

Marketing-Representative- won't die in car accident.

1 Answer
  • 2020-12-21 22:18

    Although this is not documented on the spaCy usage site, it looks like we just need to add a regex for the *fix we are working with, in this case the infixes.

    It also appears that we can extend nlp.Defaults.prefixes with custom regexes and pass the result in as the infix patterns:

    infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
    

    This will give you the desired result. There is no need to set defaults for the prefixes and suffixes, since we are not working with those.
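    To see why those three patterns produce the splits shown below, you can check the combined infix regex with plain `re` before wiring it into the tokenizer. This is only a sketch of the custom patterns on their own (spaCy's compile_infix_regex joins its entries into one alternation, so the hand-built pattern here should behave the same way for these inputs); `infix_pat` is an illustrative name, not spaCy API:

```python
import re

# The three custom infix patterns from the answer, joined as alternatives.
infix_pat = re.compile(r"[./]|[-]~|(.'.)")

# None of the custom patterns match inside "Marketing-Representative-"
# ([-]~ requires a tilde after the hyphen), so the token stays intact.
print([m.group() for m in infix_pat.finditer("Marketing-Representative-")])  # []

# (.'.) matches "n't" inside "won't", which is what splits it into "wo" + "n't".
print([m.group() for m in infix_pat.finditer("won't")])  # ["n't"]

# [./] matches the trailing period, splitting it off "accident.".
print([m.group() for m in infix_pat.finditer("accident.")])  # ["."]
```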

    import spacy
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_infix_regex
    
    nlp = spacy.load('en')
    
    # Extend the default prefix patterns with custom infix regexes:
    # [./]   -> split on periods and slashes
    # [-]~   -> split on a hyphen followed by a tilde
    # (.'.)  -> split on an apostrophe surrounded by characters (e.g. "n't")
    infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
    
    infix_re = compile_infix_regex(infixes)
    
    def custom_tokenizer(nlp):
        # Only infix_finditer is overridden; prefix and suffix handling
        # are deliberately left unset.
        return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
    
    nlp.tokenizer = custom_tokenizer(nlp)
    
    s1 = "Marketing-Representative- won't die in car accident."
    s2 = "Out-of-box implementation"
    
    for s in (s1, s2):
        doc = nlp(s)
        print([token.text for token in doc])
    

    Result

    $ python3 /tmp/nlp.py  
    ['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
    ['Out-of-box', 'implementation']  
    

    You may want to refine the added regexes to make them more robust for other kinds of tokens that are close to the applied patterns.
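    For instance, checked with plain `re`, the loose patterns also match inside tokens you probably do not want to split. The strings below are illustrative examples, not from the answer:

```python
import re

# (.'.) matches any character, an apostrophe, and any character,
# so it also fires inside names like "O'Brien".
m = re.search(r".'.", "O'Brien")
print(m.group())  # "O'B" -- the name would be split around the apostrophe

# [./] matches every period, so abbreviations like "U.S." would
# be broken apart as well.
print([m.group() for m in re.finditer(r"[./]", "U.S.")])  # ['.', '.']
```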
