I am new to spaCy. I am adding this post as documentation, to make things simpler for beginners like me.
import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !!')
What is the meaning of orth, lemma, tag and pos?
See https://spacy.io/docs/usage/pos-tagging#pos-schemes
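For a quick feel of what these attributes hold, here is a minimal sketch (assuming the 'en' model is installed; the underscore variants are the human-readable strings, while the plain orth, lemma, tag and pos are integer IDs):

import spacy

nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !!')

for word in doc:
    # .orth_/.text: the raw token string; .lemma_: the base form;
    # .tag_: the fine-grained POS tag; .pos_: the coarse-grained POS tag.
    print(word.orth_, word.lemma_, word.tag_, word.pos_)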
What is the difference between print(word) and print(word.orth_)?
In super short:

word.orth_ and word.text are the same. The fact that the cython property ends with an underscore usually means it's a variable that the developers didn't really want to expose to the user.
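A minimal check of that claim (any processed doc will do; this sentence is just an example):

import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a foo bar sentence.')

# For every token, the string-valued .orth_ and .text agree.
assert all(word.orth_ == word.text for word in doc)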
In short:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks the word up by index in the store where all the vocabulary strings are kept:
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
(For details, see the "In long" section below for an explanation of self.c.lex.orth.)
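From the Python side you can watch this happen: token.orth is the integer ID stored on the lexeme, and looking it up in nlp.vocab.strings gives back exactly the string that .orth_ returns (a sketch, reusing the doc from above):

tok = doc[0]
print(tok.orth)                       # an integer ID for the token's string
print(nlp.vocab.strings[tok.orth])    # the string itself, i.e. what .orth_ returns
assert nlp.vocab.strings[tok.orth] == tok.orth_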
And word.text returns the string representation of the word, which merely wraps around the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128:
property text:
    def __get__(self):
        return self.orth_
And when you call print(word), Python calls the __str__ dunder (with __repr__ delegating to it), which in turn returns word.__unicode__() or word.__bytes__(), both of which point back to word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55:
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
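At the Python level you can see that printing a token goes through these dunders and ends up at the very same string (a quick sketch):

word = doc[0]
# print() calls str(), which (via __unicode__/__bytes__) returns word.text,
# so all three lines below print the same string.
print(word)
print(word.text)
print(word.orth_)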
In long:
Let's try to walk through this step by step:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>
After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:
cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.

    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.

    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')

    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """
So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object, we see a whole series of cython properties enumerated, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162:
property orth:
    def __get__(self):
        return self.c.lex.orth
Tracing it back, we see that self.c = &self.doc.c[offset]:
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset
Without thorough documentation, we don't really know exactly what self.c means, but from the looks of it, it accesses one of the tokens within the &self.doc reference that points to the Doc doc passed into the __cinit__ function. So most probably, it's a shortcut to access the tokens.
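We can at least observe that offset from Python: each Token remembers its position as token.i, and indexing the Doc at that position gives a view onto the same underlying struct (a small sketch):

for token in doc:
    # token.i is the `offset` passed to __cinit__; doc[token.i] is a view
    # onto the same TokenC struct, so the texts must match.
    assert doc[token.i].text == token.text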
Looking at the Doc.c:
cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING
Now we see that Doc.c refers to a cython pointer array data_start, which allocates the memory for the TokenC structs that back the spacy.tokens.doc.Doc object (please correct me if I got the <TokenC*> explanation wrong).
So going back to self.c = &self.doc.c[offset], it's basically accessing the memory location where the array is stored, and more specifically the "offset-th" item in that array. That's what a spacy.tokens.token.Token is.
Going back to the property:
property orth:
    def __get__(self):
        return self.c.lex.orth
We see that self.c.lex accesses data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer that indexes the word's string in spaCy's internal vocabulary.
Thus, we see that the property orth_ looks up self.vocab.strings with the index from self.c.lex.orth; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162:
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
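The string store works in both directions, which makes the round trip easy to check (a small sketch; the exact integer values differ between spaCy versions, since newer versions use hashes rather than sequential indices):

word = doc[0]
string_id = nlp.vocab.strings[word.orth_]          # string -> integer ID
assert string_id == word.orth                      # the same ID the lexeme stores
assert nlp.vocab.strings[string_id] == word.orth_  # ID -> string, and back again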
1) When you print word, you are basically printing the Token class from spaCy, which is set up to print out the string from the class. You can see more here. So it differs from printing word.orth_ or word.text only in the route taken; those print out the string directly.
2) I'm not sure about word.orth_; it seems to be the same as word.text in most cases. As for word.lemma_, it's the lemmatized form of the given word, e.g. is, am and are will all map to be in word.lemma_.
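A quick illustration of that lemma behaviour (output depends on the model, but the inflected forms of "to be" should all come back as be):

doc = nlp(u'I am here , you are there , she is everywhere .')
for word in doc:
    print(word.text, word.lemma_)   # "am", "are" and "is" should all show the lemma "be"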