spaCy Documentation for [orth, pos, tag, lemma and text]

夕颜 2021-02-13 19:34

I am new to spaCy. I am adding this post as documentation to make things simpler for newcomers like me.

    import spacy
    nlp = spacy.load('en')
    doc = nlp(u'KEEP CALM be


        
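For reference, here is a minimal sketch of the kind of loop this post is about. The example sentence is a stand-in, since the snippet above is cut off, and on newer spaCy versions the model is loaded as 'en_core_web_sm' rather than 'en':

    import spacy

    nlp = spacy.load('en')  # on newer spaCy versions: spacy.load('en_core_web_sm')
    doc = nlp(u'This is a sample sentence.')  # stand-in sentence

    for word in doc:
        # word.text / word.orth_ -> the raw token string
        # word.lemma_            -> the base form (e.g. "is" -> "be")
        # word.pos_              -> coarse-grained part-of-speech
        # word.tag_              -> fine-grained part-of-speech tag
        print(word.text, word.orth_, word.lemma_, word.pos_, word.tag_)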
2 Answers
  • 2021-02-13 19:55

    What is the meaning of orth, lemma, tag and pos?

    See https://spacy.io/docs/usage/pos-tagging#pos-schemes
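    As a rough illustration (my own example, not part of the original answer): orth_ is the raw text of the token, lemma_ its base form, pos_ the coarse-grained part-of-speech and tag_ the fine-grained, treebank-style tag. The exact values depend on the model version:

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> doc = nlp(u'The cats are sleeping.')
    >>> for w in doc:
    ...     print(w.orth_, w.lemma_, w.pos_, w.tag_)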

    What is the difference between print(word) and print(word.orth_)?

    In super short:

    word.orth_ and word.text return the same string. The trailing underscore is a spaCy convention: Cython properties that end with an underscore (orth_, lemma_, tag_, ...) return the human-readable string, while the same property without the underscore returns the integer ID that spaCy uses internally.

    In short:

    When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks the token's integer ID up in the store where all the vocabulary strings are kept:

    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]
    

    (For an explanation of self.c.lex.orth, see In long below.)

    And word.text returns the string representation of the word, which merely wraps the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

    property text:
        def __get__(self):
            return self.orth_
    

    And when you call print(word), Python goes through the __str__/__repr__ dunder methods, which return word.__unicode__ (on Python 3) or word.__bytes__ (on Python 2), and both of those point back to word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55

    cdef class Token:
        """
        An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset
    
        def __hash__(self):
            return hash((self.doc, self.i))
    
        def __len__(self):
            """
            Number of unicode characters in token.text.
            """
            return self.c.lex.length
    
        def __unicode__(self):
            return self.text
    
        def __bytes__(self):
            return self.text.encode('utf8')
    
        def __str__(self):
            if is_config(python3=True):
                return self.__unicode__()
            return self.__bytes__()
    
        def __repr__(self):
            return self.__str__()
    

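    Putting the above together, here is a quick REPL check (my own sketch) showing that all three routes end at the same string:

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> doc = nlp(u'This is a foo bar sentence.')
    >>> word = doc[3]
    >>> print(word)   # __repr__ -> __str__ -> __unicode__/__bytes__ -> word.text
    foo
    >>> word.text     # wraps word.orth_
    u'foo'
    >>> word.orth_    # looks up self.vocab.strings[self.c.lex.orth]
    u'foo'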
    In long:

    Let's try to walk through this step by step:

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> doc = nlp(u'This is a foo bar sentence.')
    >>> type(doc)
    <type 'spacy.tokens.doc.Doc'>
    

    After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object, from the docs:

    cdef class Doc:
        """
        A sequence of `Token` objects. Access sentences and named entities,
        export annotations to numpy arrays, losslessly serialize to compressed
        binary strings.
        Aside: Internals
            The `Doc` object holds an array of `TokenC` structs.
            The Python-level `Token` and `Span` objects are views of this
            array, i.e. they don't own the data themselves.
        Code: Construction 1
            doc = nlp.tokenizer(u'Some text')
        Code: Construction 2
            doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
        """
    

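    As a quick sanity check (my own sketch, using "Construction 1" from that docstring and the nlp object loaded above):

    >>> doc = nlp.tokenizer(u'Some text')
    >>> type(doc)
    <type 'spacy.tokens.doc.Doc'>
    >>> [type(t) for t in doc]
    [<type 'spacy.tokens.token.Token'>, <type 'spacy.tokens.token.Token'>]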
    So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object, we see a whole series of Cython properties enumerated, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

    property orth:
        def __get__(self):
            return self.c.lex.orth
    

    Tracing it back, we see that self.c = &self.doc.c[offset]:

    cdef class Token:
        """
        An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset
    

    Without thorough documentation we don't really know what self.c means, but from the looks of it, it accesses one of the tokens through the &self.doc reference that points to the Doc doc passed into the __cinit__ function. So, most probably, it is a shortcut for accessing the tokens.

    Looking at the Doc.c:

    cdef class Doc:
        def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
            self.vocab = vocab
            size = 20
            self.mem = Pool()
            # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
            # However, we need to remember the true starting places, so that we can
            # realloc.
            data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
            cdef int i
            for i in range(size + (PADDING*2)):
                data_start[i].lex = &EMPTY_LEXEME
                data_start[i].l_edge = i
                data_start[i].r_edge = i
            self.c = data_start + PADDING
    

    Now we see that Doc.c refers to a Cython pointer, data_start, into the memory allocated to hold the array of TokenC structs that backs the spacy.tokens.doc.Doc object (please correct me if I get the <TokenC*> explanation wrong).

    So going back to self.c = &self.doc.c[offset]: it basically takes the address of the "offset-th" item in that array, i.e. a pointer to this particular token's TokenC struct.

    That's what spacy.tokens.token.Token is.
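    From the Python side, that offset is exposed as Token.i (note the self.i = offset in __cinit__ above), so indexing the Doc with it gets you back to the same token. A small sketch of my own, continuing the REPL session:

    >>> word = doc[3]
    >>> word.i            # the `offset` passed into Token.__cinit__
    3
    >>> doc[word.i].text  # the "offset-th" item of the Doc's token array
    u'foo'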


    Going back to the property:

    property orth:
        def __get__(self):
            return self.c.lex.orth
    

    We see that self.c.lex accesses data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer ID that indexes the word's entry in the vocabulary's string store.

    Thus, we see that the property orth_ looks up self.vocab.strings with the index from self.c.lex.orth: https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]
    
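    In other words (again my own sketch), orth is just an integer ID and orth_ is the result of looking that ID up in the vocabulary's string store:

    >>> word = doc[3]
    >>> word.orth_                     # the human-readable string
    u'foo'
    >>> nlp.vocab.strings[word.orth]   # the same lookup the orth_ property performs
    u'foo'
    >>> nlp.vocab.strings[u'foo'] == word.orth   # the StringStore maps both ways
    True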
  • 2021-02-13 19:59

    1) When you print word, you are basically printing spaCy's Token class, whose string methods are set up to print out the token's text (see the Token source for details). So it differs from printing word.orth_ or word.text only in the mechanism; those attributes are plain strings and are printed out directly.

    2) I'm not sure about word.orth_; it seems to be the same as word.text in most cases. word.lemma_ is the lemmatized form of the given word, e.g. is, am and are all map to be in word.lemma_.
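    For example (my own snippet; exact lemmas can vary a little with the model version):

    import spacy
    nlp = spacy.load('en')
    doc = nlp(u'I am here and they are here')
    for word in doc:
        # "am" and "are" are both mapped to the lemma "be"
        print(word.text, word.lemma_)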
