I am new to spaCy. I am adding this post as documentation, to make things simpler for beginners like me.
import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !!')
What is the meaning of orth, lemma, tag and pos?
See https://spacy.io/docs/usage/pos-tagging#pos-schemes
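For a quick feel of what these attributes hold, here is a minimal sketch (assuming the 'en' model is installed; the underscore variants are the human-readable strings, while the plain orth, lemma, tag and pos are integer IDs):

import spacy

nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !!')

for word in doc:
    # .orth_/.text: the raw token string; .lemma_: the base form;
    # .tag_: the fine-grained POS tag; .pos_: the coarse-grained POS tag.
    print(word.orth_, word.lemma_, word.tag_, word.pos_)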
What is the difference between print(word) and print(word.orth_)?
In super short:

word.orth_ and word.text are the same. The fact that the cython property ends with an underscore usually means it's a variable that the developers didn't really want to expose to the user.
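A minimal check of that claim (any processed doc will do; this sentence is just an example):

import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a foo bar sentence.')

# For every token, the string-valued .orth_ and .text agree.
assert all(word.orth_ == word.text for word in doc)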
In short:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks the word up by index in the store where all the vocabulary strings are kept:
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
(For details, see the "In long" section below for an explanation of self.c.lex.orth.)
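From the Python side you can watch this happen: token.orth is the integer ID stored on the lexeme, and looking it up in nlp.vocab.strings gives back exactly the string that .orth_ returns (a sketch, reusing the doc from above):

tok = doc[0]
print(tok.orth)                       # an integer ID for the token's string
print(nlp.vocab.strings[tok.orth])    # the string itself, i.e. what .orth_ returns
assert nlp.vocab.strings[tok.orth] == tok.orth_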
And word.text returns the string representation of the word, which merely wraps around the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128:
property text:
    def __get__(self):
        return self.orth_
And when you call print(word), Python calls the __str__ dunder (with __repr__ delegating to it), which in turn returns word.__unicode__() or word.__bytes__(), both of which point back to word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55:
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
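At the Python level you can see that printing a token goes through these dunders and ends up at the very same string (a quick sketch):

word = doc[0]
# print() calls str(), which (via __unicode__/__bytes__) returns word.text,
# so all three lines below print the same string.
print(word)
print(word.text)
print(word.orth_)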
In long:
Let's try to walk through this step by step:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>
After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:
cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.

    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.

    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')

    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """
So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object, we see a whole series of cython properties enumerated, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162:
property orth:
    def __get__(self):
        return self.c.lex.orth
Tracing it back, we see that self.c = &self.doc.c[offset]:
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset
Without thorough documentation, we don't really know exactly what self.c means, but from the looks of it, it accesses one of the tokens within the &self.doc reference that points to the Doc doc passed into the __cinit__ function. So most probably, it's a shortcut to access the tokens.
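We can at least observe that offset from Python: each Token remembers its position as token.i, and indexing the Doc at that position gives a view onto the same underlying struct (a small sketch):

for token in doc:
    # token.i is the `offset` passed to __cinit__; doc[token.i] is a view
    # onto the same TokenC struct, so the texts must match.
    assert doc[token.i].text == token.text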
Looking at the Doc.c:
cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING
Now we see that Doc.c refers to a cython pointer array data_start, which allocates the memory for the TokenC structs that back the spacy.tokens.doc.Doc object (please correct me if I got the <TokenC*> explanation wrong).
So going back to self.c = &self.doc.c[offset], it's basically accessing the memory location where the array is stored, and more specifically the "offset-th" item in that array. That's what a spacy.tokens.token.Token is.
Going back to the property:
property orth:
    def __get__(self):
        return self.c.lex.orth
We see that self.c.lex accesses data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer that indexes the word's string in spaCy's internal vocabulary.
Thus, we see that the property orth_ looks up self.vocab.strings with the index from self.c.lex.orth; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162:
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
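The string store works in both directions, which makes the round trip easy to check (a small sketch; the exact integer values differ between spaCy versions, since newer versions use hashes rather than sequential indices):

word = doc[0]
string_id = nlp.vocab.strings[word.orth_]          # string -> integer ID
assert string_id == word.orth                      # the same ID the lexeme stores
assert nlp.vocab.strings[string_id] == word.orth_  # ID -> string, and back again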
1) When you print word, you are basically printing the Token class from spaCy, which is set up to print out the string from the class. You can see more here. So it differs from printing word.orth_ or word.text only in the route taken; those print out the string directly.
2) I'm not sure about word.orth_; it seems to be the same as word.text in most cases. As for word.lemma_, it's the lemmatized form of the given word, e.g. is, am and are will all map to be in word.lemma_.
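A quick illustration of that lemma behaviour (output depends on the model, but the inflected forms of "to be" should all come back as be):

doc = nlp(u'I am here , you are there , she is everywhere .')
for word in doc:
    print(word.text, word.lemma_)   # "am", "are" and "is" should all show the lemma "be"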