Question
I'm trying to do typo correction with spaCy, and for that I need to know whether a word exists in the vocab. If it doesn't, the idea is to split the word in two until all segments exist. For example, "ofthe" does not exist, but "of" and "the" do. So I first need to know whether a word exists in the vocab, and that's where the problems start. I tried:
for token in nlp("apple"):
    print(token.lemma_, token.lemma, token.is_oov, "apple" in nlp.vocab)
apple 8566208034543834098 True True
for token in nlp("andshy"):
    print(token.lemma_, token.lemma, token.is_oov, "andshy" in nlp.vocab)
andshy 4682930577439079723 True True
Clearly this makes no sense: in both cases is_oov is True, yet the word is also reported as being in the vocabulary. I'm looking for something simple like:
"andshy" in nlp.vocab = False, "andshy".is_oov = True
"apple" in nlp.vocab = True, "apple".is_oov = False
As a next step, I'd also like some word correction method. I could use the spellchecker library, but that isn't consistent with the spaCy vocab.
This problem appears to be a frequent question, and any suggestions (code) are most welcome.
thanks,
AHe
Answer 1:
Short answer: spaCy's models do not contain any word lists that are suitable for spelling correction.
Longer answer:
spaCy's vocab is not a fixed list of words in a particular language. It is just a cache with lexical information about tokens that have been seen during training and processing. Checking whether a token is in nlp.vocab only checks whether it is in this cache, so it is not a useful check for spelling correction.
Token.is_oov has a more specific meaning that's not obvious from its short description in the docs: it reports whether the model contains some additional lexical information about this token, such as Token.prob. For a small spaCy model like en_core_web_sm, which doesn't contain any probabilities, is_oov will be True for all tokens by default. The md and lg models contain lexical information about 1M+ tokens and the word vectors contain 600K+ tokens, but these lists are too large and noisy to be useful for spelling correction.
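Since nlp.vocab can't serve as the dictionary, the splitting idea from the question needs a separate, curated word list. The following is only a rough sketch of that idea and does not use any spaCy API: KNOWN_WORDS is a tiny hypothetical stand-in for a real dictionary file, and split_known is an illustrative helper name, not something from spaCy or the original post. It returns the split with the fewest segments, or None if no split into known words exists.

```python
from functools import lru_cache

# Hypothetical stand-in for a curated dictionary (assumption, not spaCy data).
# In practice this would be loaded from a real word list.
KNOWN_WORDS = frozenset({"of", "the", "and", "shy", "apple", "a", "in"})


def split_known(word, known=KNOWN_WORDS):
    """Split `word` into the fewest segments that are all in `known`.

    Returns a list of segments, or None if no such split exists.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Best segmentation of word[start:], as a tuple, or None.
        if start == len(word):
            return ()
        best = None
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in known:
                rest = solve(end)
                if rest is not None:
                    candidate = (piece,) + rest
                    if best is None or len(candidate) < len(best):
                        best = candidate
        return best

    result = solve(0)
    return list(result) if result is not None else None


print(split_known("ofthe"))   # ['of', 'the']
print(split_known("andshy"))  # ['and', 'shy']
print(split_known("apple"))   # ['apple']
print(split_known("xyzzy"))   # None
```

Preferring the fewest segments keeps a real word like "apple" whole instead of breaking it into shorter dictionary entries. The quality of the result depends entirely on the word list, which is exactly why a large, noisy list such as the md or lg vocab would work poorly here.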
Answer 2:
For spellchecking, you can try spacy_hunspell. You can add this to the pipeline.
More info and sample code is here: https://spacy.io/universe/project/spacy_hunspell
Source: https://stackoverflow.com/questions/59523161/spacy-word-in-vocabulary