Can a token be removed from a spaCy document during pipeline processing?


I am using spaCy (a great Python NLP library) to process a number of very large documents; however, my corpus has a number of common words that I would like to eliminate in pre-processing.

1 Answer
  • 2021-01-15 20:36

    spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.

    While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever criteria you want to use:

    from spacy.tokens import Token
    
    def get_is_excluded(token):
        # Getter function to determine the value of token._.is_excluded
        return token.text in ['some', 'excluded', 'words']
    
    Token.set_extension('is_excluded', getter=get_is_excluded)
    

    When processing a Doc, you can now filter it to only get the tokens that are not excluded:

    doc = nlp("Test that tokens are excluded")
    print([token.text for token in doc if not token._.is_excluded])
    # ['Test', 'that', 'tokens', 'are']
    

    You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
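    For example, here is a minimal sketch of that approach, assuming spaCy v3 and the en_core_web_sm model (the phrase "for example" and the EXCLUDED label are purely illustrative). Note that is_excluded is registered here with a default value rather than a getter, so that it can be assigned:

    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.tokens import Token

    nlp = spacy.load("en_core_web_sm")

    # Use a default instead of a getter so the attribute is writable
    Token.set_extension("is_excluded", default=False, force=True)

    # Compare lowercase forms so matching is case-insensitive
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("EXCLUDED", [nlp.make_doc("for example")])

    doc = nlp("Some tokens, for example, should be excluded.")
    for match_id, start, end in matcher(doc):
        for token in doc[start:end]:
            token._.is_excluded = True

    print([t.text for t in doc if not t._.is_excluded])
    # ['Some', 'tokens', ',', ',', 'should', 'be', 'excluded', '.']
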

    Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
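    As a rough sketch of that (the filter condition and variable names are just illustrative; also note that the trailing whitespace of the last kept token is preserved):

    import spacy
    from spacy.attrs import POS
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Test that tokens are excluded")

    # Decide which tokens to keep (here: everything except "excluded")
    kept = [t for t in doc if t.text != "excluded"]
    words = [t.text for t in kept]
    spaces = [bool(t.whitespace_) for t in kept]

    # Build a new Doc from the surviving tokens
    new_doc = Doc(nlp.vocab, words=words, spaces=spaces)

    # Optionally copy token attributes (here: part-of-speech IDs) from
    # the original Doc as a numpy array of attribute values
    pos_array = doc.to_array([POS])
    new_doc = new_doc.from_array([POS], pos_array[[t.i for t in kept]])

    print(new_doc.text)  # 'Test that tokens are '
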
