Can a token be removed from a spaCy document during pipeline processing?


I am using spaCy (a great Python NLP library) to process a number of very large documents; however, my corpus has a number of common words that I would like to eliminate in pre-processing.

1 Answer
  • 2021-01-15 20:36

    spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.

    While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever criteria you want to use:

    from spacy.tokens import Token
    
    def get_is_excluded(token):
        # Getter function to determine the value of token._.is_excluded
        return token.text in ['some', 'excluded', 'words']
    
    Token.set_extension('is_excluded', getter=get_is_excluded)
    

    When processing a Doc, you can now filter it to only get the tokens that are not excluded:

    doc = nlp("Test that tokens are excluded")
    print([token.text for token in doc if not token._.is_excluded])
    # ['Test', 'that', 'tokens', 'are']
    

    You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
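    For example, here is a minimal sketch of that approach, assuming spaCy v3 and the en_core_web_sm model (the phrase "for example" and the EXCLUDED label are purely illustrative). Note that is_excluded is registered here with a default value rather than a getter, so that it can be assigned:

    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.tokens import Token

    nlp = spacy.load("en_core_web_sm")

    # Use a default instead of a getter so the attribute is writable
    Token.set_extension("is_excluded", default=False, force=True)

    # Compare lowercase forms so matching is case-insensitive
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("EXCLUDED", [nlp.make_doc("for example")])

    doc = nlp("Some tokens, for example, should be excluded.")
    for match_id, start, end in matcher(doc):
        for token in doc[start:end]:
            token._.is_excluded = True

    print([t.text for t in doc if not t._.is_excluded])
    # ['Some', 'tokens', ',', ',', 'should', 'be', 'excluded', '.']
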

    Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
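    As a rough sketch of that (the filter condition and variable names are just illustrative; also note that the trailing whitespace of the last kept token is preserved):

    import spacy
    from spacy.attrs import POS
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Test that tokens are excluded")

    # Decide which tokens to keep (here: everything except "excluded")
    kept = [t for t in doc if t.text != "excluded"]
    words = [t.text for t in kept]
    spaces = [bool(t.whitespace_) for t in kept]

    # Build a new Doc from the surviving tokens
    new_doc = Doc(nlp.vocab, words=words, spaces=spaces)

    # Optionally copy token attributes (here: part-of-speech IDs) from
    # the original Doc as a numpy array of attribute values
    pos_array = doc.to_array([POS])
    new_doc = new_doc.from_array([POS], pos_array[[t.i for t in kept]])

    print(new_doc.text)  # 'Test that tokens are '
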
