Spacy replace token

Question

I am trying to replace a word without destroying the whitespace structure of the sentence. Suppose I have the sentence text = "Hi this is my dog." and I wish to replace dog with Simba. Following the answer at https://stackoverflow.com/a/57206316/2530674, I did:

import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc

doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba .

Notice the extra space before the full stop at the end (it ought to be Hi this is my Simba.). Is there a way to avoid this behaviour? I'm happy with a general Python string-processing answer too.
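For context: when Doc is built from words alone, with no spaces argument, every token is treated as being followed by a space, which is where the stray space before the full stop comes from. Below is a minimal plain-Python sketch of the underlying idea, which is to remember each token's own trailing whitespace and re-attach it after the swap. The helper names tokenize_with_ws and replace_token are hypothetical, and the regex tokenizer is only a toy stand-in for spaCy's tokenizer.

```python
import re

def tokenize_with_ws(text):
    # Toy tokenizer (an assumption, not spaCy's): words or single punctuation
    # marks, each paired with the whitespace that follows it.
    # Whitespace at the very start of the text is not captured.
    return re.findall(r"(\w+|[^\w\s])(\s*)", text)

def replace_token(text, old, new):
    # Swap matching tokens but re-attach each token's own trailing whitespace.
    return "".join((new if tok == old else tok) + ws
                   for tok, ws in tokenize_with_ws(text))

print(replace_token("Hi this is my dog.", "dog", "Simba"))  # Hi this is my Simba.
```

In spaCy itself the same information is available as token.whitespace_, so one option is to pass spaces=[bool(t.whitespace_) for t in doc1] alongside words when constructing the new Doc.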

Answer 1:

The function below replaces any number of matches (found with spaCy), keeps the same whitespace as the original text, and handles edge cases appropriately (such as a match at the beginning of the text):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 use matcher.add("dog", [[{"LOWER": "dog"}]])
matcher.add("dog", None, [{"LOWER": "dog"}])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:
            # We skipped over some tokens: add them back, with trailing whitespace if available
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        # Replace the token, keeping its trailing whitespace if available
        text += replacement + tok[match_start].whitespace_
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.

>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.

Answer 2:

One way to do this extensibly is to use the spaCy Matcher and build a new Doc object, like so:

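import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 use matcher.add("dog", [[{"LOWER": "dog"}]])
matcher.add("dog", None, [{"LOWER": "dog"}])

def replace_word(text, replacement):
    doc = nlp(text)
    match_id, start, end = matcher(doc)[0]  # assumes exactly one match
    # Keep everything before and after the match, including the original whitespace
    new_text = doc[:start].text_with_ws + replacement + doc[end - 1].whitespace_ + doc[end:].text
    return nlp.make_doc(new_text)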

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.

You could of course expand this pattern and replace all instances of "dog" by adding a for-loop in the function instead of just replacing the first match, and you could swap out rules in the matcher to change different words.

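The nice thing about doing it this way, even though it's more complex, is that you work with a spaCy Doc object, so the match rules can use information like lemmas, parts of speech, entities, the dependency parse, etc. Note that nlp.make_doc only tokenizes, so run nlp(...) on the new text if you need those annotations on the result.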

But if you just have a string, you don't need to worry about all that. To do this with plain Python, I'd use regex.

import re

def replace_word_re(text, word, replacement):
    # re.escape keeps special characters in `word` from being read as regex syntax
    return re.sub(re.escape(word), replacement, text)

>>> replace_word_re("Hi this is my dog.", "dog", "Simba")
Hi this is my Simba.
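The idea of swapping out rules to change different words can also be sketched with plain regex. This is a minimal sketch, not the Matcher approach above: replace_words is a hypothetical helper that takes a non-empty dict of word-to-replacement pairs and only replaces whole words, matching case-sensitively.

```python
import re

def replace_words(text, mapping):
    # Hypothetical helper: `mapping` maps each word to its replacement.
    if not mapping:
        return text
    # Longer keys first, so a key that is a prefix of another cannot shadow it.
    keys = sorted(mapping, key=len, reverse=True)
    # \b anchors restrict matches to whole words; re.escape guards special characters.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A function replacement avoids backslash handling in the replacement text.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(replace_words("Hi this is my dog, not my cat. dogdog",
                    {"dog": "Simba", "cat": "Nala"}))
# Hi this is my Simba, not my Nala. dogdog
```

Unlike a bare re.sub(word, ...), this leaves strings such as dogdog untouched, because the word must be delimited on both sides.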

Answer 3:

So it seems like you are looking for a regular replace? I would just do:

string = "Hi this is my dog."
string = string.replace("dog","Simba")

Answer 4:

text = 'Hello This is my dog'
print(text.replace('dog', 'simba'))

Answer 5:

Thanks to @lora-johns I found this answer. So without going down the matcher route, I think this might be a simpler answer:

text = "Hi this is my dog."
doc1 = nlp(text)
# (character offset, length) of every token to replace
spans = [(token.idx, len(token.text)) for token in doc1 if token.text.lower() == "dog"]
# Replace from the end of the string to the start so earlier offsets stay valid
for start, length in sorted(spans, reverse=True):
    text = text[:start] + "Simba" + text[start + length:]
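To see why the replacements run from the end of the string backwards, here is a minimal plain-Python sketch of the same offset-based technique. It is a sketch under assumptions: replace_at_offsets is a hypothetical helper, and re.finditer stands in for spaCy's token.idx as the source of character offsets.

```python
import re

def replace_at_offsets(text, spans, replacement):
    # spans holds (start, length) pairs. Going right to left means a
    # replacement of a different length never shifts an offset we still need.
    for start, length in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[start + length:]
    return text

text = "Hi this dog is my dog."
# Stand-in for token.idx: character offsets of each whole-word match.
spans = [(m.start(), len(m.group())) for m in re.finditer(r"\bdog\b", text)]
print(spans)                                      # [(8, 3), (18, 3)]
print(replace_at_offsets(text, spans, "Simba"))   # Hi this Simba is my Simba.
```

Going left to right instead, the first replacement would lengthen the string by two characters and the second offset (18) would no longer point at the second dog.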

Answer 6:

Here is how I do it with regex:

import re

sentence = 'Hi this is my dog. dogdog this is mydog'
replacement = 'Simba'
to_replace = 'dog'
# Lookarounds match 'dog' only as a whole word and consume no surrounding characters
st = re.sub(rf'(?<!\w){re.escape(to_replace)}(?!\w)', replacement, sentence)

Source: https://stackoverflow.com/questions/62785916/spacy-replace-token

Tags

python

spacy