How do I add the tag NEG_ to all words that follow "not", "no" and "never" until the next punctuation mark in a string (used for
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of the string starting with "not", "no" or "never". The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point (not? covers both "no" and "not"). You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
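The three steps above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the function name tag_negated, the re.IGNORECASE flag, and the choice to split on whitespace only are illustrative and not part of the answer above.

```python
import re

# Regex from the answer above; re.IGNORECASE is an added assumption so that
# a sentence-initial "Not" or "Never" is also picked up.
NEGATION = re.compile(r'\b(?:not?|never)\b([^.,:;!?]+)', re.IGNORECASE)

def tag_negated(text):
    """Prefix NEG_ to every word between a negation word and the next punctuation mark."""
    pieces = []
    last = 0
    for match in NEGATION.finditer(text):
        # Step 1: group 1 is the text after the negation word, up to the punctuation.
        scope = match.group(1)
        # Step 2: split on whitespace (keeping the separators) and tag each word.
        tokens = re.split(r'(\s+)', scope)
        tagged = ''.join(
            'NEG_' + tok if tok and not tok.isspace() else tok
            for tok in tokens
        )
        # Step 3: put the tagged text back in place of the original scope.
        pieces.append(text[last:match.start(1)])
        pieces.append(tagged)
        last = match.end(1)
    pieces.append(text[last:])
    return ''.join(pieces)

print(tag_negated("He did not play so well, so he had to practice."))
# He did not NEG_play NEG_so NEG_well, so he had to practice.
```

Splitting on whitespace keeps the original spacing intact; if your text has punctuation that is not in the character class (quotes, parentheses), it stays attached to the tagged word.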
To make up for the lack of some Perl abilities in Python's re regex engine, you can pass a lambda expression to re.sub to create a dynamic replacement:
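import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
# Outer pattern: a negation word, the words after it, and the closing punctuation mark.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     # Inner substitution: prefix NEG_ to every word preceded by whitespace.
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)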
This prints:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
That is your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], plus other Unicode letters and digits for Python 3 str patterns; \s is any kind of whitespace), up until something that is neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match the end of the string as well.
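As an illustration of that last note, here is a hedged sketch of the variant without the trailing [^\w\s], so that a negated clause running to the end of the string is tagged too. The function name tag_negated_to_end is hypothetical and chosen only for this example.

```python
import re

def tag_negated_to_end(text):
    """Same idea as above, but the closing punctuation mark is optional."""
    return re.sub(
        r'\b(?:not|never|no)\b[\w\s]+',  # outer pattern without the trailing [^\w\s]
        lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
        text,
        flags=re.IGNORECASE,
    )

print(tag_negated_to_end("I will never go back"))
# I will never NEG_go NEG_back
```

The match still stops at the first character that is neither a word character nor whitespace, so sentences that do end in punctuation are handled the same way as before.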
Now you're dealing with strings like "never going to work,". Just select the words preceded by spaces with
(\s+)(\w+)
and replace them with what you want:
\1NEG_\2
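To see the inner substitution in isolation, it can be run on a single matched fragment. The variable names fragment and tagged are just for this small sketch.

```python
import re

# One fragment as selected by the outer pattern: keyword, words, punctuation.
fragment = "never going to work,"

# Each word preceded by whitespace gets the NEG_ prefix; the keyword itself
# is at the start of the fragment (no leading whitespace), so it is untouched.
tagged = re.sub(r'(\s+)(\w+)', r'\1NEG_\2', fragment)
print(tagged)
# never NEG_going NEG_to NEG_work,
```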
I would not do this with a regexp. Rather, I would: