How to add tags to negated words in strings that follow “not”, “no” and “never”


How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string (used for


3 Answers

悲哀的现实

2021-01-03 12:13
You will need to do this in several steps, at least in Python (.NET languages can use a regex engine that has more capabilities):
- First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
- Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
- Join the string together again and insert the result in your original string in the place of the first regex's match.
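The three steps above can be sketched roughly as follows. This is only one possible reading of the answer: the function names `tag_scope` and `add_neg_tags` are made up for illustration, and passing a callback to `re.sub` is used here as a convenient way to put the result back in place of the first regex's match.

```python
import re

# Step 1: a negation word, then everything up to the next punctuation mark.
# Group 1 captures the negated span (the punctuation list is from the answer).
NEGATION = re.compile(r'\b(?:not?|never)\b([^.,:;!?]+)', flags=re.IGNORECASE)

def tag_scope(match):
    # The keyword is the part of the match that precedes group 1.
    keyword = match.group(0)[:match.start(1) - match.start(0)]
    # Step 2: prepend NEG_ to every word inside the negated span.
    tagged = re.sub(r'\w+', r'NEG_\g<0>', match.group(1))
    # Step 3: join keyword and tagged span; re.sub inserts it into the string.
    return keyword + tagged

def add_neg_tags(text):
    return NEGATION.sub(tag_scope, text)

print(add_neg_tags("He did not play so well, so he had to practice."))
# He did not NEG_play NEG_so NEG_well, so he had to practice.
```

Because the span in group 1 does not require a closing punctuation mark, a negation that runs to the end of the string is tagged as well.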
独厮守ぢ

2021-01-03 12:17
Python's re engine lacks some of Perl's features, but you can make up for that by passing a lambda expression as the replacement argument of re.sub to build the replacement dynamically:
```
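import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"

# Outer pattern: a negation word plus everything up to the next punctuation mark.
# Inner substitution: prefix each word that follows the negation word with NEG_.
transformed = re.sub(
    r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
    lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
    string,
    flags=re.IGNORECASE)
print(transformed)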
```
The result is:
```
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
```
Explanation
- The first step is to select the parts of your string you're interested in. This is done with
```
\b(?:not|never|no)\b[\w\s]+[^\w\s]
```
  Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_] for ASCII text, and in Python 3 it also matches Unicode letters and digits by default; \s is any kind of whitespace), up until something that is neither alphanumeric nor whitespace (acting as punctuation).
  
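  Note that the trailing punctuation is mandatory here, so a negation that runs to the end of the string without punctuation is not matched. You could safely remove [^\w\s] to match up to the end of the string as well.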
- Now you are dealing with strings such as "never going to work,". Just select the words preceded by whitespace with
```
(\s+)(\w+)
```
  and replace each of them with the same whitespace, the NEG_ prefix, and the word:
```
\1NEG_\2
```
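As a small illustration of the note about dropping the mandatory punctuation, here is a sketch of the same two-level substitution without the trailing [^\w\s]. The wrapper function `tag_negated` is a hypothetical name introduced only for this example.

```python
import re

def tag_negated(text):
    # Same outer/inner substitution as in the answer, minus the trailing
    # [^\w\s], so a negation that reaches the end of the string also matches.
    return re.sub(
        r'\b(?:not|never|no)\b[\w\s]+',
        lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
        text,
        flags=re.IGNORECASE,
    )

print(tag_negated("I will never give up"))
# I will never NEG_give NEG_up
```

The outer match still stops at the first character that is neither a word character nor whitespace, so text after a comma or full stop is left untouched.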
难免孤独

2021-01-03 12:26
I would not do this with a regexp. Rather, I would:
- Split the input on punctuation characters.
- For each fragment:
  - Set the negation counter to 0.
  - Split the fragment into words.
  - For each word:
    - Add NEG_ to the word as many times as the negation counter indicates (or take the counter mod 2, or use 1 if it is greater than 0).
    - If the original word is in {No, Never, Not}, increase the negation counter by one.
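A minimal sketch of this counter-based idea, under a few assumptions that are not in the answer: punctuation is detected as trailing characters on whitespace-separated tokens (using `string.punctuation`) instead of splitting the text into fragments first, each negated word gets a single NEG_ (the "1 if greater than 0" variant), and the function name `tag_by_counting` is made up for illustration.

```python
import string

NEGATIONS = {"no", "not", "never"}

def tag_by_counting(text):
    negations = 0  # negation counter for the current fragment
    out = []
    for token in text.split():
        # Trailing punctuation marks the end of the current fragment.
        word = token.rstrip(string.punctuation)
        trailing = token[len(word):]
        # Tag the word once if at least one negation is active.
        prefix = "NEG_" if negations > 0 and word else ""
        out.append(prefix + word + trailing)
        # Update the counter only after tagging, as in the steps above.
        if word.lower() in NEGATIONS:
            negations += 1
        if trailing:
            negations = 0  # punctuation reached: reset for the next fragment
    return " ".join(out)

print(tag_by_counting("He did not play so well, so he had to practice."))
# He did not NEG_play NEG_so NEG_well, so he had to practice.
```

Note that splitting and re-joining on whitespace collapses runs of spaces, and a second negation word inside an already negated fragment is itself tagged, which follows the order of the steps in the answer.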