How to add tags to negated words in strings that follow “not”, “no” and “never”

醉话见心 2021-01-03 12:01

How do I add the tag NEG_ to all words that follow not, no and never, up to the next punctuation mark in a string (used for

3 Answers
  • 2021-01-03 12:13

    You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):

    • First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.

    • Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.

    • Join the string together again and insert the result in your original string in the place of the first regex's match.
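    The three steps above can be sketched in Python roughly as follows. This is only an illustration, not the answerer's own code: the names `NEGATION_SCOPE`, `tag_scope` and `add_neg_tags` are made up for the example, the keyword is captured in its own group (the regex above uses a non-capturing group) so it can be put back in front of the tagged words, and case-insensitive matching is an assumption.

```python
import re

# Step 1: a negation keyword, then everything up to the next punctuation mark.
# Group 1 is the keyword, group 2 is the text to tag (assumption: the keyword
# is captured here, unlike in the answer's regex, to simplify reassembly).
NEGATION_SCOPE = re.compile(r"\b(not?|never)\b([^.,:;!?]+)", re.IGNORECASE)


def tag_scope(match):
    keyword, scope = match.group(1), match.group(2)
    # Step 2: prefix every word in the captured scope with NEG_.
    tagged = re.sub(r"\w+", lambda word: "NEG_" + word.group(0), scope)
    # Step 3: rejoin the keyword and the tagged scope.
    return keyword + tagged


def add_neg_tags(text):
    # re.sub inserts each rebuilt piece in place of the original match.
    return NEGATION_SCOPE.sub(tag_scope, text)


print(add_neg_tags("He did not play so well, so he had to practice."))
```

    Because `\w+` is used to find words, contractions such as "don't" would be split at the apostrophe; a different word pattern may suit your texts better.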

  • 2021-01-03 12:17

    Python's re engine lacks some of Perl's abilities, but you can make up for that by passing a lambda expression to re.sub to build the replacement dynamically:

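    import re

    string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
    # Outer sub: a negation keyword, the words after it, and one closing punctuation mark.
    # Inner sub (the lambda): prefix each whitespace-preceded word in that match with NEG_.
    transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
           lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
           string,
           flags=re.IGNORECASE)
    print(transformed)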
    

    This prints:

    It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
    

    Explanation

    • The first step is to select the parts of your string you're interested in. This is done with

      \b(?:not|never|no)\b[\w\s]+[^\w\s]
      

      Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_] in ASCII mode, and for Python 3 str patterns it also matches Unicode letters and digits; \s is any kind of whitespace), up to something that is neither alphanumeric nor whitespace (acting as punctuation).

      Note that the closing punctuation is mandatory here; you can safely remove [^\w\s] so that a negated clause running to the end of the string is matched as well.

    • Now you're dealing with strings like never going to work, (keyword, words, closing punctuation). Select each word preceded by whitespace with

      (\s+)(\w+)
      

      and replace each one with the original whitespace, the NEG_ prefix, and the word:

      \1NEG_\2
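    As a small sketch of the end-of-string variant mentioned in the note above, the same two substitutions can be wrapped in a function with the trailing [^\w\s] dropped. The function name `tag_negated` is made up for this example.

```python
import re


def tag_negated(text):
    # Outer pattern: negation keyword plus the following words and spaces.
    # Without the trailing [^\w\s], the match may also stop at end of string.
    return re.sub(
        r"\b(?:not|never|no)\b[\w\s]+",
        # Inner substitution: keep the whitespace, prefix the word with NEG_.
        lambda m: re.sub(r"(\s+)(\w+)", r"\1NEG_\2", m.group(0)),
        text,
        flags=re.IGNORECASE,
    )


print(tag_negated("I will never go back"))
```

    The keyword itself is left untagged because the inner pattern only touches words preceded by whitespace inside the match, and the keyword sits at the very start of it.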
      
  • 2021-01-03 12:26

    I would not do this with a regexp. Rather, I would:

    • Split the input on punctuation characters.
    • For each fragment:
      • Set a negation counter to 0.
      • Split the fragment into words.
      • For each word:
        • Prepend NEG_ to the word as many times as the negation counter says (or counter mod 2 times, or just once if the counter is greater than 0).
        • If the original word is in {No, Never, Not}, increase the negation counter by one.
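    A minimal Python sketch of the procedure above, under a few stated assumptions: it uses the "once if the counter is greater than 0" variant, treats `.,:;!?` as the punctuation set, and compares keywords case-insensitively. The names `NEGATIONS` and `tag_by_counting` are made up for the example. It still uses `re.split` for tokenizing, but only to cut the text apart, not to do the tagging.

```python
import re

NEGATIONS = {"no", "not", "never"}


def tag_by_counting(text):
    # The capturing group makes re.split keep the punctuation runs,
    # so the pieces can be joined back into the original layout.
    fragments = re.split(r"([.,:;!?]+)", text)
    result = []
    for fragment in fragments:
        negation_count = 0  # reset at every punctuation boundary
        # Keep the whitespace tokens as well, so spacing is preserved.
        for token in re.split(r"(\s+)", fragment):
            if token == "" or token.isspace():
                result.append(token)
                continue
            # Tag with a single NEG_ once any negation has been seen.
            result.append("NEG_" + token if negation_count > 0 else token)
            if token.lower() in NEGATIONS:
                negation_count += 1
    return "".join(result)


print(tag_by_counting("He did not play so well, so he had to practice."))
```

    Keeping a counter rather than a boolean leaves room for the other variants in the list, for example emitting `"NEG_" * negation_count` or `"NEG_" * (negation_count % 2)` instead of a single prefix.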