Python regex — extraneous matchings

执念已碎 2021-01-18 07:17

I want to split a string using -, +=, ==, =, +, and white-space as delimiters. I want to keep the delimiter

5 answers
  • 2021-01-18 07:56

    This pattern is more in line with what you want:

    \s*(\-|\+\=|\=\=|\=|\+)\s*
    

    You will still get an empty string between adjacent delimiters, though, as you should expect from re.split.
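
    As a rough check of the pattern above, the sketch below runs it on the sample string used elsewhere in this thread. The variable names are illustrative only, and the pattern is written without the redundant backslashes (only + needs escaping here).

```python
import re

# Pattern from this answer: one operator with optional whitespace around it.
pattern = re.compile(r"\s*(-|\+=|==|=|\+)\s*")

parts = pattern.split("hello-+==== =+  there")
print(parts)
# ['hello', '-', '', '+=', '', '==', '', '=', '', '=', '', '+', 'there']

# Whitespace on its own is not a delimiter for this pattern.
print(pattern.split("a b"))
# ['a b']
```

    The whitespace around an operator is swallowed by the match, so no None entries appear, but a string containing only whitespace between two words is not split at all.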

  • 2021-01-18 07:58

    By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use):

    ['hello', '', '', '', '', '', '', '', 'there']
    

    Note the empty strings in between - and +=, += and ==, etc.

    As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the bits that the capture group matches are interspersed:

    ['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
    

    None corresponds to where the \s+ half of your pattern was matched; in those cases, the capture group captured nothing.
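
    The two lists above can be reproduced with a short sketch like the following. It assumes the pattern from the question is the operator group alternated with \s+, and the variable names are made up for illustration.

```python
import re

s = "hello-+==== =+  there"

# Non-capturing group: only the text between matches is returned.
no_capture = re.split(r"(?:-|\+=|==|=|\+)|\s+", s)

# Capturing group: the captured operator (or None, when the \s+ branch
# matched) is inserted after each piece of text.
with_capture = re.split(r"(-|\+=|==|=|\+)|\s+", s)

print(no_capture)
print(with_capture)
```

    The only difference between the two calls is (?:...) versus (...), which is what switches the delimiters on and off in the result.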

    From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard Nones and empty strings:

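    import re

    def tokenize(s):
        # Operators are captured (and therefore kept); whitespace runs are dropped.
        pattern = re.compile(r"(-|\+=|==|=|\+)|\s+")
        # Discard the None and empty-string entries that re.split produces.
        return [x for x in pattern.split(s) if x]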
    

    One last note: for what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options.

  • 2021-01-18 08:13

    Why is it behaving this way?

    According to the documentation for re.split:

    If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

    This is literally correct: if capturing parentheses are used, then the text of all groups are returned, whether or not they matched anything; the ones which didn't match anything return None.

    As always with split, two consecutive delimiters are considered to separate empty strings, so you get empty strings interspersed.

    How might I change it to get what I want?

    The simplest solution is to filter the output. In Python 3, filter returns a lazy iterator, so wrap it in list() if you need a list:

    list(filter(None, pattern.split(s)))
    
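
    A self-contained version of that one-liner might look like the sketch below. The pattern is assumed to be the one from the question, and tokenize_with_filter is a hypothetical name chosen for this example.

```python
import re

# Assumed pattern from the question: operators captured, whitespace not.
pattern = re.compile(r"(-|\+=|==|=|\+)|\s+")

def tokenize_with_filter(s):
    # filter(None, ...) drops every falsy item, i.e. both '' and None.
    return list(filter(None, pattern.split(s)))

tokens = tokenize_with_filter("hello-+==== =+  there")
print(tokens)
# ['hello', '-', '+=', '==', '=', '=', '+', 'there']
```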
  • 2021-01-18 08:14

    Perhaps re.findall would be more suitable for you?

    >>> re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', "hello-+==== =+  there")
    ['hello', '-', '+=', '==', '=', '=', '+', 'there']
    
  • 2021-01-18 08:21

    Try this:

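    import re

    def tokenize(s):
        # Raw string avoids invalid-escape warnings; only + needs escaping here.
        pattern = re.compile(r"(-|\+=|==|=|\+)|\s+")
        result = []
        for item in pattern.split(s):
            # Skip the empty strings and None entries that re.split returns.
            if item != '' and item is not None:
                result.append(item)
        return result

    print(tokenize("hello-+==== =+  there"))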
    