Python regex — extraneous matchings

后端未结


执念已碎

I want to split a string using -, +=, ==, =, +, and white-space as delimiters. I want to keep the delimiter


5 Answers

你的背包

2021-01-18 07:56
This pattern is more in line with what you want:
```
\s*(\-|\+\=|\=\=|\=|\+)\s*
```
You will still get an empty string between each split, though, as you should expect.
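
As a rough sketch of what that looks like in practice, here is the pattern run through re.split on the sample string used in the other answers. The names `parts` and `tokens` are made up for illustration, and the last line probes an input ("a b") that is my own assumption, not something from the question.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r"\s*(\-|\+\=|\=\=|\=|\+)\s*")

sample = "hello-+==== =+  there"  # sample string from the other answers

parts = pattern.split(sample)
# Adjacent operators leave empty strings between them; drop those.
tokens = [p for p in parts if p]

print(parts)
print(tokens)

# Assumed extra input: whitespace with no operator next to it.
print(pattern.split("a b"))
```

One thing to watch for: whitespace is only consumed when it sits next to an operator, so with this pattern "a b" comes back as a single piece rather than being split.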

故里飘歌

2021-01-18 07:58
By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use):
```
['hello', '', '', '', '', '', '', '', 'there']
```
Note the empty strings in between - and +=, += and ==, etc.

As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the bits that the capture group matches are interspersed:
```
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
```
None corresponds to where the \s+ half of your pattern was matched; in those cases, the capture group captured nothing.
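
To see the two behaviours side by side, here is a small sketch that runs both variants of the pattern on the same input. The variable names are only for illustration; the input is the sample string used elsewhere in this thread.

```python
import re

sample = "hello-+==== =+  there"

# Non-capturing group: the delimiters are dropped from the result.
without_delims = re.split(r"(?:\-|\+\=|\=\=|\=|\+)|\s+", sample)

# Capturing group: whatever the group matched is interleaved with the pieces.
# None marks a split where only the \s+ branch matched.
with_delims = re.split(r"(\-|\+\=|\=\=|\=|\+)|\s+", sample)

print(without_delims)
print(with_delims)
```

Both lists contain the same empty strings between adjacent delimiters; the capturing version simply adds one extra entry per match.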

From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard Nones and empty strings:
```
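import re

def tokenize(s):
  # The capture group keeps the operators; bare whitespace is dropped.
  pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
  # Discard the None and '' entries that re.split leaves behind.
  return [x for x in pattern.split(s) if x]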
```
One last note: for what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options.

北荒

2021-01-18 08:13
Why is it behaving this way?

According to the documentation for re.split:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

This is literally correct: if capturing parentheses are used, the text of every group is returned, whether or not it matched anything; a group that did not take part in the match comes back as None.

As always with split, two consecutive delimiters are considered to separate empty strings, so you get empty strings interspersed.
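
A stripped-down illustration of both points, using a toy pattern with two capture groups and a made-up input ("x-+y") rather than the pattern from the question:

```python
import re

# Two capture groups; only one of them can take part in any single match,
# so the other one is reported as None each time.
pieces = re.split(r"(-)|(\+)", "x-+y")

print(pieces)  # ['x', '-', None, '', None, '+', 'y']
```

The '' in the middle is the empty string between the two adjacent delimiters, and each match contributes one entry per group.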

How might I change it to get what I want?

The simplest solution is to filter the output:
```
# filter(None, ...) drops None and ''; list() is needed on Python 3.
list(filter(None, pattern.split(s)))
```

陌清茗

2021-01-18 08:14

Perhaps re.findall would be more suitable for you?

```
>>> re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', "hello-+==== =+  there")
['hello', '-', '+=', '==', '=', '=', '+', 'there']
```
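
One detail worth spelling out is that the order of the alternatives matters: the two-character operators have to be listed before their one-character prefixes. The sketch below wraps the findall call in a function and compares it with a hypothetical reordered pattern; `TOKEN`, `TOKEN_SHORT_FIRST`, and the input "a+=b == c" are names and data invented for this illustration.

```python
import re

# Same alternatives as above: two-character operators come first.
TOKEN = re.compile(r"-|\+=|==|=|\+|[^-+=\s]+")

# Hypothetical variant with '=' listed before '==' and '+' before '+=',
# to show what the ordering protects against.
TOKEN_SHORT_FIRST = re.compile(r"-|=|==|\+|\+=|[^-+=\s]+")

def tokenize(s):
    """Return operators and words in order, skipping whitespace."""
    return TOKEN.findall(s)

print(tokenize("a+=b == c"))
print(TOKEN_SHORT_FIRST.findall("a+=b == c"))
```

With the short alternatives first, '+=' and '==' are each broken into two single-character tokens, because the regex engine takes the first alternative that matches at a given position.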


暗喜

2021-01-18 08:21

Try this:

```
import re

def tokenize(s):
  pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
  result = []
  for item in pattern.split(s):
    # Skip the None and '' entries that re.split produces.
    if item is not None and item != '':
      result.append(item)
  return result

print(tokenize("hello-+==== =+  there"))
```
