Why does re.findall return a list of tuples when my pattern only contains one group?

只愿长相守 submitted on 2019-12-07 04:39:47

Question


Say I have a string s containing letters and two delimiters 1 and 2. I want to split the string in the following way:

  • if a substring t falls between 1 and 2, return t
  • otherwise, return each character

So if s = 'ab1cd2efg1hij2k', the expected output is ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k'].

I tried to use regular expressions:

import re
s = 'ab1cd2efg1hij2k'
re.findall( r'(1([a-z]+)2|[a-z])', s )

[('a', ''),
 ('b', ''),
 ('1cd2', 'cd'),
 ('e', ''),
 ('f', ''),
 ('g', ''),
 ('1hij2', 'hij'),
 ('k', '')]

From there I can do [ x[x[-1]!=''] for x in re.findall( r'(1([a-z]+)2|[a-z])', s ) ] to get my answer, but I still don't understand the output. The documentation says that findall returns a list of tuples if the pattern has more than one group. However, my pattern only contains one group. Any explanation is welcome.


Answer 1:


Your pattern has two groups. The first is the outer group:

(1([a-z]+)2|[a-z])

The second is the smaller group nested inside it:

([a-z]+)
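
To double-check the group count, a compiled pattern exposes it as the groups attribute. The sketch below also compares what findall returns for zero, one, and two groups; the variable names and the two simpler patterns are only there for illustration and are not part of the question.

```python
import re

s = 'ab1cd2efg1hij2k'

# Pattern.groups reports how many capturing groups a compiled pattern has.
nested = re.compile(r'(1([a-z]+)2|[a-z])')
print(nested.groups)  # 2: the outer group and the inner ([a-z]+)

# No groups: findall returns the full matches as strings.
no_group = re.findall(r'1[a-z]+2|[a-z]', s)

# Exactly one group: findall returns what that single group captured.
one_group = re.findall(r'1([a-z]+)2', s)

# Two or more groups: findall returns one tuple per match,
# with '' for any group that did not take part in the match.
two_groups = re.findall(r'(1([a-z]+)2|[a-z])', s)

print(no_group)
print(one_group)
print(two_groups)
```

So the tuples in the question come from the nested parentheses, not from anything unusual in the input string.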

Here is a solution that gives the expected result, although it is ugly and there is probably a better way that I can't figure out:

import re
s = 'ab1cd2efg1hij2k'
# Three capturing groups: the outer one, the delimited text, and the single letter.
a = re.findall( r'((?:1)([a-z]+)(?:2)|([a-z]))', s )
# Drop the empty strings in each tuple and keep the last non-empty capture.
a = [tuple(j for j in i if j)[-1] for i in a]

>>> print(a)
['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']



Answer 2:


Your regular expression has 2 groups; just count the opening parentheses in your pattern :). One group is ([a-z]+) and the other is (1([a-z]+)2|[a-z]). The key is that groups can be nested inside other groups. So, if possible, build a regular expression with only one group, so that you don't have to post-process the result.

An example of a regular expression with only one group:

>>> import re
>>> s = 'ab1cd2efg1hij2k'
>>> re.findall('((?<=1)[a-z]+(?=2)|[a-z])', s)
['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']
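
If you would rather not use lookarounds, another option (a different technique from the one above, offered only as a sketch) is to keep a single capturing group around the delimited text and iterate with re.finditer, falling back to the whole match when that group did not participate. The names pattern and result are illustrative.

```python
import re

s = 'ab1cd2efg1hij2k'

# Only one capturing group: the text between the delimiters 1 and 2.
pattern = re.compile(r'1([a-z]+)2|[a-z]')

# group(1) is None when the single-letter branch matched,
# so fall back to the whole match, group(0).
result = [m.group(1) or m.group(0) for m in pattern.finditer(s)]
print(result)
```

Unlike the lookaround version, this pattern consumes the delimiters as part of the match, but the text you extract is the same for the example string.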



Answer 3:


I am 5 years too late to the party, but I think I might have found an elegant solution to the ugly, tuple-ridden output that re.findall() gives with multiple capture groups.

In general, if you end up with an output which looks something like this:

[('pattern_1', '', ''), ('', 'pattern_2', ''), ('pattern_1', '', ''), ('', '', 'pattern_3')]

Then you can bring it into a flat list with this little trick:

["".join(x) for x in re.findall(all_patterns, iterable)]

The expected output will be like so:

['pattern_1', 'pattern_2', 'pattern_1', 'pattern_3']

It was tested on Python 3.7. Hope it helps!
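
A small sketch of the trick, using a made-up pattern and input (not from the question) where the groups sit in separate alternatives, so each tuple has exactly one non-empty entry. It also shows a caveat: with nested groups, like the pattern in the question, both the outer and inner captures are non-empty and get concatenated.

```python
import re

# Hypothetical example: three side-by-side groups, one per alternative.
all_patterns = r'(\d+)|([a-z]+)|([=+])'
text = 'ab=12+c'

raw = re.findall(all_patterns, text)
flat = ["".join(x) for x in raw]
print(raw)   # tuples with a single non-empty entry each
print(flat)  # flat list of strings

# Caveat: with nested groups the outer and inner captures are joined together.
s = 'ab1cd2efg1hij2k'
nested_join = ["".join(x) for x in re.findall(r'(1([a-z]+)2|[a-z])', s)]
print(nested_join)
```

So the join works when at most one group matches per tuple, which is the situation described above, but it is not a drop-in fix for the nested pattern in the question.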




Answer 4:


Look at this issue for a similar question: https://bugs.python.org/issue6663. Just drop the parentheses if you are using findall:

import re
s = 'ab1cd2efg1hij2k'
re.findall( r'(?<=1)[a-z]+(?=2)|[a-z]', s )


Source: https://stackoverflow.com/questions/24593824/why-does-re-findall-return-a-list-of-tuples-when-my-pattern-only-contains-one-gr
