I want to split a string by , ->, =>, or those wrapped with several spaces, meaning that I can get two items, she and he.
As indicated in comments already, each pair of parentheses in your regex forms a capture group, and each of those is returned by the regex split() function. As per the documentation,
"If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
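To make the documented behavior concrete, here is a minimal sketch. The sample strings and variable names are my own, not taken from the question. It also shows where a None comes from: a group that did not take part in the match is still reported.

```python
import re

# Capturing parentheses: the matched separator is returned as well.
with_group = re.split(r"(\s*->\s*)", "she -> he")

# No parentheses: only the pieces between separators are returned.
without_group = re.split(r"\s*->\s*", "she -> he")

# Two groups in an alternation: the group that did not participate
# in the match shows up as None.
two_groups = re.split(r"(->)|(=>)", "she=>he")

print(with_group)     # ['she', ' -> ', 'he']
print(without_group)  # ['she', 'he']
print(two_groups)     # ['she', None, '=>', 'he']
```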
Python regular expressions have a facility for non-capturing parentheses. You use (?: instead of just ( for the opening parenthesis to group without capturing.
>>> import re
>>> re.compile("(?<!^)((\\s*[-=]>\\s*)|[\\s+\t])(?!$\n)(?=[^\\s])").split('she he')
['she', ' ', None, 'he']
>>> re.compile("(?<!^)(?:(?:\\s*[-=]>\\s*)|[\\s+\t])(?!$\n)(?=[^\\s])").split('she he')
['she', 'he']
It's not exactly clear why you include a tab explicitly in the character class with \s; the \s already includes tab as one of the many whitespace characters it matches.
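A quick way to check this yourself, as a sketch with made-up sample strings. The last line goes slightly beyond the point above: inside a character class, + is a literal plus sign rather than a quantifier, so [\s+\t] would also split on "+".

```python
import re

# \s on its own already matches a tab (as well as space, newline, etc.).
tab_matches = re.fullmatch(r"\s", "\t") is not None

# With or without the explicit \t, the class splits on a tab the same way.
with_tab = re.split(r"[\s\t]", "she\the")
without_tab = re.split(r"\s", "she\the")

# Inside [...], + is a literal plus sign, not a quantifier.
plus_is_literal = re.split(r"[\s+]", "she+he")

print(tab_matches)      # True
print(with_tab)         # ['she', 'he']
print(without_tab)      # ['she', 'he']
print(plus_is_literal)  # ['she', 'he']
```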
It's also not clear what you expect $\n to match. $ is the end of the line and \n is a literal newline character, so you seem to be trying to deal with newlines; but $ alone already covers that. The only difference between $ and $\n is that if the end of the string (the last line in a multi-line string) is not newline-terminated, the latter will not match.
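The difference is easy to see in isolation. This is a small sketch with a made-up one-character pattern, not your original expression, run without any regex flags:

```python
import re

dollar = re.compile(r"x$")
dollar_newline = re.compile(r"x$\n")

# $ matches at the very end, and also just before a trailing newline.
plain_end = dollar.search("x") is not None
before_newline = dollar.search("x\n") is not None

# $\n additionally requires that trailing newline to be present.
needs_newline_missing = dollar_newline.search("x") is not None
needs_newline_present = dollar_newline.search("x\n") is not None

print(plain_end, before_newline)                     # True True
print(needs_newline_missing, needs_newline_present)  # False True
```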
The (?<!^) lookbehind is also peculiar; a better way to avoid matching an empty string is to make sure your regular expression always matches something.
From your requirements, it seems that
re.compile(r'\s*[-=]>\s*|\s+').split('he she')
would do what you want more succinctly and readably. This matches an ASCII arrow (single- or double-stroke) with optional whitespace on both sides, or if that fails, falls back to a sequence of whitespace.
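Here is a sketch of how that expression behaves on a few inputs. The sample strings are my guesses at the cases you care about, so adjust them to your real data. Note that leading or trailing whitespace still produces an empty first or last item, so you may want to strip() the input first.

```python
import re

splitter = re.compile(r"\s*[-=]>\s*|\s+")

# Hypothetical inputs covering plain whitespace and both arrow styles.
samples = ["she he", "she   he", "she->he", "she => he", "she  ->  he"]
results = [splitter.split(s) for s in samples]

for sample, result in zip(samples, results):
    print(repr(sample), result)  # each result is ['she', 'he']

# Leading whitespace yields an empty first item unless stripped.
unstripped = splitter.split(" she he")
stripped = splitter.split(" she he".strip())
print(unstripped)  # ['', 'she', 'he']
print(stripped)    # ['she', 'he']
```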