Why does this regex result in four items?

粉色の甜心 2021-01-24 03:44

I want to split a string on a space, ->, or =>, or on any of those surrounded by several spaces, so that I get two items, she and he.

3 answers
  •  滥情空心
    2021-01-24 04:28

    As indicated in comments already, each pair of parentheses in your regex forms a capture group, and each of those is returned by the regex split() function. As per the documentation,

    If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

    Python regular expressions have a facility for non-capturing parentheses. You use (?: instead of just ( for the opening parenthesis to group without capturing.

    >>> import re
    >>> re.compile("(?<!^)((\\s*[-=]>\\s*)|[\\s+\t])(?!$\n)(?=[^\s])").split('she he')
    ['she', ' ', None, 'he']
    >>> re.compile("(?<!^)(?:(?:\\s*[-=]>\\s*)|[\\s+\t])(?!$\n)(?=[^\s])").split('she he')
    ['she', 'he']
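
    The same behaviour can be seen with a much smaller pattern. The following is a minimal sketch; the sample string and the variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample input, chosen only for illustration.
text = "a,b;c"

# One capturing group: each matched separator is returned between the pieces.
with_capture = re.split(r"(,|;)", text)

# Non-capturing group: only the pieces are returned.
without_capture = re.split(r"(?:,|;)", text)

# Two groups, only one of which takes part in each match:
# the group that did not participate is returned as None.
with_unused_group = re.split(r"(,)|(;)", text)

print(with_capture)       # ['a', ',', 'b', ';', 'c']
print(without_capture)    # ['a', 'b', 'c']
print(with_unused_group)  # ['a', ',', None, 'b', None, ';', 'c']
```

    The None entries in the last result have the same cause as the None in the four-item output: a group that sits in an alternative that was not taken.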
    

    It's not exactly clear why you include a tab explicitly in the character class with \s; the \s already includes tab as one of the many whitespace characters it matches.
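
    A quick check of what that character class actually accepts, as a small sketch with illustrative variable names. It also shows a related point not raised above: inside square brackets, + is a literal plus sign rather than a quantifier.

```python
import re

# \s on its own already matches a tab (and other whitespace such as newline).
tab_is_whitespace = re.fullmatch(r"\s", "\t") is not None

# Inside [...], "+" is not a quantifier: the class matches ONE character
# that is either whitespace or a literal plus sign.
plus_is_literal = re.fullmatch(r"[\s+\t]", "+") is not None
two_spaces_match = re.fullmatch(r"[\s+\t]", "  ") is not None

print(tab_is_whitespace)  # True
print(plus_is_literal)    # True
print(two_spaces_match)   # False
```

    So [\s+\t] is equivalent to [\s+], and matching a run of whitespace needs the quantifier outside the brackets, as in \s+.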

    It's also not clear what you expect $\n to match. $ is the end of the line and \n is a literal newline character, so you seem to be trying to deal with newlines; but $ alone already covers that. The only difference between $ and $\n is that the latter will not match if the end of the string (the last line in a multi-line string) is not newline-terminated.
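
    The difference can be checked directly. This is a small sketch using made-up sample strings, without the re.MULTILINE flag:

```python
import re

terminated = "abc\n"    # ends with a newline
unterminated = "abc"    # does not

# Without re.MULTILINE, $ matches at the very end of the string
# and also just before a single trailing newline.
dollar_unterminated = re.search(r"$", unterminated).start()
dollar_terminated = re.search(r"$", terminated).start()

# $\n additionally requires an actual newline character to follow.
dollar_nl_terminated = re.search(r"$\n", terminated) is not None
dollar_nl_unterminated = re.search(r"$\n", unterminated) is not None

print(dollar_unterminated, dollar_terminated)        # 3 3
print(dollar_nl_terminated, dollar_nl_unterminated)  # True False
```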

    The (?<!^) lookbehind is also peculiar; a better way to avoid matching an empty string is to make sure your regular expression always matches something.

    From your requirements, it seems that

    re.compile(r'\s*[-=]>\s*|\s+').split('he she')
    

    would do what you want more succinctly and readably. This matches an ASCII arrow (single- or double-stroke) with optional whitespace on both sides, or if that fails, falls back to a sequence of whitespace.
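
    As a rough check, the pattern can be run against a few separator variants. The sample inputs and the names SEPARATOR and results are assumptions made for this sketch, not taken from the question.

```python
import re

# The simplified separator pattern: an arrow with optional surrounding
# whitespace, or else a run of whitespace.
SEPARATOR = re.compile(r"\s*[-=]>\s*|\s+")

# Hypothetical inputs covering each kind of separator.
samples = ["she he", "she   he", "she->he", "she=>he", "she  ->  he", "she => he"]

results = {s: SEPARATOR.split(s) for s in samples}
for sample, parts in results.items():
    print(repr(sample), parts)  # every sample yields ['she', 'he']

# Leading or trailing whitespace produces empty strings at the ends,
# so it may be worth stripping the input first.
padded = SEPARATOR.split(" she he ")
stripped = SEPARATOR.split(" she he ".strip())
print(padded)    # ['', 'she', 'he', '']
print(stripped)  # ['she', 'he']
```

    Because the pattern contains no capturing groups, only the pieces between separators are returned.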
