Why does this regex result in four items?

粉色の甜心 2021-01-24 03:44

I want to split a string on whitespace, ->, or =>, where the arrows may themselves be wrapped in several spaces, so that I get two items, she and he.

3 answers
  • 2021-01-24 04:22

    If you can simply strip your input string first, then from your description all you need is to split the words on \s+, \s*->\s*, or \s*=>\s*.

    So here is my solution:

    import re

    p = re.compile(r'\s*[-=]>\s*|\s+')
    input1 = "she he"
    input2 = " she  -> he \n".strip()  # strip first so no empty strings appear at the ends

    print(p.split(input1))
    print(p.split(input2))
    

    For both inputs, the output is just 'she' and 'he':

    ['she', 'he']
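
    To see why the strip() call matters, here is a minimal sketch comparing the same pattern with and without it. The variable names (raw, unstripped, stripped) are made up for illustration.

```python
import re

# Same pattern as in the answer above
p = re.compile(r'\s*[-=]>\s*|\s+')

raw = " she  -> he \n"

# Without strip(), the leading and trailing whitespace is itself a separator,
# so empty strings appear at both ends of the result.
unstripped = p.split(raw)

# Stripping first removes those edge separators.
stripped = p.split(raw.strip())

print(unstripped)  # ['', 'she', 'he', '']
print(stripped)    # ['she', 'he']
```

    If stripping is not an option, filtering out the empty strings afterwards gives the same result.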
    
  • 2021-01-24 04:28

    As indicated in comments already, each pair of parentheses in your regex forms a capture group, and each of those is returned by the regex split() function. As per the documentation,

    If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
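
    A small sketch of that documented behaviour, using simplified patterns rather than the one from the question. It also shows where the None comes from: a group that did not take part in a particular match is still reported.

```python
import re

# One capturing group: the separator text is returned between the pieces.
one_group = re.split(r'(\s+)', 'she he')
print(one_group)   # ['she', ' ', 'he']

# Two groups, only one of which can take part in any given match:
# the group that did not participate is reported as None.
two_groups = re.split(r'(->)|(\s+)', 'she he')
print(two_groups)  # ['she', None, ' ', 'he']
```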

    Python regular expressions have a facility for non-capturing parentheses. You use (?: instead of just ( for the opening parenthesis to group without capturing.

    >>> re.compile("(?<!^)((\\s*[-=]>\\s*)|[\\s+\t])(?!$\n)(?=[^\s])").split('she he')
    ['she', ' ', None, 'he']
    >>> re.compile("(?<!^)(?:(?:\\s*[-=]>\\s*)|[\\s+\t])(?!$\n)(?=[^\s])").split('she he')
    ['she', 'he']
    

    It's not exactly clear why you include a tab explicitly in the character class with \s; the \s already includes tab as one of the many whitespace characters it matches.
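
    A quick check of what that character class actually matches. Beyond the redundant tab, note that a + inside square brackets is a literal plus sign, not a quantifier, so the class matches exactly one character. The variable names are illustrative only.

```python
import re

# \s on its own already matches a tab (as well as space, newline, etc.)
tab_matches = re.fullmatch(r'\s', '\t') is not None

# Inside [...], '+' is not a quantifier; it matches a literal plus sign
plus_is_literal = re.fullmatch(r'[\s+]', '+') is not None

# So [\s+\t] matches exactly one character, never a run of two spaces
two_spaces = re.fullmatch(r'[\s+\t]', '  ') is not None

print(tab_matches, plus_is_literal, two_spaces)  # True True False
```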

    It's also not clear what you expect $\n to match. $ matches at the end of the string (or just before a trailing newline, and at the end of every line in multiline mode), and \n is a literal newline character, so you seem to be trying to deal with newlines; but $ alone already covers that. The only difference between $ and $\n is that the latter fails to match when the end of the string (the last line of a multi-line string) is not newline-terminated.
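
    The difference can be checked directly. This is a sketch with made-up test strings, using the default flags (no re.MULTILINE).

```python
import re

# $ matches at the very end, or just before a final newline
end_plain = bool(re.search(r'he$', 'she'))
end_newline = bool(re.search(r'he$', 'she\n'))

# $\n additionally requires a real newline character to follow
nl_present = bool(re.search(r'he$\n', 'she\n'))
nl_missing = bool(re.search(r'he$\n', 'she'))

print(end_plain, end_newline)  # True True
print(nl_present, nl_missing)  # True False
```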

    The (?<!^) is also peculiar -- a better way to avoid matching an empty string is to make sure your regular expression always matches something.

    From your requirements, it seems that

    re.compile(r'\s*[-=]>\s*|\s+').split('he she')
    

    would do what you want more succinctly and readably. This matches an ASCII arrow (single- or double-stroke) with optional whitespace on both sides, or if that fails, falls back to a sequence of whitespace.
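
    As a rough check, here is that pattern run over a few sample inputs (the samples are invented for illustration). The last lines sketch why the arrow alternative is listed first: with the order swapped, the whitespace before the arrow is consumed on its own and an empty string is left between the two separators.

```python
import re

splitter = re.compile(r'\s*[-=]>\s*|\s+')

samples = ['he she', 'she -> he', 'she=>he', 'she   =>   he', 'she\the']
results = [splitter.split(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), r)

# Swapping the alternatives changes the result for arrow-separated input
swapped = re.split(r'\s+|\s*[-=]>\s*', 'she -> he')
print(swapped)  # ['she', '', 'he']
```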

  • 2021-01-24 04:30

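    Each time you use parentheses "()" you create a capturing group. A capturing group records the part of the match that its sub-pattern consumed, while the match itself always covers the whole regex. split() returns the text of every capturing group along with the split pieces; your pattern has two groups, so you get two extra items, 4 results in total.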

    Documentation says: "If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."

    You could try making the groups "non-capturing" as Rawing suggested. Do this by simply prepending "?:" inside the parentheses you do not want to be captured.

    I would just leave out the parentheses altogether:

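    import re

    # Use \s+ (not \s*) as the fallback: since Python 3.7, split() also splits on empty matches
    res = re.compile(r"\s*[-=]>\s*|\s+").split(' she  -> he \n')
    res = filter(None, res)  # drop the empty strings left at both ends
    res = list(res)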
    

    Output:

    ['she', 'he']
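
    One caveat worth checking on current Python versions: a fallback alternative that can match the empty string, such as \s*, behaves differently from \s+. Since Python 3.7, split() also splits on empty matches, so the words themselves get cut apart. A minimal sketch, assuming Python 3.7 or later:

```python
import re

text = 'she he'

# \s* can match the empty string, so on Python 3.7+ split() also cuts
# at the empty matches between individual characters.
with_star = list(filter(None, re.split(r'\s*', text)))

# \s+ needs at least one whitespace character, so the words stay intact.
with_plus = list(filter(None, re.split(r'\s+', text)))

print(with_star)
print(with_plus)  # ['she', 'he']
```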
    