Splitting on spaces, except between certain characters

既然无缘 2020-12-06 13:37

I am parsing a file that has lines such as

type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")

And I want to split

4 answers
  • 2020-12-06 14:12

    Let me add a non-regex solution:

    line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'
    
    count = 0 # Bracket counter
    last_break = 0 # Index of the last break
    parts = []
    for j, char in enumerate(line):
        if char == '(':
            count += 1
        elif char == ')':
            count -= 1
        elif char == ' ' and count == 0:  # Split only on spaces outside parentheses
            parts.append(line[last_break:j])
            last_break = j + 1
    parts.append(line[last_break:])  # Add last element
    parts = tuple(p for p in parts if p)  # Convert to tuple and remove empty strings
    
    for p in parts:
        print(p)
    

    In general, there are things regular expressions cannot do, and some constructs (especially lookahead and lookbehind) can carry serious performance penalties, so a regex is not always the best solution to a given problem.

    Also, I thought I'd mention the pyparsing module, which can be used to create custom text parsers.
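To illustrate the point about regex limits, here is a minimal sketch that wraps the bracket counter above in a function and runs it on an input with nested parentheses. The function name `split_top_level` and the `nested` sample string are hypothetical and do not come from the question; the regex used for comparison is the lookahead pattern posted elsewhere in this thread.

```python
import re


def split_top_level(line):
    """Split on spaces that are not inside any (possibly nested) parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(line):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ' ' and depth == 0:
            if i > start:  # skip empty pieces caused by repeated spaces
                parts.append(line[start:i])
            start = i + 1
    if start < len(line):  # add the last piece, if any
        parts.append(line[start:])
    return parts


# Hypothetical input with nested parentheses (not from the question).
nested = 'a(b (c d) e) f(g)'

print(split_top_level(nested))
# ['a(b (c d) e)', 'f(g)']

# The lookahead regex cannot track nesting depth, so it also splits
# at the space that precedes the inner opening parenthesis.
print(re.split(r"\s+(?=[^()]*(?:\(|$))", nested))
# ['a(b', '(c d) e)', 'f(g)']
```

For the flat input in the question both approaches give the same four fields; the difference only shows up once parentheses can nest.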

  • 2020-12-06 14:25

    This regex should work for you: \s+(?=[^()]*(?:\(|$))

    import re
    result = re.split(r"\s+(?=[^()]*(?:\(|$))", subject)  # subject is the line to split
    

    Explanation

    r"""
    \s             # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
       +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    (?=            # Assert that the regex below can be matched, starting at this position (positive lookahead)
       [^()]          # Match a single character NOT present in the list “()”
          *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       (?:              # Match the regular expression below
                         # Match either the regular expression below (attempting the next alternative only if this one fails)
             \(             # Match the character “(” literally
          |              # Or match regular expression number 2 below (the entire group fails if this one fails to match)
             $              # Assert position at the end of the string (or just before a trailing line break)
       )
    )
    """
    
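The annotated pattern above can also be compiled as-is with `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. The sketch below is a condensed version of that explanation applied to the line from the question; the names `SPLIT_RE` and `fields` are illustrative only.

```python
import re

# Same pattern as above, written in verbose mode so the comments
# can live inside the regex itself.
SPLIT_RE = re.compile(r"""
    \s+                # one or more whitespace characters
    (?=                # lookahead: must be followed by ...
        [^()]*         #   any run of non-parenthesis characters
        (?: \( | $ )   #   and then an opening parenthesis or the end of the string
    )
""", re.VERBOSE)

line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

fields = SPLIT_RE.split(line)
for field in fields:
    print(field)
# type("book")
# title("golden apples")
# pages(10-35 70 200-234)
# comments("good read")
```

A space inside parentheses is always followed by a `)` before the next `(`, so the lookahead fails there and no split happens. This relies on the parentheses being balanced and not nested.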
  • 2020-12-06 14:25

    I would try using a positive look-behind assertion.

    r'(?<=\))\s+'
    

    Example:

    >>> import re
    >>> result = re.split(r'(?<=\))\s+', 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")')
    >>> result
    ['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']
    
  • 2020-12-06 14:29

    Split on ") " and add a ) back to each element except the last.
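A minimal sketch of that suggestion, using the sample line from the question. The variable names are illustrative, and it assumes every field ends with `)` and fields are separated by exactly one space.

```python
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# Splitting on ") " consumes the closing parenthesis of every field but the last.
pieces = line.split(') ')

# Put the ")" back on every piece except the last, which still has its own.
fields = [piece + ')' for piece in pieces[:-1]] + [pieces[-1]]

print(fields)
# ['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']
```

Note that this would also split on a `) ` that appears inside a quoted value, so it only suits input where that cannot occur.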
