Split a string by spaces — preserving quoted substrings — in Python

心在旅途 2020-11-22 15:05

I have a string which is like this:

this is "a test"

I'm trying to write something in Python to split it up by space while ignoring spaces inside quotes.

16 answers
  • 2020-11-22 15:32

    I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?:

    import re

    test = 'this is "a test"'  # or "this is 'a test'"
    # First attempt (greedy, and lets a ' close a "):
    # pieces = [p for p in re.split(r"""( |["'].*["'])""", test) if p.strip()]
    # From comments, use this:
    pieces = [p for p in re.split(r"""( |".*?"|'.*?')""", test) if p.strip()]
    

    Explanation:

    ["'] = double-quote or single-quote
    .* = anything (greedy); .*? = anything, but as little as possible
    ( |X) = space or X; the capturing group makes re.split keep what it matched
    if p.strip() = drop the space and empty-string pieces
    

    shlex probably provides more features, though.
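
    To see why the `if p.strip()` filter is needed, it may help to look at what re.split returns before filtering. This is only a small sketch on the same test string; the names raw and pieces are chosen here for illustration.

```python
import re

test = 'this is "a test"'

# Because the pattern is wrapped in a capturing group, re.split keeps the
# matched separators (spaces and quoted chunks) in its result.
raw = re.split(r"""( |".*?"|'.*?')""", test)
print(raw)

# Dropping the space and empty-string entries leaves the tokens,
# with the quotes still attached to the quoted chunk.
pieces = [p for p in raw if p.strip()]
print(pieces)
```

    The quoted chunk survives only because it is itself one of the captured "separators"; without the parentheses it would be discarded along with the spaces.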

  • 2020-11-22 15:33

    Have a look at the shlex module, particularly shlex.split.

    >>> import shlex
    >>> shlex.split('This is "a test"')
    ['This', 'is', 'a test']
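
    If the quotes should stay on the tokens, shlex.split also takes a posix argument. The following is a short sketch of both modes, plus what happens with an unbalanced quote; the sample strings are made up for illustration.

```python
import shlex

line = 'this is "a test"'

# Default (POSIX) mode strips the quotes from quoted tokens.
print(shlex.split(line))

# Non-POSIX mode keeps the quote characters as part of the token.
print(shlex.split(line, posix=False))

# Unbalanced quotes are reported with a ValueError rather than guessed at.
try:
    shlex.split('this is "a test')
except ValueError as exc:
    print("could not split:", exc)
```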
    
  • 2020-11-22 15:35

    To preserve the quotes, use this function:

    def getArgs(s):
        args = []
        cur = ''
        inQuotes = False
        for char in s.strip():
            if char == ' ' and not inQuotes:
                # A space outside quotes ends the current token;
                # skip the empty tokens that repeated spaces would produce.
                if cur:
                    args.append(cur)
                cur = ''
            else:
                if char == '"':
                    # Toggle the quote state; the quote itself is kept.
                    inQuotes = not inQuotes
                cur += char
        if cur:
            args.append(cur)
        return args
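
    The same character-by-character idea can be extended to single quotes as well. What follows is a hedged sketch, not part of the answer above: split_keep_quotes and its quote_chars parameter are hypothetical names, and it splits on any whitespace rather than only on spaces.

```python
def split_keep_quotes(s, quote_chars="\"'"):
    """Split on whitespace outside quotes; quote characters stay in the tokens."""
    tokens = []
    current = []
    open_quote = None  # the quote character we are currently inside, if any
    for ch in s:
        if open_quote is None and ch.isspace():
            # Whitespace outside quotes ends the current token (if any).
            if current:
                tokens.append("".join(current))
                current = []
            continue
        if open_quote is None and ch in quote_chars:
            open_quote = ch      # entering a quoted run
        elif ch == open_quote:
            open_quote = None    # the matching quote closes the run
        current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(split_keep_quotes('this is "a test"'))
print(split_keep_quotes('say \'a  b\' "it\'s ok"'))
```

    Tracking which quote character opened the run lets an apostrophe appear inside a double-quoted chunk without ending it.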
    
  • 2020-11-22 15:38

    Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

      [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
    

    This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

    If the escapes left in the returned strings are not wanted, you can use this slightly altered version. It is Python 2 only, because str.decode('string_escape') does not exist in Python 3:

    [i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
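
    The 'string_escape' codec is a Python 2 feature. On Python 3, one possible substitute is a small regex substitution that removes the backslash in front of escaped quotes and backslashes. This is a sketch under that assumption; split_unescaped and TOKEN_RE are hypothetical names, and other escapes such as \n are deliberately left alone, so it is not a full replacement for the codec.

```python
import re

# Same splitting pattern as in the one-liner above.
TOKEN_RE = re.compile(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')')


def split_unescaped(text):
    """Split like the one-liner, then drop the backslash before escaped quotes."""
    pieces = [p.strip('"').strip("'") for p in TOKEN_RE.split(text) if p.strip()]
    # Only backslash-quote and backslash-backslash are un-escaped here.
    return [re.sub(r'\\(["\'\\])', r'\1', p) for p in pieces]


print(split_unescaped('This is " a \\"test\\"\\\'s substring"'))
```

    Note that whitespace just inside the quotes is kept, so the last token here starts with a space.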
    
  • 2020-11-22 15:41

    Speed test of different answers:

    import re
    import shlex
    import csv
    
    line = 'this is "a test"'
    
    %timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
    100000 loops, best of 3: 5.17 µs per loop
    
    %timeit re.findall(r'[^"\s]\S*|".+?"', line)
    100000 loops, best of 3: 2.88 µs per loop
    
    %timeit list(csv.reader([line], delimiter=" "))
    The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 2.4 µs per loop
    
    %timeit shlex.split(line)
    10000 loops, best of 3: 50.2 µs per loop
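
    The %timeit lines above need IPython. A rough equivalent with the standard timeit module could look like the sketch below; the candidates dict and the repeat counts are choices made here, and the numbers it prints will differ from machine to machine.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# The four approaches timed above, wrapped as zero-argument callables.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

for name, func in candidates.items():
    # Best of 3 repeats of 2,000 calls each, reported per call.
    best = min(timeit.repeat(func, number=2_000, repeat=3)) / 2_000
    print(f"{name:12s} {best * 1e6:7.2f} µs per call  ->  {func()!r}")
```

    Printing each result next to its timing also shows that the approaches are not interchangeable: the two regex versions keep the quotes, shlex.split strips them, and csv.reader returns a list of rows rather than a flat list.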
    
  • 2020-11-22 15:42

    Judging by the timings, re is the faster option. Here is my solution, which uses a non-greedy quantifier and preserves the outer quotes:

    re.findall(r'(?:".*?"|\S)+', s)
    

    Result:

    ['this', 'is', '"a test"']
    

    It keeps constructs like aaa"bla blub"bbb together as a single token, because nothing in them is separated by a space outside the quotes. If the string contains escaped quotes, you can match like this:

    >>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
    >>> a
    'She said "He said, \\"My name is Mark.\\""'
    >>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
    ...
    She
    said
    "He said, \"My name is Mark.\""
    

    Please note that the empty string "" is still matched, but only through the \S part of the pattern, since the quoted part requires at least one character between the quotes.
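
    A variant of the pattern that treats a backslash plus the following character as one unit may cope better with quoted strings that end in an escaped backslash. This is a sketch, not the answer's own pattern; TOKEN_RE and the sample string b are made up for illustration.

```python
import re

# Quoted branch: a quote, then any mix of escaped characters (backslash plus
# one character) and characters that are neither a quote nor a backslash,
# then the closing quote. Anything else is consumed as non-space characters.
TOKEN_RE = re.compile(r'(?:"(?:\\.|[^"\\])*"|\S)+')

a = 'She said "He said, \\"My name is Mark.\\""'
print(TOKEN_RE.findall(a))

# A quoted string ending in an escaped backslash, and an empty quoted string.
b = r'"a b\\" "" c'
print(TOKEN_RE.findall(b))
```

    With this variant the empty string "" is matched by the quoted branch itself rather than by \S.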
