Split a string by spaces — preserving quoted substrings — in Python

心在旅途 2020-11-22 15:05

I have a string which is like this:

this is "a test"

I'm trying to write something in Python to split it up by space while ignoring spaces within quotes.

16 answers
  • 2020-11-22 15:46

    Depending on your use case, you may also want to check out the csv module:

    import csv
    lines = ['this is "a string"', 'and more "stuff"']
    for row in csv.reader(lines, delimiter=" "):
        print(row)
    

    Output:

    ['this', 'is', 'a string']
    ['and', 'more', 'stuff']
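
    For a single string rather than a list of lines, the same idea can be wrapped in a small helper. This is only a sketch: the name split_quoted and the keep_empty flag are made up for illustration. Note that with delimiter=" " each space is a separator, so runs of spaces produce empty fields, which the helper drops by default.

```python
import csv

def split_quoted(line, keep_empty=False):
    """Split one string on spaces, treating "..." as a single field.

    split_quoted and keep_empty are hypothetical names, not part of csv.
    """
    # csv.reader expects an iterable of lines, so wrap the string in a list.
    row = next(csv.reader([line], delimiter=" "))
    # Consecutive or trailing spaces yield empty fields; drop them unless asked.
    return row if keep_empty else [field for field in row if field]

print(split_quoted('this is "a test"'))  # ['this', 'is', 'a test']
```

    Dropping empty fields also discards a literal "" token, so pass keep_empty=True if those matter.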
    
  • 2020-11-22 15:47

    Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

    Both versions do the same thing, but splitter is a bit more readable than splitter2.

    import re
    
    s = 'this is "a test" some text "another test"'
    
    def splitter(s):
        def replacer(m):
            return m.group(0).replace(" ", "\x00")
        parts = re.sub('".+?"', replacer, s).split()
        parts = [p.replace("\x00", " ") for p in parts]
        return parts
    
    def splitter2(s):
        return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]
    
    print(splitter2(s))
    
  • 2020-11-22 15:48

    I suggest:

    test string:

    s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
    

    to also capture empty "" and '':

    import re
    re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)
    

    result:

    ['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]
    

    to ignore empty "" and '':

    import re
    re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)
    

    result:

    ['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
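
    The patterns above keep the surrounding quote characters in each token. If the quotes should be dropped, as in the question, one possible follow-up is to strip a matching pair from each match. This is a sketch only: tokens_without_quotes and TOKEN_RE are illustrative names, and the pattern is the first one from this answer.

```python
import re

# Same pattern as the "capture also empty quotes" variant above.
TOKEN_RE = re.compile(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+')

def tokens_without_quotes(s):
    """Tokenize s and remove one matching pair of outer quotes per token."""
    out = []
    for tok in TOKEN_RE.findall(s):
        # Strip only when the token starts and ends with the same quote char.
        if len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in "\"'":
            tok = tok[1:-1]
        out.append(tok)
    return out

print(tokens_without_quotes('this is "a test"'))  # ['this', 'is', 'a test']
```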
    
  • 2020-11-22 15:49

    If you don't care about quoted substrings, then a simple split() is enough:

    >>> 'a short sized string with spaces '.split()
    

    Performance:

    >>> import timeit
    >>> s = "('a short sized string with spaces '*100).split()"
    >>> t = timeit.Timer(stmt=s)
    >>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
    171.39 usec/pass
    

    Or the string module (Python 2 only; string.split was removed in Python 3):

    >>> from string import split as stringsplit
    >>> stringsplit('a short sized string with spaces '*100)
    

    Performance: in this measurement the string module function seems to perform slightly better than the string method:

    >>> s = "stringsplit('a short sized string with spaces '*100)"
    >>> t = timeit.Timer(s, "from string import split as stringsplit")
    >>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
    154.88 usec/pass
    

    Or you can use the re engine:

    >>> from re import split as resplit
    >>> regex = r'\s+'
    >>> medstring = 'a short sized string with spaces '*100
    >>> resplit(regex, medstring)
    

    Performance:

    >>> s = "resplit(regex, medstring)"
    >>> t = timeit.Timer(s, r"from re import split as resplit; regex=r'\s+'; medstring='a short sized string with spaces '*100")
    >>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
    540.21 usec/pass
    

    For very long strings, you should not load the entire string into memory; instead, either split it line by line or use an iterative loop.
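
    A minimal sketch of the line-by-line idea, assuming the input is available as an iterable of lines such as an open file. The generator name iter_words is made up here, and io.StringIO only stands in for a real file.

```python
import io

def iter_words(lines):
    """Yield whitespace-separated words one at a time from an iterable of lines."""
    for line in lines:
        # Only one line is split at a time, so the full word list is never built.
        for word in line.split():
            yield word

# io.StringIO stands in for a large file opened with open(...).
stream = io.StringIO("a short sized\nstring with spaces\n")
print(list(iter_words(stream)))
# ['a', 'short', 'sized', 'string', 'with', 'spaces']
```

    This handles plain whitespace splitting only; a quoted substring that spans several lines would need extra state carried between lines.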
