I have a string which is like this:
this is "a test"
I'm trying to write something in Python to split it up by space while ignoring spaces within quotes.
Depending on your use case, you may also want to check out the csv module:
import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)
Output:
['this', 'is', 'a string']
['and', 'more', 'stuff']
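For the single string from the question, the same idea can be sketched as below. The helper name split_with_csv is made up for illustration; it relies only on the fact that csv.reader accepts any iterable of lines, so a one-element list works.

```python
import csv

def split_with_csv(text):
    """Split one string on spaces, keeping double-quoted parts together."""
    # csv.reader expects an iterable of lines, so wrap the string in a list.
    reader = csv.reader([text], delimiter=" ")
    return next(reader)

print(split_with_csv('this is "a test"'))  # ['this', 'is', 'a test']
```

One thing to watch for: with this setup, two consecutive spaces produce an empty field (for example 'a  b' gives ['a', '', 'b']), so it behaves differently from str.split() on irregular whitespace.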
Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces inside the quoted parts with \x00, then split on whitespace, then turn each \x00 back into a space in every part. Note that the quotes themselves are kept in the resulting parts.
Both versions do the same thing, but splitter is a bit more readable than splitter2.
import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        # Hide the spaces inside a quoted part so split() leaves it intact.
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    # Restore the hidden spaces.
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
I suggest:
test string:
s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
to also capture empty "" and '':
import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)
result:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]
to ignore empty "" and '':
import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)
result:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
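The patterns above keep the surrounding quotes in each match. If you want the bare text instead, as in the question, a possible follow-up step is to strip one matching pair of quotes from each token. This is only a sketch; split_unquoted and TOKEN are names invented here, and the pattern is the "capture empty quotes" variant.

```python
import re

# Same pattern as the variant that also captures "" and ''.
TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+')

def split_unquoted(text):
    """Tokenize text and drop the quotes around quoted tokens."""
    tokens = []
    for tok in TOKEN.findall(text):
        # A quoted token starts and ends with the same quote character.
        if len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in "\"'":
            tok = tok[1:-1]
        tokens.append(tok)
    return tokens

print(split_unquoted('this is "a test"'))  # ['this', 'is', 'a test']
```

Be aware that an unmatched quote is silently skipped by the pattern, so "it's" comes back as ['it', 's'].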
If you don't care about quoted substrings, then a simple split() is enough:
>>> 'a short sized string with spaces '.split()
Performance:
>>> import timeit
>>> s = "('a short sized string with spaces '*100).split()"
>>> t = timeit.Timer(stmt=s)
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
171.39 usec/pass
Or use the string module (Python 2 only; string.split was removed in Python 3):
>>> from string import split as stringsplit
>>> stringsplit('a short sized string with spaces '*100)
Performance: in this measurement the string module function came out slightly faster than the string method (154.88 vs. 171.39 usec/pass):
>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
154.88 usec/pass
Or you can use the regex engine:
>>> from re import split as resplit
>>> regex = r'\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)
Performance:
>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, r"from re import split as resplit; regex=r'\s+'; medstring='a short sized string with spaces '*100")
>>> print("%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000))
540.21 usec/pass
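The timings above come from a Python 2 session. To repeat the comparison on Python 3, where string.split no longer exists, something along these lines should work. The helper time_usec and the candidates dict are names made up for this sketch, and the numbers will differ from machine to machine.

```python
import re
import timeit

text = "a short sized string with spaces " * 100
whitespace = re.compile(r"\s+")

def time_usec(func, number=2000):
    """Return the average time per call in microseconds."""
    return 1_000_000 * timeit.timeit(func, number=number) / number

candidates = {
    "str.split": lambda: text.split(),
    "re.split": lambda: whitespace.split(text),
}
for name, func in candidates.items():
    print(f"{name}: {time_usec(func):.2f} usec/pass")
```

The two calls are not exactly equivalent: because the test string ends with a space, re.split returns a trailing empty string, while str.split() without arguments drops it.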
For very long inputs, you should not load the entire string into memory; instead, either process it line by line or iterate over the pieces one at a time.
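A rough sketch of both options for plain whitespace splitting, under the assumption that the data arrives either as an iterable of lines (such as an open file) or as one big string. The function names are invented for illustration, and io.StringIO only stands in for a real file.

```python
import io
import re

WORD = re.compile(r"\S+")

def iter_words(text):
    """Lazily yield whitespace-separated words from a string."""
    # finditer produces matches one by one instead of building a full list.
    for match in WORD.finditer(text):
        yield match.group(0)

def iter_words_from_lines(lines):
    """Yield words from any iterable of lines, e.g. an open file object."""
    for line in lines:
        yield from line.split()

# io.StringIO stands in for a large file opened with open(...).
fake_file = io.StringIO("first line of text\nsecond line\n")
print(list(iter_words_from_lines(fake_file)))
print(next(iter_words("a short sized string with spaces " * 100)))
```

Iterating over a file object reads one line at a time, so only the current line has to fit in memory; iter_words helps when the string is already loaded but you want to avoid building the full list of parts.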