I have a string which is like this:
this is "a test"
I'm trying to write something in Python to split it up by space while ignoring spaces within quotes.
I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?:
test = 'this is "a test"' # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
Explanation:
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
shlex probably provides more features, though.
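As a quick check of the pattern above, here is a minimal self-contained sketch. The helper name split_keep_quotes is only an illustration and is not part of the answer; the pattern is the same alternation, written as a raw string so the quotes need no backslashes.

```python
import re

# Same alternation as above: a space, a double-quoted run, or a single-quoted run.
TOKEN_SEP = re.compile(r"""( |".*?"|'.*?')""")

def split_keep_quotes(text):
    # Hypothetical helper name. Because the pattern is a capture group,
    # re.split also returns the matched separators, so quoted runs survive.
    # The filter then drops the bare spaces and empty strings.
    return [p for p in TOKEN_SEP.split(text) if p.strip()]

print(split_keep_quotes('this is "a test"'))  # ['this', 'is', '"a test"']
print(split_keep_quotes("this is 'a test'"))  # ['this', 'is', "'a test'"]
```

Note that the quotes stay attached to the quoted pieces with this approach.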
Have a look at the shlex module, particularly shlex.split.
>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
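For comparison, a small sketch of the two modes of shlex.split. The default POSIX mode strips the quotes; passing posix=False keeps them on the token. The variable name line is just for illustration.

```python
import shlex

line = 'This is "a test"'

# Default (POSIX) mode: quotes group the words and are then removed.
print(shlex.split(line))               # ['This', 'is', 'a test']

# Non-POSIX mode: quotes still group the words but stay in the token.
print(shlex.split(line, posix=False))  # ['This', 'is', '"a test"']
```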
To preserve quotes use this function:
def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
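One thing to watch with the function above: consecutive spaces outside quotes produce empty strings in the result. A possible variant that skips them is sketched below. The name get_args and the empty-token check are my own additions, not part of the original answer.

```python
def get_args(s):
    # Hypothetical variant of getArgs: same character scan, but empty
    # tokens (from repeated spaces or an empty input) are not appended.
    args = []
    cur = ''
    in_quotes = False
    for char in s.strip():
        if char == ' ' and not in_quotes:
            if cur:
                args.append(cur)
            cur = ''
        else:
            if char == '"':
                # Toggle quote state; the quote itself is kept in the token.
                in_quotes = not in_quotes
            cur += char
    if cur:
        args.append(cur)
    return args

print(get_args('this is "a test"'))     # ['this', 'is', '"a test"']
print(get_args('this  is "a  test"'))   # ['this', 'is', '"a  test"']
```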
Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:
[i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
This works on strings like 'This is " a \\\"test\\\"\\\'s substring"'
(the insane markup is unfortunately necessary to keep Python from removing the escapes).
If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version (Python 2 only, because it relies on str.decode('string_escape')):
[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
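On Python 3, str has no decode method, so the line above will not run there. One possible substitute, sketched below, is to drop the backslash in front of each escaped character with re.sub. This is an assumption on my part rather than an exact equivalent of string_escape: it only removes the backslash and does not interpret sequences such as \n. The names TOKEN and split_unescaped are illustrative.

```python
import re

# Same splitting pattern as in the answer above.
TOKEN = re.compile(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')')

def split_unescaped(text):
    # Split, drop whitespace-only pieces, strip the outer quotes.
    pieces = [p.strip('"').strip("'") for p in TOKEN.split(text) if p.strip()]
    # Replace backslash + character with just the character.
    return [re.sub(r'\\(.)', r'\1', p) for p in pieces]

text = 'This is " a \\\"test\\\"\\\'s substring"'
print(split_unescaped(text))  # ['This', 'is', ' a "test"\'s substring']
```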
Speed test of different answers:
import re
import shlex
import csv
line = 'this is "a test"'
%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop
%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop
%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop
%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
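The %timeit lines above only work inside IPython. To reproduce the comparison in a plain script, something like the following sketch with the standard timeit module should do. The dictionary layout and the call counts are my own choices, and the absolute numbers will differ from machine to machine.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# One zero-argument callable per approach from the speed test above.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

CALLS = 2000
timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, converted to microseconds per call.
    best = min(timeit.repeat(func, number=CALLS, repeat=3))
    timings[name] = best / CALLS * 1e6
    print(f"{name:12s} {timings[name]:8.2f} µs per call")
```

Keep in mind that the approaches do not return the same thing: the two regex versions keep the quotes, while csv.reader and shlex.split remove them, and csv.reader returns a list of rows.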
It seems that re is faster than shlex, which matters if performance is a concern. Here is my solution, which uses a non-greedy quantifier and preserves the outer quotes:
re.findall(r'(?:".*?"|\S)+', s)
Result:
['this', 'is', '"a test"']
It leaves constructs like aaa"bla blub"bbb together, as these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:
>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""
Please note that this also matches the empty string "" by means of the \S part of the pattern.
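To try both findall patterns side by side, here is a short self-contained sketch. The names PLAIN and ESCAPED are only labels I picked for the two patterns; the patterns themselves are the ones from this answer, written as raw strings.

```python
import re

# Quoted run or any non-space character, repeated: keeps the outer quotes.
PLAIN = re.compile(r'(?:".*?"|\S)+')
# Same idea, but the closing quote must not follow a backslash.
ESCAPED = re.compile(r'(?:".*?[^\\]"|\S)+')

print(PLAIN.findall('this is "a test"'))      # ['this', 'is', '"a test"']
print(PLAIN.findall('x aaa"bla blub"bbb y'))  # ['x', 'aaa"bla blub"bbb', 'y']

escaped = 'She said "He said, \\"My name is Mark.\\""'
for token in ESCAPED.findall(escaped):
    print(token)
```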