Split a string by spaces — preserving quoted substrings — in Python

后端未结

关注

 16  680

I have a string which is like this:

this is \"a test\"

I\'m trying to write something in Python to split it up by space while ignoring spac

相关标签:

16条回答

天涯浪人

2020-11-22 15:32
I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?:
```
test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
```
Explanation:
```
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
```
shlex probably provides more features, though.
0 讨论(0)
发布评论:

提交评论
- 加载中...
春和景丽

2020-11-22 15:33
Have a look at the shlex module, particularly shlex.split.
```
>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

深忆病人

2020-11-22 15:35

To preserve quotes use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

0 讨论(0)

余生分开走

2020-11-22 15:38
Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:
```
  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
```
This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version of the function:
```
[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

谎友^

2020-11-22 15:41

Speed test of different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

0 讨论(0)

悲哀的现实

2020-11-22 15:42
It seems that for performance reasons re is faster. Here is my solution using a least greedy operator that preserves the outer quotes:
```
re.findall("(?:\".*?\"|\S)+", s)
```
Result:
```
['this', 'is', '"a test"']
```
It leaves constructs like aaa"bla blub"bbb together as these tokens are not separated by spaces. If the string contains escaped characters, you can match like that:
```
>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""
```
Please note that this also matches the empty string "" by means of the \S part of the pattern.
0 讨论(0)
发布评论:

提交评论
- 加载中...