Question
In Python, I'd like to split a string using a list of separators. The separators could be either commas or semicolons. Whitespace should be removed unless it is in the middle of non-whitespace, non-separator characters, in which case it should be preserved.
Test case 1: ABC,DEF123,GHI_JKL,MN OP
Test case 2: ABC;DEF123;GHI_JKL;MN OP
Test case 3: ABC ; DEF123,GHI_JKL ; MN OP
Sounds like a case for regular expressions, which is fine, but if it's easier or cleaner to do it another way that would be even better.
Thanks!
Answer 1:
This should be faster than a regex (see the timings below), and you can pass a list of separators as you wanted:
def split(txt, seps):
    default_sep = seps[0]
    # we skip seps[0] because that's the default separator
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]
How to use it:
>>> split('ABC ; DEF123,GHI_JKL ; MN OP', (',', ';'))
['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
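Performance test (timeit runs each statement 1,000,000 times by default; the comments show the total time in seconds):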
import timeit
import re
TEST = 'ABC ; DEF123,GHI_JKL ; MN OP'
SEPS = (',', ';')
rsplit = re.compile("|".join(SEPS)).split
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 1.6242462980007986
print(timeit.timeit(lambda: split(TEST, SEPS)))
# 1.3588597209964064
And with a much longer input string:
TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 130.67168392999884
print(timeit.timeit(lambda: split(TEST, SEPS)))
# 50.31940778599528
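One caveat when a regex pattern is built by joining separators with "|": that is fine for ',' and ';', but a separator that is itself a regex metacharacter (such as '|' or '.') would need escaping. A minimal sketch, assuming a hypothetical helper named regex_split (not part of the original answer):

```python
import re

def regex_split(text, seps):
    # Hypothetical helper: escape each separator so that characters such as
    # '|' or '.' are matched literally instead of being read as regex syntax.
    pattern = "|".join(re.escape(sep) for sep in seps)
    return [part.strip() for part in re.split(pattern, text)]

print(regex_split("a | b . c", ("|", ".")))
# ['a', 'b', 'c']
```

The replace-based split() above does not have this problem, since str.replace always treats its arguments as literal text.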
Answer 2:
With regular expressions, try
import re
[s.strip() for s in re.split(",|;", string)]
or, without them:
[t.strip() for s in string.split(",") for t in s.split(";")]
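As a quick check, here is the regex variant run against the three test cases from the question. This is only a sketch; the function name split_fields is made up for illustration:

```python
import re

def split_fields(text):
    # Split on commas or semicolons, then trim whitespace around each field.
    # Whitespace inside a field (as in "MN OP") is left untouched.
    return [field.strip() for field in re.split(r"[,;]", text)]

test_cases = [
    "ABC,DEF123,GHI_JKL,MN OP",
    "ABC;DEF123;GHI_JKL;MN OP",
    "ABC ; DEF123,GHI_JKL ; MN OP",
]

for case in test_cases:
    print(split_fields(case))
# Each line prints: ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```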
Answer 3:
Building on the answer above and your test cases: you can use a regular expression that matches one or more separator characters. In your case, the separator characters seem to be ',', '|', ';' and whitespace. Whitespace in a Python regular expression is '\s' ('\w' matches word characters, and '\W' matches any non-word character), so the split is:
import re
parts = re.split(r"[,|;\s]+", string)
I cannot comment on Sven's answer above, but this splits on one or more of the characters inside the brackets, so it doesn't need the strip() method.
Yikes, I didn't read the question correctly... Sven's answer with strip() works; mine treats whitespace as another separator.
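To make the difference concrete, here is a small comparison on test case 3 from the question. It is just a sketch, and the variable names are made up for illustration:

```python
import re

text = "ABC ; DEF123,GHI_JKL ; MN OP"

# Treating whitespace as a separator also breaks "MN OP" into two fields.
split_on_whitespace = re.split(r"[,;\s]+", text)
print(split_on_whitespace)
# ['ABC', 'DEF123', 'GHI_JKL', 'MN', 'OP']

# Splitting only on ',' and ';' and then stripping keeps "MN OP" intact.
split_and_strip = [s.strip() for s in re.split(r"[,;]", text)]
print(split_and_strip)
# ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```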
Answer 4:
>>> re.split(r'\s*,\s*|\s*;\s*', 'a , b; cdf')
['a', 'b', 'cdf']
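Note that a pattern like this only consumes whitespace next to a separator, so whitespace at the very start or end of the input stays in the first and last fields. A possible workaround, sketched with a hypothetical split_trimmed helper, is to strip the whole string first:

```python
import re

# Optional whitespace, a comma or semicolon, then optional whitespace.
SEPARATOR = re.compile(r"\s*[,;]\s*")

def split_trimmed(text):
    # Strip the outer whitespace first; the pattern handles the rest.
    return SEPARATOR.split(text.strip())

print(SEPARATOR.split("  a , b; cdf  "))
# ['  a', 'b', 'cdf  ']
print(split_trimmed("  a , b; cdf  "))
# ['a', 'b', 'cdf']
```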
Source: https://stackoverflow.com/questions/4697006/python-split-string-by-list-of-separators