Split a string with custom delimiter, respect and preserve quotes (single or double)

I have a string which is like this:

>>> s = '1,",2, ",,4,,,\',7, \',8,,10,'
>>> s
'1,",2, ",,4,,,\',7, \',8,,10,'

2 Answers

陌清茗

2021-01-29 10:35

A modified version of this (which handles only whitespace) can do the trick (quotes are stripped):

>>> import re
>>> s = '1,",2, ",,4,,,\',7, \',8,,10,'

>>> tokens = [t for t in re.split(r",?\"(.*?)\",?|,?'(.*?)',?|,", s) if t is not None ]
>>> tokens
['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']

And if you like to keep the quotes characters:

>>> tokens = [t for t in re.split(r",?(\".*?\"),?|,?('.*?'),?|,", s) if t is not None ]
>>> tokens
['1', '",2, "', '', '4', '', '', "',7, '", '8', '', '10', '']

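If you want to use a custom delimiter, replace every occurrence of , in the regexp with your own delimiter. If that delimiter has a special meaning in regular expressions (for example | or .), escape it first.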
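
A minimal sketch of that substitution, wrapped in a helper. The name split_quoted and its parameters are hypothetical (they are not part of the original answer); re.escape is used so that the delimiter is always matched literally.

```python
import re

def split_quoted(s, delimiter=",", keep_quotes=False):
    """Split s on delimiter, ignoring delimiters inside '...' or "..."."""
    d = re.escape(delimiter)  # match the delimiter literally, even if it is "|" or "."
    if keep_quotes:
        dq, sq = r'(".*?")', r"('.*?')"
    else:
        dq, sq = r'"(.*?)"', r"'(.*?)'"
    # Same three alternatives as the regexp above, with "," replaced by the
    # escaped delimiter. (?:...)? keeps multi-character delimiters optional as a unit.
    pattern = f"(?:{d})?{dq}(?:{d})?|(?:{d})?{sq}(?:{d})?|{d}"
    # re.split yields None for the capture group of the alternative that did not match.
    return [t for t in re.split(pattern, s) if t is not None]

print(split_quoted('1|"|2| "||4', delimiter="|"))
# ['1', '|2| ', '', '4']
```

With the default delimiter this should give the same tokens as the one-liners above.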

Explanation:

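|   = match alternatives, e.g. ( |X) = space or X
.*  = anything
.*? = anything, but as few characters as possible (non-greedy), so the match stops at the first closing quote
x?  = x or nothing
()  = capture the content of a matched pattern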

We have 3 alternatives:

1 "text"    -> ".*?" -> due to escaping rules becomes -> \".*?\"
2 'text'    -> '.*?'
3 delimiter ->  ,

Since we want to capture the content of the text inside the quotes, we use ():

1 \"(.*?)\"   (to keep the quotes, use (\".*?\"))
2 '(.*?)'     (to keep the quotes, use ('.*?'))

Finally, we don't want the split function to report an empty string when a
delimiter directly precedes or follows the quotes, so we let the pattern
match that optional delimiter too:

1 ,?\"(.*?)\",?
2 ,?'(.*?)',?
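
The effect of the optional ,? can be seen on a shorter string. This is an illustrative sketch; the sample string and the variable names bare and padded are made up for the comparison.

```python
import re

s = 'a,"b,c",d'

# Without the optional delimiters, the commas next to the quotes are matched
# separately, so split reports empty strings between adjacent matches.
bare = r"\"(.*?)\"|'(.*?)'|,"
without_optional = [t for t in re.split(bare, s) if t is not None]
print(without_optional)
# ['a', '', 'b,c', '', 'd']

# With ,? on both sides, the quoted text and its neighbouring commas form one match.
padded = r",?\"(.*?)\",?|,?'(.*?)',?|,"
with_optional = [t for t in re.split(padded, s) if t is not None]
print(with_optional)
# ['a', 'b,c', 'd']
```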

Once we use the | operator to join the 3 possibilities we get this regexp:

r",?\"(.*?)\",?|,?'(.*?)',?|,"
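
For readability, the same regexp can also be written with re.VERBOSE, one alternative per line. This is only a restatement of the pattern above; the name PATTERN is an arbitrary choice.

```python
import re

PATTERN = re.compile(r"""
    ,?"(.*?)",?    # alternative 1: double-quoted text and the optional delimiters around it
  | ,?'(.*?)',?    # alternative 2: single-quoted text and the optional delimiters around it
  | ,              # alternative 3: a bare delimiter
""", re.VERBOSE)

s = '1,",2, ",,4,,,\',7, \',8,,10,'
# Each match contributes both capture groups; the one that did not take part is None.
tokens = [t for t in PATTERN.split(s) if t is not None]
print(tokens)
# ['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
```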

忘掉有多难

2021-01-29 10:41
It looks like you are reinventing Python's csv module. Batteries included.
```
In [1]: import csv
In [2]: s = '1,",2, ",,4,,,\',7, \',8,,10,'
In [3]: next(csv.reader([s]))
Out[3]: ['1', ',2, ', '', '4', '', '', "'", '7', " '", '8', '', '10', '']
```
I think regexps are often not a good solution: they can be surprisingly slow at unexpected moments. With the csv module you can adjust the dialect, and it's easy to process any number of strings or files.
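
As a rough sketch of what adjusting the dialect looks like, the snippet below reads several lines with a custom delimiter. The sample data is made up; delimiter and quotechar are standard csv formatting parameters.

```python
import csv
import io

# Hypothetical sample data: two records that use ";" as the delimiter.
data = io.StringIO('1;";2; ";;4\na;"b;c";d\n')

# csv.reader accepts any iterable of lines, so a file object works the same way.
rows = list(csv.reader(data, delimiter=";", quotechar='"'))
for row in rows:
    print(row)
# ['1', ';2; ', '', '4']
# ['a', 'b;c', 'd']
```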

I couldn't get csv to accept two quotechar variants at the same time, but do you really need that?
```
In [4]: next(csv.reader([s], quotechar="'"))
Out[4]: ['1', '"', '2', ' "', '', '4', '', '', ',7, ', '8', '', '10', '']
```
or
```
In [5]: s = '1,",2, ",,4,,,",7, ",8,,10,'
In [6]: next(csv.reader([s]))
Out[6]: ['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
```
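
If both quote characters really are needed at once, one option that avoids both csv and regexps is a small hand-written scanner. The sketch below is not from either answer; the function name and parameters are hypothetical, it assumes a single-character delimiter, and it treats an unterminated quote as running to the end of the string.

```python
def split_respecting_quotes(s, delimiter=",", quotes="\"'", keep_quotes=False):
    """Split s on a single-character delimiter, ignoring delimiters inside quotes."""
    fields, current, open_quote = [], [], None
    for ch in s:
        if open_quote is not None:
            # Inside a quoted run: only the same quote character closes it.
            if ch == open_quote:
                open_quote = None
                if keep_quotes:
                    current.append(ch)
            else:
                current.append(ch)
        elif ch in quotes:
            open_quote = ch  # remember which quote character opened the run
            if keep_quotes:
                current.append(ch)
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))  # last field (empty if s ends with the delimiter)
    return fields

s = '1,",2, ",,4,,,\',7, \',8,,10,'
print(split_respecting_quotes(s))
# ['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
```

Passing keep_quotes=True leaves the quote characters in the returned fields.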