Split a string with custom delimiter, respect and preserve quotes (single or double)

予麋鹿 2021-01-29 10:04

I have a string which is like this:

>>> s = '1,",2, ",,4,,,\',7, \',8,,10,'
>>> s
'1,",2, ",,4,,,\',7, \',8,,10,'
2 Answers
  • 2021-01-29 10:35

    A modified version of this (which handles only whitespace) can do the trick (quotes are stripped):

    >>> import re
    >>> s = '1,",2, ",,4,,,\',7, \',8,,10,'
    
    >>> tokens = [t for t in re.split(r",?\"(.*?)\",?|,?'(.*?)',?|,", s) if t is not None ]
    >>> tokens
    ['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
    

    And if you like to keep the quotes characters:

    >>> tokens = [t for t in re.split(r",?(\".*?\"),?|,?('.*?'),?|,", s) if t is not None ]
    >>> tokens
    ['1', '",2, "', '', '4', '', '', "',7, '", '8', '', '10', '']
    

    If you want to use a custom delimiter replace every occurrence of , in the regexp with your own delimiter.
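One way to avoid editing the regexp by hand is to build it from the delimiter. The sketch below is only an illustration: the helper name `split_quoted` and its `keep_quotes` flag are made up for this example, and it assumes the delimiter should be treated as literal text, which is why it is passed through `re.escape`.

```python
import re

def split_quoted(s, delimiter=",", keep_quotes=False):
    """Split s on delimiter, treating '...' and "..." as single fields.

    Hypothetical helper: it builds the same three-alternative pattern
    as the answer, with the comma swapped for an escaped delimiter.
    """
    d = re.escape(delimiter)  # makes delimiters such as '|' or '.' literal
    if keep_quotes:
        dq, sq = r'(".*?")', r"('.*?')"
    else:
        dq, sq = r'"(.*?)"', r"'(.*?)'"
    # (?:...)? keeps a multi-character delimiter optional as a whole
    pattern = rf"(?:{d})?{dq}(?:{d})?|(?:{d})?{sq}(?:{d})?|{d}"
    # re.split returns None for capture groups that did not take part in a match
    return [t for t in re.split(pattern, s) if t is not None]

s = '1,",2, ",,4,,,\',7, \',8,,10,'
print(split_quoted(s))
print(split_quoted('a|"b|c"|d', delimiter="|"))
```

Without `re.escape`, a delimiter like `|` would be read as regexp alternation rather than as a character to split on.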

    Explanation:

    |   = match alternatives, e.g. ( |X) = space or X
    .*  = anything
    .*? = anything, but as few characters as possible (non-greedy)
    x?  = x or nothing
    ()  = capture the content of a matched pattern
    
    We have 3 alternatives:
    
    1 "text"    -> ".*?" -> escaped inside the double-quoted pattern string -> \".*?\"
    2 'text'    -> '.*?'
    3 delimiter ->  ,
    
    Since we want to capture the content of the text inside the quotes, we use ():
    
    1 \"(.*?)\"   (to keep the quotes use (\".*?\"))
    2 '(.*?)'     (to keep the quotes use ('.*?'))
    
    Finally, we don't want the split function to report an empty string when
    a delimiter precedes or follows a quoted field, so we also match (without
    capturing) that optional delimiter:
    
    1 ,?\"(.*?)\",?
    2 ,?'(.*?)',?
    
    Once we use the | operator to join the 3 possibilities we get this regexp:
    
    r",?\"(.*?)\",?|,?'(.*?)',?|,"
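To see what the optional delimiters buy, the same regexp can be written with `re.VERBOSE` and compared against a variant that leaves them out. This is a rough sketch: the names `ANNOTATED`, `NAIVE` and `split_with` are invented for the illustration.

```python
import re

s = '1,",2, ",,4,,,\',7, \',8,,10,'

# Same pattern as above, written with re.VERBOSE so each alternative can be annotated.
ANNOTATED = re.compile(r"""
      ,? "(.*?)" ,?    # 1: double-quoted text plus an optional delimiter on each side
    | ,? '(.*?)' ,?    # 2: single-quoted text plus an optional delimiter on each side
    | ,                # 3: a bare delimiter
""", re.VERBOSE)

# Variant without the optional delimiters, for comparison only.
NAIVE = re.compile(r""" "(.*?)" | '(.*?)' | , """, re.VERBOSE)

def split_with(pattern, text):
    # Drop the None entries that re.split emits for unused capture groups
    return [t for t in pattern.split(text) if t is not None]

with_optional = split_with(ANNOTATED, s)
without_optional = split_with(NAIVE, s)
print(with_optional)
print(without_optional)
```

In the second list each quoted field is surrounded by extra empty strings, because the commas next to the quotes are matched as separate delimiters.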
    
  • 2021-01-29 10:41

    It looks like you are reinventing Python's csv module. Batteries included.

    In [1]: import csv
    In [2]: s = '1,",2, ",,4,,,\',7, \',8,,10,'
    In [3]: next(csv.reader([s]))
    Out[3]: ['1', ',2, ', '', '4', '', '', "'", '7', " '", '8', '', '10', '']
    

    I think regexps are often not a good solution: they can be surprisingly slow at unexpected moments. With the csv module you can adjust the dialect, and it's easy to process any number of strings or files.

    I failed to configure csv to accept two quotechar variants at the same time, but do you really need that?

    In [4]: next(csv.reader([s], quotechar="'"))
    Out[4]: ['1', '"', '2', ' "', '', '4', '', '', ',7, ', '8', '', '10', '']
    

    or, if the input uses only double quotes:

    In [5]: s = '1,",2, ",,4,,,",7, ",8,,10,'
    In [6]: next(csv.reader([s]))
    Out[6]: ['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
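For a custom delimiter, `csv.reader` accepts `delimiter` and `quotechar` keyword arguments, so no regexp is needed. A minimal sketch, where the wrapper name `split_line` and the `;`-separated sample are made up for illustration; it still handles only one quote character at a time:

```python
import csv

def split_line(line, delimiter=",", quotechar='"'):
    """Parse a single line with csv.reader (hypothetical convenience wrapper)."""
    # csv.reader expects an iterable of lines, so wrap the string in a list
    return next(csv.reader([line], delimiter=delimiter, quotechar=quotechar))

# Semicolon-delimited input with a double-quoted field containing the delimiter
print(split_line('1;";2; ";;4', delimiter=";"))

# The string from the question, parsed with single quotes as the quote character
s = '1,",2, ",,4,,,\',7, \',8,,10,'
print(split_line(s, quotechar="'"))
```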
    