How do I split a comma-delimited string in Python, except for the commas that are within quotes?


执笔经年 2020-12-20 01:07

I am trying to split a comma-delimited string in Python. The tricky part is that some of the fields in the data themselves have a comma in them and they are encl

4 answers

醉梦人生 (original poster)

2020-12-20 02:07

(Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)

import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
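    pos = 0
    # Optional opening quote, lazy field body, the same closing quote,
    # then a comma or the end of the string.
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)     # field contents, without surrounding quotes
        separator = m.group(3)  # ',' if more fields follow, '' at the end

        yield result

        # An empty separator means the match stopped at the end of the string.
        if not separator:
            break

        # Resume scanning just past the comma.
        pos = m.end(0)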

if __name__ == "__main__":
    import doctest
    doctest.testmod()

(['"]?) matches an optional single- or double-quote.

(.*?) matches the field itself. This is a non-greedy match, so it consumes as few characters as possible instead of running to the end of the string. It is assigned to result, which is what we actually yield.

\1 is a backreference, to match the same single- or double-quote we matched earlier (if any).

(,|$) matches the comma separating each entry, or the end of the string. This is assigned to separator.
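To make the three groups concrete, here is a small sketch that runs the same pattern on one sample string and inspects a single match. The sample strings and the greedy variant are assumptions added for illustration only; they are not part of the answer's code.

```python
import re

# Same pattern as in parse_fields above.
exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

text = '''"hello,world",'hey,world',plain'''

# First match: opening quote, field body, separator.
m = exp.search(text, 0)
first = m.groups()
first_end = m.end(0)
print(first)      # ('"', 'hello,world', ',')
print(first_end)  # 14, the position just past the comma

# Lazy vs. greedy on an unquoted sample (the greedy pattern is a
# hypothetical variant, shown only to contrast with the lazy one).
sample = 'hey,hello,world'
greedy = re.compile(r"""(['"]?)(.*)\1(,|$)""")
lazy_field = exp.search(sample).group(2)
greedy_field = greedy.search(sample).group(2)
print(lazy_field)    # hey
print(greedy_field)  # hey,hello,world
```

The greedy variant swallows the commas because `.*` runs to the end of the string and `$` then succeeds immediately, which is why the lazy `.*?` is needed here.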

If separator is empty (and therefore false), no comma followed the field, so we've reached the end of the string and we're done. Otherwise, we move the start position to where the regex finished (m.end(0)) and continue the loop.
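The separator check is what lets the loop tell a trailing empty field apart from the plain end of input. The original pre-edit code is not shown, so the following is only a plausible illustration of the re.findall problem mentioned in the edit note, assuming the same pattern is used; fields_via_findall and parse_fields_list are hypothetical helper names.

```python
import re

exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

def fields_via_findall(text):
    # Hypothetical helper, NOT the original answer's code:
    # keep group 2 of every match that findall reports.
    return [field for _quote, field, _sep in exp.findall(text)]

def parse_fields_list(text):
    # Compact copy of the answer's loop, collected into a list.
    pos, out = 0, []
    while True:
        m = exp.search(text, pos)
        out.append(m.group(2))
        if not m.group(3):  # empty separator: end of string reached
            return out
        pos = m.end(0)

# findall also reports a final empty match at the end of the string,
# so both inputs produce the same result.
print(fields_via_findall('a,b'))   # ['a', 'b', '']
print(fields_via_findall('a,b,'))  # ['a', 'b', '']

# The explicit loop stops as soon as the separator is empty.
print(parse_fields_list('a,b'))    # ['a', 'b']
print(parse_fields_list('a,b,'))   # ['a', 'b', '']
```

With findall, 'a,b' gains a spurious trailing field, whereas the loop only yields a trailing '' when the input really ends with a comma.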
