How do I split a comma delimited string in Python except for the commas that are within quotes

执笔经年 2020-12-20 01:07

I am trying to split a comma-delimited string in Python. The tricky part for me is that some of the fields in the data themselves contain a comma, and they are encl

4 answers
  •  醉梦人生
    2020-12-20 02:07

    (Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)

    import re
    
    def parse_fields(text):
        r"""
        >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
        ['hey', 'hello', '', 'hello,world', 'hey,world']
        >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
        ['hey', 'hello', '', 'hello,world', 'hey,world', '']
        >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
        ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
        >>> list(parse_fields(''))
        ['']
        >>> list(parse_fields(','))
        ['', '']
        >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
        ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
        >>> list(parse_fields('testing,"unterminated quotes'))
        ['testing', '"unterminated quotes']
        """
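        pos = 0
        # Optional opening quote, lazy field body, the same closing quote,
        # then a comma or the end of the string.
        exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
        while True:
            # Look for the next field, starting where the previous one ended.
            m = exp.search(text, pos)
            result = m.group(2)
            separator = m.group(3)
    
            yield result
    
            # An empty separator means "$" matched: this was the last field.
            if not separator:
                break
    
            # Resume just past the comma that ended this field.
            pos = m.end(0)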
    
    if __name__ == "__main__":
        import doctest
        doctest.testmod()
    

    (['"]?) matches an optional single- or double-quote.

    (.*?) matches the content of the field. The match is non-greedy, so it consumes only as much as necessary instead of swallowing the rest of the string. This is assigned to result, and it's what we actually yield.

    \1 is a backreference, to match the same single- or double-quote we matched earlier (if any).

    (,|$) matches the comma separating each entry, or the end of the string. This is assigned to separator.
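To see how these pieces fit together, here is a minimal sketch that applies the same pattern once and returns the three captured groups. The helper name `show_groups` is made up for this illustration and is not part of the answer's code.

```python
import re

# Same pattern as in parse_fields.
FIELD = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

def show_groups(text):
    """Return (quote, field, separator) for the first field of text.

    Hypothetical helper, for illustration only. Assumes text has no newlines.
    """
    m = FIELD.match(text)
    return m.groups()

print(show_groups('"hello,world",next'))  # ('"', 'hello,world', ',')
print(show_groups("plain,next"))          # ('', 'plain', ',')
print(show_groups("'last'"))              # ("'", 'last', '')
print(show_groups('"open,next'))          # ('', '"open', ',')
```

The last call shows why the quote group is optional: when no closing quote is found, the regex backtracks, treats the opening quote as ordinary text, and splits at the next comma.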

    If separator is falsy (i.e. empty), the match ended at the end of the string rather than at a comma, so we're done. Otherwise, we move the start position to where the regex finished (m.end(0)) and continue the loop.
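The loop can also be traced step by step. The sketch below reuses the same pattern and records the start position, field, and separator for each pass. `trace_fields` is a hypothetical name used only for this illustration.

```python
import re

# Same pattern as in parse_fields.
FIELD = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

def trace_fields(text):
    """Return one (start_pos, field, separator) tuple per loop pass."""
    steps = []
    pos = 0
    while True:
        m = FIELD.search(text, pos)
        steps.append((pos, m.group(2), m.group(3)))
        if not m.group(3):   # empty separator: "$" matched, so stop
            break
        pos = m.end(0)       # resume right after the comma
    return steps

for step in trace_fields('a,"b,c",'):
    print(step)
# (0, 'a', ',')
# (2, 'b,c', ',')
# (8, '', '')
```

The trailing comma leaves the position at the very end of the string, where the pattern still matches an empty field, which is why the doctests expect a final ''.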
