How do I split a comma-delimited string in Python, except for the commas that are within quotes?

执笔经年 2020-12-20 01:07

I am trying to split a comma-delimited string in Python. The tricky part for me here is that some of the fields in the data themselves have a comma in them and they are encl

4 answers
  • 2020-12-20 01:53

    I fabricated something like this. Very redundant I suppose, but it does the job for me. You have to adapt it a bit to your specifications:

    import re

    def csv_splitter(line):
        # Indices where the line is cut; 0 marks the start of the first field.
        splitthese = [0]
        splitted = []
        # True while we are outside double quotes, so a comma may split here.
        splitpos = True
        for nr, i in enumerate(line):
            if i == '"':
                splitpos = not splitpos
            elif i == "," and splitpos:
                splitthese.append(nr)
        splitthese.append(len(line))
        for start, end in zip(splitthese, splitthese[1:]):
            # Drop the slice's leading comma and every double quote in it.
            splitted.append(re.sub('^,|"', "", line[start:end]))
        return splitted
    
  • 2020-12-20 01:59

    The csv module won't handle the scenario of " and ' being quotes at the same time. Absent a module that provides that kind of dialect, one has to get into the parsing business. To avoid reliance on a third party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the matched pattern.

    The following code when run as a script passes all the tests shown, with Python 2.7 and 2.2.

    import re
    
    # lexical token symbols
    DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = range(5)
    
    _pattern_tuples = (
        (r'"[^"]*"', DQUOTED),
        (r"'[^']*'", SQUOTED),
        (r",", COMMA),
        (r"$", NEWLINE), # matches end of string OR \n just before end of string
        (r"[^,\n]+", UNQUOTED), # order in the above list is important
        )
    _matcher = re.compile(
        '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
        ).match
    _toktype = [None] + [i[1] for i in _pattern_tuples]
    # need dummy at start because re.MatchObject.lastindex counts from 1 
    
    def csv_split(text):
        """Split a csv string into a list of fields.
        Fields may be quoted with " or ' or be unquoted.
        An unquoted string can contain both a " and a ', provided neither is at
        the start of the string.
        A trailing \n will be ignored if present.
        """
        fields = []
        pos = 0
        want_field = True
        while 1:
            m = _matcher(text, pos)
            if not m:
                raise ValueError("Problem at offset %d in %r" % (pos, text))
            ttype = _toktype[m.lastindex]
            if want_field:
                if ttype in (DQUOTED, SQUOTED):
                    fields.append(m.group(0)[1:-1])
                    want_field = False
                elif ttype == UNQUOTED:
                    fields.append(m.group(0))
                    want_field = False
                elif ttype == COMMA:
                    fields.append("")
                else:
                    assert ttype == NEWLINE
                    fields.append("")
                    break
            else:
                if ttype == COMMA:
                    want_field = True
                elif ttype == NEWLINE:
                    break
                else:
                    print("*** Error dump *** %r %r %r" % (ttype, m.group(0), fields))
                    raise ValueError("Missing comma at offset %d in %r" % (pos, text))
            pos = m.end(0)
        return fields
    
    if __name__ == "__main__":
        tests = (
            ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
            ("""\n""", ['']),
            ("""""", ['']),
            ("""a,b\n""", ['a', 'b']),
            ("""a,b""", ['a', 'b']),
            (""",,,\n""", ['', '', '', '']),
            ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
            ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
            )
        for text, expected in tests:
            result = csv_split(text)
            print("")
            print(repr(text))
            print(repr(result))
            print(repr(expected))
            print(result == expected)
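
    The lastindex technique used in this answer can be seen in isolation in the sketch below. It is only an illustration, not part of the original answer: the names TOKEN_NAMES, TOKEN_RE and tokenize are hypothetical, and the token set is deliberately tiny. Each alternative in the pattern is its own capturing group, so m.lastindex gives the number of the group that matched, which is then used as an index into a parallel list of token names.

```python
import re

# Hypothetical token table; index 0 is a dummy because lastindex counts from 1.
TOKEN_NAMES = [None, "NUMBER", "WORD", "COMMA"]

# One capturing group per token type, in the same order as TOKEN_NAMES[1:].
TOKEN_RE = re.compile(r"(\d+)|([A-Za-z]+)|(,)")


def tokenize(text):
    """Return a list of (token_name, matched_text) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError("Unexpected character at offset %d in %r" % (pos, text))
        # lastindex is the number of the group that matched.
        tokens.append((TOKEN_NAMES[m.lastindex], m.group(0)))
        pos = m.end()
    return tokens
```

    This only works as described when the groups are not nested; with nested groups, lastindex refers to the last group to close, which may not be the one you want.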
    
  • 2020-12-20 02:05

    Sounds like you want the csv module.
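
    A minimal sketch of that suggestion, assuming the fields are quoted with double quotes only (csv.reader accepts a single quotechar, so it will not treat " and ' as quotes at the same time). The helper name split_quoted is hypothetical.

```python
import csv


def split_quoted(line, quotechar='"'):
    """Split one comma-delimited line, keeping commas inside quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line], quotechar=quotechar))
```

    For example, split_quoted('hey,hello,,"hello,world",end') returns ['hey', 'hello', '', 'hello,world', 'end'].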

  • 2020-12-20 02:07

    (Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)

    import re
    
    def parse_fields(text):
        r"""
        >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
        ['hey', 'hello', '', 'hello,world', 'hey,world']
        >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
        ['hey', 'hello', '', 'hello,world', 'hey,world', '']
        >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
        ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
        >>> list(parse_fields(''))
        ['']
        >>> list(parse_fields(','))
        ['', '']
        >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
        ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
        >>> list(parse_fields('testing,"unterminated quotes'))
        ['testing', '"unterminated quotes']
        """
        pos = 0
        exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
        while True:
            m = exp.search(text, pos)
            result = m.group(2)
            separator = m.group(3)
    
            yield result
    
            if not separator:
                break
    
            pos = m.end(0)
    
    if __name__ == "__main__":
        import doctest
        doctest.testmod()
    

    (['"]?) matches an optional single- or double-quote.

    (.*?) matches the string itself. This is a non-greedy match, to match as much as necessary without eating the whole string. This is assigned to result, and it's what we actually yield as a result.

    \1 is a backreference, to match the same single- or double-quote we matched earlier (if any).

    (,|$) matches the comma separating each entry, or the end of the line. This is assigned to separator.

    If separator is false (e.g. empty), there is no separator, so we're at the end of the string and we're done. Otherwise, we update the start position based on where the regex finished (m.end(0)) and continue the loop.
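
    To see how the three groups behave on each pass through the loop, here is a small tracing sketch. It uses the same regular expression as parse_fields, but the names FIELD_RE and trace_fields are hypothetical and it is meant for inspection only.

```python
import re

# Same pattern as in parse_fields: optional quote, lazy body, same quote, separator.
FIELD_RE = re.compile(r"""(['"]?)(.*?)\1(,|$)""")


def trace_fields(text):
    """Return the (quote, result, separator) groups of every match, in order."""
    steps = []
    pos = 0
    while True:
        m = FIELD_RE.search(text, pos)
        steps.append(m.groups())
        if not m.group(3):
            # Empty separator: the match ended at the end of the string.
            break
        pos = m.end(0)
    return steps
```

    For the input 'a,"b,c",d' this returns [('', 'a', ','), ('"', 'b,c', ','), ('', 'd', '')]: the quote group is empty for unquoted fields, and the separator group is empty only on the final match.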
