Splitting comma-delimited strings in Python

礼貌的吻别 2021-01-12 02:08

This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be a more general solution. What I'm looking for is

3 answers
  • 2021-01-12 02:10

    Using iterators and generators:

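    from collections import defaultdict

    def tokenize(txt, delim=',', pairs={'"':'"', '<':'>', '(':')'}):
        fst, snd = set(pairs.keys()), set(pairs.values())
        it = iter(txt)

        def loop():
            cnt = defaultdict(int)  # open count, keyed by closing character

            # A for loop ends cleanly when `it` is exhausted. Calling
            # it.__next__() here would leak StopIteration out of the
            # generator, which is a RuntimeError on Python 3.7+ (PEP 479).
            for ch in it:
                if ch == delim and not any(cnt[x] for x in snd):
                    return
                elif pairs.get(ch) == ch:
                    cnt[ch] ^= 1  # same opener and closer (quotes): toggle
                elif ch in fst:
                    cnt[pairs[ch]] += 1
                elif ch in snd:
                    cnt[ch] -= 1
                yield ch

        # each loop() call consumes one field from the shared iterator
        while it.__length_hint__():
            yield ''.join(loop())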
    

    Example:

    >>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
    >>> [x for x in tokenize(txt)]
    ['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
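
    One practical benefit of the generator approach is that fields are produced lazily, so a caller can stop early without scanning the rest of the string. The sketch below illustrates that with a simplified, hypothetical helper named iter_fields (not the tokenize function above); it assumes only bracket pairs matter and ignores quotes.

```python
from itertools import islice

def iter_fields(text, delim=',', pairs=(('<', '>'), ('(', ')'))):
    """Lazily yield the top-level fields of text (hypothetical helper)."""
    closer_for = dict(pairs)            # opener -> closer
    closers = set(closer_for.values())
    depth = 0                           # current bracket nesting depth
    buf = []
    for ch in text:
        if ch == delim and depth == 0:
            yield ''.join(buf)          # a top-level delimiter ends the field
            buf = []
            continue
        if ch in closer_for:
            depth += 1
        elif ch in closers:
            depth -= 1
        buf.append(ch)
    yield ''.join(buf)                  # last field (may be empty)

txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),tail<1, 2>'
# Take only the first two fields; the third is never tokenized.
first_two = list(islice(iter_fields(txt), 2))
print(first_two)  # ['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)']
```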
    
  • 2021-01-12 02:26

    If the expressions can be nested recursively, you can use pyparsing to split on the top-level commas and, at the same time, validate that the brackets match:

    import pyparsing as pp
    
    def CommaSplit(txt):
        ''' Replicate the function of str.split(',') but do not split on nested expressions or in quoted strings'''
        com_lok=[]
        comma = pp.Suppress(',')
        # note the location of each comma outside an ignored expression:
        comma.setParseAction(lambda s, lok, toks: com_lok.append(lok))
        ident = pp.Word(pp.alphas+"_", pp.alphanums+"_")  # python identifier
        ex1=(ident+pp.nestedExpr(opener='<', closer='>'))   # Ignore everything inside nested '< >'
        ex2=(ident+pp.nestedExpr())                       # Ignore everything inside nested '( )'
        ex3=pp.Regex(r'("|\').*?\1')                      # Ignore everything inside "'" or '"'
        atom = ex1 | ex2 | ex3 | comma
        expr = pp.OneOrMore(atom) + pp.ZeroOrMore(comma  + atom )
        try:
            result=expr.parseString(txt)
        except pp.ParseException:
            return [txt]
        else:    
            return [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]             
    
    
    tests='''\
    obj<1, 2, 3>, x(4, 5), "msg, with comma"
    nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma"
    nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3>
    bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5),  , 'msg, with comma', obj<1, sub<6, 7>, 3>
    bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)
    '''
    
    for te in tests.splitlines():
        result=CommaSplit(te)
        print(te,'==>\n\t',result)
    

    Prints:

    obj<1, 2, 3>, x(4, 5), "msg, with comma" ==>
         ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
    nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma" ==>
         ['nesteobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', ' "msg, with comma"']
    nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3> ==>
         ['nestedobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', " 'msg, with comma'", ' additional<1, sub<6, 7>, 3>']
    bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5),  , 'msg, with comma', obj<1, sub<6, 7>, 3> ==>
         ['bare_comma<1, sub(6, 7), 3>', ' x(4, y(8, 9), 5)', '  ', " 'msg, with comma'", ' obj<1, sub<6, 7>, 3>']
    bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3) ==>
         ["bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)"]
    

    As written, the function behaves just like '(something does not split), b, "in quotes", c'.split(','), including keeping the leading spaces and the quotes. It is trivial to strip the quotes and leading spaces from the fields.

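    To strip them, give CommaSplit a strip_fields parameter (for example, def CommaSplit(txt, strip_fields=False)) and change the else under try to: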

    else:
        rtr = [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
        if strip_fields:
            rtr=[e.strip().strip('\'"') for e in rtr]
        return rtr  
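
    For comparison, the same split-and-validate idea can be sketched with the standard library only, using an explicit stack of expected closing characters. This is a minimal sketch, not the pyparsing answer's method: the name split_checked and its parameters are assumptions made for illustration, and unlike CommaSplit (which returns the input unsplit) it raises ValueError on mismatched input.

```python
def split_checked(text, delim=',', pairs=None, quotes='"\''):
    """Split text on top-level delimiters, validating bracket pairing.

    Hypothetical helper: raises ValueError on a mismatched or
    unterminated bracket or quote.
    """
    if pairs is None:
        pairs = {'<': '>', '(': ')'}
    closers = set(pairs.values())
    stack = []          # expected closing characters, innermost last
    quote = None        # the quote character we are inside, if any
    fields, buf = [], []
    for pos, ch in enumerate(text):
        if quote is not None:
            if ch == quote:             # only the same quote char closes it
                quote = None
        elif ch in quotes:
            quote = ch
        elif ch in pairs:
            stack.append(pairs[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                raise ValueError(f'unexpected {ch!r} at position {pos}')
        elif ch == delim and not stack:
            fields.append(''.join(buf))  # top-level delimiter: end the field
            buf = []
            continue
        buf.append(ch)
    if stack or quote is not None:
        raise ValueError('unterminated bracket or quote')
    fields.append(''.join(buf))
    return fields

print(split_checked('obj<1, sub<6, 7>, 3>, x(4, 5), "msg, with comma"'))
# ['obj<1, sub<6, 7>, 3>', ' x(4, 5)', ' "msg, with comma"']
```

    Like the pyparsing version, this keeps leading spaces and quotes; a caller can apply e.strip().strip('\'"') to each field afterwards.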
    
  • 2021-01-12 02:29

    A regular expression alone is not enough here, because Python's re module cannot match arbitrarily nested brackets. The following simple code achieves the desired result:

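    def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
        result = []
        buff = ""
        level = 0      # current bracket nesting depth
        quote = None   # the quote character we are inside, if any

        for char in text:
            if char in delimiter and level == 0 and quote is None:
                result.append(buff)
                buff = ""
                continue

            buff += char

            if quote is not None:
                # inside quotes only the matching quote character matters,
                # so "it's" or a bracket in a string does not confuse the split
                if char == quote:
                    quote = None
            elif char in quotes:
                quote = char
            elif char in opens:
                level += 1
            elif char in closes:
                level -= 1

        # note: a trailing empty field is dropped
        if buff:
            result.append(buff)

        return result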
    

    Running this in the interpreter:

    >>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
    ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
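
    To see why a plain regular expression falls short, consider the sketch below. The pattern is an illustrative assumption, not something from the answers here: it splits on commas that are not followed by a closing bracket before the next opening bracket. That works for one level of brackets and breaks as soon as they nest.

```python
import re

# Split on commas that are NOT followed by a closer before the next opener.
# Illustrative pattern only; it has no notion of nesting depth.
FLAT_SPLIT = re.compile(r',(?![^<(]*[>)])')

flat = FLAT_SPLIT.split('a<1, 2>, b(3, 4)')
nested = FLAT_SPLIT.split('a<1, b<2, 3>, 4>')

print(flat)    # ['a<1, 2>', ' b(3, 4)']      one level: correct
print(nested)  # ['a<1', ' b<2, 3>, 4>']      nested: wrongly split inside '<...>'
```

    Counting depth in a loop, as split_at does, avoids this because the nesting level is tracked explicitly.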
    