How to split comma-separated key-value pairs with quoted commas

终归单人心 2021-01-02 13:03

I know there are a lot of other posts about parsing comma-separated values, but I couldn't find one that splits key-value pairs and handles quoted commas.

I have st

5 answers
  • 2021-01-02 13:40

    Python seems to offer many ways to solve this task. Here is a more C-like implementation that processes the string one character at a time. It would be interesting to compare the run times of the different approaches.

    text = 'age=12,name=bob,hobbies="games,reading",phrase="I\'m cool!"'
    key = ""
    val = ""
    result = {}          # named result/text to avoid shadowing the built-ins dict/str
    in_quotes = False    # True while inside a double-quoted value
    parse_key = True     # True while reading a key, False while reading a value
    for c in text:
        if c == '"':     # quotes toggle quoted mode and are not stored
            in_quotes = not in_quotes
            continue
        if in_quotes:    # commas and equals signs inside quotes are literal
            val += c
            continue
        if c == ',':     # terminate entry
            result[key] = val
            key = ""
            val = ""
            parse_key = True
        elif c == '=' and parse_key:
            parse_key = False
        elif parse_key:
            key += c
        else:
            val += c
    result[key] = val    # store the final pair
    print(result)
    # {'age': '12', 'name': 'bob', 'hobbies': 'games,reading', 'phrase': "I'm cool!"}
    

    demo: http://repl.it/6oC/1
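
    On the run-time question, one way to get a rough number is the standard timeit module. The sketch below wraps the same character-by-character logic in a function (the name parse_pairs_charwise and the iteration count are my own choices, not from the answer) and times it on the sample string. Absolute timings depend on the machine, so no expected figure is given.

```python
import timeit

def parse_pairs_charwise(text):
    """Hypothetical wrapper around the character-by-character loop."""
    result = {}
    key, val = [], []          # lists avoid repeated string concatenation
    in_quotes = False
    in_key = True
    for c in text:
        if c == '"':           # toggle quoted mode, drop the quote itself
            in_quotes = not in_quotes
        elif in_quotes:        # everything inside quotes belongs to the value
            val.append(c)
        elif c == ',':         # end of one key=value entry
            result[''.join(key)] = ''.join(val)
            key, val = [], []
            in_key = True
        elif c == '=' and in_key:
            in_key = False
        elif in_key:
            key.append(c)
        else:
            val.append(c)
    result[''.join(key)] = ''.join(val)   # store the final pair
    return result

sample = 'age=12,name=bob,hobbies="games,reading",phrase="I\'m cool!"'
print(parse_pairs_charwise(sample))

# Time 10,000 parses of the sample string (an arbitrary count).
elapsed = timeit.timeit(lambda: parse_pairs_charwise(sample), number=10_000)
print(f"{elapsed:.3f} s for 10,000 runs")
```

    The same timeit call can be pointed at any of the other approaches in this thread for a like-for-like comparison.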

  • 2021-01-02 13:51

    This can be done with a regular expression, and in this case it might actually be the best option. I think it will work with most input, even escaped quotes such as phrase='I\'m cool' (the backslash stays in the captured value).

    With the VERBOSE flag, it's possible to make complicated regular expressions quite readable.

    import re
    text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    regex = re.compile(
        r'''
            (?P<key>\w+)=      # Key: letters, digits and underscores
            (?P<quote>["']?)   # Optional quote character.
            (?P<value>.*?)     # Value is a non greedy match
            (?P=quote)         # Closing quote equals the first.
            ($|,)              # Entry ends with comma or end of string
        ''',
        re.VERBOSE
        )
    
    d = {match.group('key'): match.group('value') for match in regex.finditer(text)}
    
    print(d)  # {'name': 'bob', 'phrase': "I'm cool!", 'age': '12', 'hobbies': 'games,reading'}
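
    To see how the pattern treats an escaped quote, here is a small sketch that wraps the same expression in a function (parse_pairs_regex is a name I made up) and feeds it the phrase='I\'m cool' case. The only change to the pattern is a non-capturing group for the terminator, which does not affect what is matched.

```python
import re

PAIR_RE = re.compile(
    r'''
        (?P<key>\w+)=      # key: letters, digits and underscores
        (?P<quote>["']?)   # optional opening quote
        (?P<value>.*?)     # value, matched non-greedily
        (?P=quote)         # closing quote must equal the opening one
        (?:,|$)            # entry ends at a comma or at end of string
    ''',
    re.VERBOSE,
)

def parse_pairs_regex(text):
    """Hypothetical helper: return a dict of key -> raw value."""
    return {m.group('key'): m.group('value') for m in PAIR_RE.finditer(text)}

# The raw string keeps the backslash, so the input is: phrase='I\'m cool',age=12
print(parse_pairs_regex(r"phrase='I\'m cool',age=12"))
# {'phrase': "I\\'m cool", 'age': '12'}
```

    The inner quote does not end the value because it is not followed by a comma or the end of the string, but the backslash is kept, so a caller who wants the unescaped text has to strip it separately.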
    
  • 2021-01-02 13:52

    You just needed to use your shlex lexer in POSIX mode.

    Add posix=True when creating the lexer.

    (See the shlex parsing rules)

    import shlex

    lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''', posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    props = dict(pair.split('=', 1) for pair in lexer)
    

    Output:

    {'age': '12', 'phrase': "I'm cool!", 'hobbies': 'games,reading', 'name': 'bob'}
    

    PS: Regular expressions struggle to parse key-value pairs when the input can contain quoted = or , characters, and escaped quotes inside quoted values make the pattern harder still. A lexer that tracks the quoting state is usually the more robust choice.
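
    As a rough illustration of the quoted = and , case, the sketch below wraps the POSIX-mode lexer in a function (parse_pairs_shlex is a hypothetical name). Clearing lexer.commenters is my own addition, not part of the answer: by default shlex treats # as the start of a comment, which would silently truncate values containing it.

```python
import shlex

def parse_pairs_shlex(text):
    """Hypothetical helper around the POSIX-mode shlex approach."""
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True   # split only on the whitespace set below
    lexer.whitespace = ','          # commas separate the pairs
    lexer.commenters = ''           # assumption: '#' should not start a comment
    # Split each pair on the first '=' only, so later '=' stay in the value.
    return dict(pair.split('=', 1) for pair in lexer)

print(parse_pairs_shlex('expr="a=b,c",n=1'))
# {'expr': 'a=b,c', 'n': '1'}
```

    In POSIX mode the quotes are removed by the lexer itself, so the quoted comma never acts as a separator and the quoted = survives because only the first = in each pair is used for the split.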

  • 2021-01-02 13:52

    Ok, I actually figured out a pretty nifty way, which is to split on both comma and equal sign, then take 2 tokens at a time.

    import shlex

    input_str = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    
    lexer = shlex.shlex(input_str)
    lexer.whitespace_split = True
    lexer.whitespace = ',='
    
    ret = {}
    try:
      while True:
        key = next(lexer)
        value = next(lexer)
    
        # Remove surrounding quotes
        if len(value) >= 2 and (value[0] == value[-1] == '"' or
                                value[0] == value[-1] == '\''):
          value = value[1:-1]
    
        ret[key] = value
    
    except StopIteration:
      # Somehow do error checking to see if you ended up with an extra token.
      pass
    
    print(ret)
    

    Then you get:

    {
      'age': '12',
      'name': 'bob',
      'hobbies': 'games,reading',
      'phrase': "I'm cool!",
    }
    

    However, this doesn't reject malformed input such as age,12=name,bob, because commas and equals signs are treated as the same kind of separator. I'm OK with that in my use case.

    EDIT: Handle both double-quotes and single-quotes.
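
    The "extra token" check mentioned in the code comment could look something like the sketch below. It collects all tokens first and rejects an odd count; parse_pairs_tokens is a hypothetical name, and raising ValueError is just one possible way to report the problem.

```python
import shlex

def parse_pairs_tokens(text):
    """Hypothetical variant that fails on a dangling key or value."""
    lexer = shlex.shlex(text)       # non-POSIX mode keeps the quotes
    lexer.whitespace_split = True
    lexer.whitespace = ',='         # both separators end a token
    tokens = list(lexer)
    if len(tokens) % 2:             # odd count: a key without a value
        raise ValueError(f"dangling token: {tokens[-1]!r}")
    result = {}
    for key, value in zip(tokens[::2], tokens[1::2]):
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in '"\'':
            value = value[1:-1]
        result[key] = value
    return result

print(parse_pairs_tokens('age=12,hobbies="games,reading"'))
# {'age': '12', 'hobbies': 'games,reading'}
```

    This still accepts input like age,12=name,bob, since the count is even; it only catches the case where the tokens do not pair up.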

  • 2021-01-02 13:54

    You could abuse Python's tokenizer to parse the key-value list:

    #!/usr/bin/env python
    from tokenize import generate_tokens, NAME, NUMBER, OP, STRING, ENDMARKER
    
    def parse_key_value_list(text):
        key = value = None
        for type, string, _,_,_ in generate_tokens(lambda it=iter([text]): next(it)):
            if type == NAME and key is None:
                key = string
            elif type in {NAME, NUMBER, STRING}:
                value = {
                    NAME: lambda x: x,
                    NUMBER: int,
                    STRING: lambda x: x[1:-1]
                }[type](string)
            elif ((type == OP and string == ',') or
                  (type == ENDMARKER and key is not None)):
                yield key, value
                key = value = None
    
    text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    print(dict(parse_key_value_list(text)))
    

    Output

    {'phrase': "I'm cool!", 'age': 12, 'name': 'bob', 'hobbies': 'games,reading'}
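
    To make it clearer what the parser has to work with, this sketch lists the tokens the tokenize module produces for a short input. The helper name significant_tokens is made up for illustration, and it deliberately filters out the NEWLINE and ENDMARKER tokens that the tokenizer appends.

```python
import io
from token import tok_name
from tokenize import generate_tokens, NAME, NUMBER, OP, STRING

def significant_tokens(text):
    """Hypothetical helper: (type name, text) for the tokens that matter here."""
    readline = io.StringIO(text).readline
    return [(tok_name[tok.type], tok.string)
            for tok in generate_tokens(readline)
            if tok.type in {NAME, NUMBER, OP, STRING}]

for name, string in significant_tokens('age=12,hobbies="games,reading"'):
    print(name, string)
# NAME age
# OP =
# NUMBER 12
# OP ,
# NAME hobbies
# OP =
# STRING "games,reading"
```

    The quoted value arrives as a single STRING token with its quotes still attached, which is why the parser strips the first and last character, and the comma inside it never shows up as an OP token.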
    

    You could use a finite-state machine (FSM) to implement a stricter parser. The parser uses only the current state and the next token to parse input:

    #!/usr/bin/env python
    from tokenize import generate_tokens, NAME, NEWLINE, NUMBER, OP, STRING, ENDMARKER
    
    def parse_key_value_list(text):
        def check(condition):
            if not condition:
                raise ValueError((state, token))
    
        KEY, EQ, VALUE, SEP = range(4)
        state = KEY
        for token in generate_tokens(lambda it=iter([text]): next(it)):
            type, string = token[:2]
            if type == NEWLINE:
                # Python 3 emits an implicit NEWLINE before ENDMARKER; skip it
                continue
            if state == KEY:
                check(type == NAME)
                key = string
                state = EQ
            elif state == EQ:
                check(type == OP and string == '=')
                state = VALUE
            elif state == VALUE:
                check(type in {NAME, NUMBER, STRING})
                value = {
                    NAME: lambda x: x,
                    NUMBER: int,
                    STRING: lambda x: x[1:-1]
                }[type](string)
                state = SEP
            elif state == SEP:
                check(type == OP and string == ',' or type == ENDMARKER)
                yield key, value
                state = KEY
    
    text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    print(dict(parse_key_value_list(text)))
    