Using Python to parse a colon (:) delimited string into an object

臣服心动 2021-01-20 16:29

I have a string that is the return value of a REST API (http://requesttracker.wikia.com/wiki/REST); it uses colon-separated key/value pairs.

id: 123414
nam         


        
4 answers
  • 2021-01-20 16:53

    You really need to say which REST API this is and provide a documentation reference.

    Superficially, it doesn't look too hard:

    # Look Ma, no imports!
    >>> s = 'id: 1234\nname: Peter\nmessage: foo bar zot\nmsg2: tee:hee\n'
    >>> dict(map(str.strip, line.split(':', 1)) for line in s.splitlines())
    {'message': 'foo bar zot', 'msg2': 'tee:hee', 'id': '1234', 'name': 'Peter'}
    

    But: (1) the documentation should point you at a parser; (2) nothing is ever as easy as it seems from one simple example (see tee:hee above). If you decide to roll your own, you should break the above one-liner up into multiple steps so that you can do some error checking (e.g. that line.split(':', 1) returns exactly 2 pieces).
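
    A minimal sketch of that multi-step version, written for Python 3 (the session above is Python 2). The function name parse_pairs, and the choices to skip blank lines and reject duplicate keys, are assumptions made for illustration and do not come from any documented format.

```python
def parse_pairs(text):
    """Parse 'key: value' lines into a dict, failing loudly on bad lines."""
    result = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        # Split on the first colon only, so values may contain colons.
        pieces = line.split(":", 1)
        if len(pieces) != 2:
            raise ValueError("line %d has no colon: %r" % (lineno, line))
        key, value = (piece.strip() for piece in pieces)
        if key in result:
            # assumption: a repeated key is an error rather than an override
            raise ValueError("line %d repeats key %r" % (lineno, key))
        result[key] = value
    return result


pairs = parse_pairs("id: 1234\nname: Peter\nmessage: foo bar zot\nmsg2: tee:hee\n")
```

    Because the split is limited to the first colon, msg2 keeps its full value 'tee:hee', and a line without any colon raises ValueError instead of failing inside dict().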

    Update after the API reference was given:

    At first glance, the website gives an enormous number of examples without actually stating what the format is. I suggest that you give it more than a glance; if that fails, ask the author/maintainer.

    Update 2 after actual example input given, and after comment "I just tried this and got crashed":

    The code supplied was in response to the first (ambiguous) example input, in which all lines except the last contained a colon. It came with a suggestion to do the work in pieces instead of as a one-liner, with special mention of checking the result of split(':', 1). What code did you use? What exactly does "got crashed" mean? Have you tried to work out for yourself what your problem was, and fix it?

    What data did you feed it? Your long-awaited actual sample has colon-separated key:value lines preceded by a heading line and an empty line and followed by an empty line. These can be blissfully ignored by a trivial adjustment to the one-liner:

    >>> print dict(map(str.strip, line.split(':', 1)) for line in s.splitlines()[2:-1])
    {'Status': 'new', 'Resolved': 'Not set', 'CF.{Severity}': '',
    'TimeLeft': '0', 'Creator': 'young.park', 'Cc': '', 'Starts': 'Not set',
    'Created': 'Mon Apr 25 15:50:27 2011', 'Due': 'Not set',
    'LastUpdated': 'Mon Apr 25 15:50:28 2011', 'Started': 'Not set',
    'Priority': '0', 'Requestors': 'superuser@meme.com',
    'AdminCc': '', 'Owner': 'Nobody', 'Told': 'Not set',
    'TimeEstimated': '0', 'InitialPriority': '0', 'FinalPriority': '0',
    'TimeWorked': '0', 'Subject': 'testing'}
    >>>
    

    Note 1: above output edited manually to avoid horizontal scrolling.

    Note 2: the output includes the Created and LastUpdated entries, whose values contain colons.

    If you don't believe in blissfully ignoring things, you can do the splitlines first, and assert that the first line contains something like the expected heading, and that the second and last lines are empty.
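
    A sketch of that stricter variant, for Python 3. The function name parse_rt_response is hypothetical, the check that the heading starts with 'RT/' is an assumption based only on the sample shown in this thread, and the shortened sample stands in for the full response. It raises ValueError rather than using assert, so the checks survive running Python with -O.

```python
def parse_rt_response(text):
    """Parse a response shaped like: heading, blank line, pairs, blank line."""
    lines = text.splitlines()
    # Assumption: the heading looks like 'RT/3.8.8 200 Ok'.
    if len(lines) < 3 or not lines[0].startswith("RT/"):
        raise ValueError("unexpected heading: %r" % lines[:1])
    if lines[1] != "" or lines[-1] != "":
        raise ValueError("expected empty second and last lines")
    # Same one-liner as before, applied only to the checked body lines.
    return dict(map(str.strip, line.split(":", 1)) for line in lines[2:-1])


sample = (
    "RT/3.8.8 200 Ok\n"
    "\n"
    "id: ticket/46863\n"
    "Status: new\n"
    "Created: Mon Apr 25 15:50:27 2011\n"
    "Cc:\n"
    "\n"
)
ticket = parse_rt_response(sample)
```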

  • 2021-01-20 16:59

    The examples look like customized HTTP messages (they are not; that would be too simple), so you could use rfc822.Message to parse them:

    import rfc822
    from cStringIO import StringIO
    
    # skip status line; read headers
    m = rfc822.Message(StringIO(raw_text[raw_text.index('\n\n')+2:]))
    

    Now you have access to individual headers:

    >>> m.getheader('queue')
    'customer-test'
    >>> m.getrawheader('queue')
    ' customer-test\n'
    >>> m.getheader('created')
    'Mon Apr 25 15:50:27 2011'
    >>> m.getdate('created')
    (2011, 4, 25, 15, 50, 27, 0, 1, 0)
    

    All headers:

    >>> from pprint import pprint
    >>> pprint(dict(m.items()))
    {'admincc': '',
     'cc': '',
     'cf.{severity}': '',
     'created': 'Mon Apr 25 15:50:27 2011',
     'creator': 'young.park',
     'due': 'Not set',
     'finalpriority': '0',
     'id': 'ticket/46863',
     'initialpriority': '0',
     'lastupdated': 'Mon Apr 25 15:50:28 2011',
     'owner': 'Nobody',
     'priority': '0',
     'queue': 'customer-test',
     'requestors': 'superuser@meme.com',
     'resolved': 'Not set',
     'started': 'Not set',
     'starts': 'Not set',
     'status': 'new',
     'subject': 'testing',
     'timeestimated': '0',
     'timeleft': '0',
     'timeworked': '0',
     'told': 'Not set'}
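
    The rfc822 and cStringIO modules used above exist only in Python 2. As a hedged sketch, the same idea on Python 3 could use the standard email package; the shortened raw_text below stands in for the real response, and parsing the date with datetime.strptime is one possible substitute for getdate, not the only one.

```python
import email
from datetime import datetime

# Shortened sample in the shape of the response quoted in this thread.
raw_text = (
    "RT/3.8.8 200 Ok\n"
    "\n"
    "id: ticket/46863\n"
    "Queue: customer-test\n"
    "Cc:\n"
    "Created: Mon Apr 25 15:50:27 2011\n"
    "CF.{Severity}: \n"
    "\n"
)

# Skip the status line and the blank line after it; parse the rest as headers.
header_text = raw_text[raw_text.index("\n\n") + 2:]
m = email.message_from_string(header_text)

queue = m["queue"]        # header lookups are case-insensitive
fields = dict(m.items())  # keys keep the spelling used in the input
# The format string assumes English day and month names (C locale).
created = datetime.strptime(m["Created"], "%a %b %d %H:%M:%S %Y")
```

    Unlike the rfc822 output above, the keys returned by items() here are not lowercased.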
    
  • 2021-01-20 17:03

    Given how little your question says, we are left to guess what the crucial problem is. I can't believe you have never heard of the string methods, so I assume you have no idea how to use them in this case.

    There is certainly a way to get what you want with string methods, and I have an idea of how, but I prefer to go straight to regexes, on the assumption that the difficulty is capturing a value after the colon that contains newlines:

    import re
    
    regx = re.compile ('(^[^:]+):((?:[^:]+\r?\n)*[^:]+)$',re.MULTILINE)
    
    coloned = '''id: 123414
    name: Peter
    message: bla bla
    bla bla
    the end: of the text'''
    
    print regx.findall(coloned)
    

    gives

    [('id', ' 123414'), ('name', ' Peter'), ('message', ' bla bla\nbla bla'), ('the end', ' of the text')]
    


    EDIT

    So there was no difficulty in this "problem"

    import re
    
    regx = re.compile ('^([^:\n]+): *(.*?) *$',re.MULTILINE)
    
    ch = ('RT/3.8.8 200 Ok\n'                                    '\n'
          'id: ticket/46863\n'      'Queue: customer-test\n'
          'Owner: Nobo:dy\n'        'Creator: young.park\n'
          'Subject: testing\n'      'Status: new\n'
          'Priority: 0\n'           'InitialPriority: 0\n'
          'FinalPriority: 0\n'      'Requestors: superuser@meme.com\n'
          'Cc:\nAdminCc:\n'         'Created: Mon Apr 25 15:50:27 2011\n'
          'Starts: Not set\n'       'Started: Not set\n'
          'Due: Not set\n'          'Resolved: Not set\n'
          'Told: Not set\n'         'LastUpdated: Mon Apr 25 15:50:28 2011\n'
          'TimeEstimated: 0\n'      'TimeWorked: 0\n'
          'TimeLeft: 0\n'           'CF.{Severity}: \n'           '\n')
    
    print dict(regx.findall(ch))
    print
    
    s = 'id: 1234\nname: Peter\nmessage: foo bar zot\nmsg2: tee:hee\n'
    print dict(regx.findall(s))
    

    result

    {'Due': 'Not set', 'Priority': '0', 'id': 'ticket/46863', 'Told': 'Not set', 'Status': 'new', 'Started': 'Not set', 'Requestors': 'superuser@meme.com', 'FinalPriority': '0', 'Resolved': 'Not set', 'Created': 'Mon Apr 25 15:50:27 2011', 'AdminCc': '', 'Starts': 'Not set', 'Queue': 'customer-test', 'TimeWorked': '0', 'TimeLeft': '0', 'Creator': 'young.park', 'Cc': '', 'LastUpdated': 'Mon Apr 25 15:50:28 2011', 'CF.{Severity}': '', 'Owner': 'Nobo:dy', 'TimeEstimated': '0', 'InitialPriority': '0', 'Subject': 'testing'}
    
    {'message': 'foo bar zot', 'msg2': 'tee:hee', 'id': '1234', 'name': 'Peter'}
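
    For readers on Python 3, a sketch of the same regex approach with the pattern spelled out piece by piece. The name PAIR and the shortened sample are illustrative assumptions; the pattern itself is the one used above.

```python
import re

# ^([^:\n]+)  key: everything up to the first colon, never crossing a newline
# : *         the separating colon and any spaces after it
# (.*?) *$    value: the rest of the line, without trailing spaces
PAIR = re.compile(r"^([^:\n]+): *(.*?) *$", re.MULTILINE)

sample = (
    "RT/3.8.8 200 Ok\n"
    "\n"
    "id: ticket/46863\n"
    "Owner: Nobo:dy\n"
    "Cc:\n"
    "CF.{Severity}: \n"
    "\n"
)

# The heading and the blank lines contain no colon, so they never match.
fields = dict(PAIR.findall(sample))
```

    Only the first colon on a line separates key from value, which is why the value 'Nobo:dy' survives intact.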
    


    John Machin, I didn't muck about with this new regex: it took me one minute to rewrite, and the first attempt would not have taken much longer if we hadn't had to beg for the basic information needed to answer.

    Three remarks:

    • If the input ever changes and an extra empty line appears anywhere among the others, your solution will crash, while my regex solution will keep working. Your solution needs an added if ':' in line filter.

    • I compared the execution times:

      my regex solution: 0.000152533352703 seconds; yours: 0.000225727012791 seconds (+48%).

      With if ':' in line added, yours is slightly slower: 0.000246958761519 seconds (+62%).

      Speed isn't important here, but in other applications it is good to know that regexes are very fast (100 times faster than lxml, and 1000 times faster than BeautifulSoup).

    • You are a specialist in the CSV format. A solution using StringIO and the csv module's functions might also be possible.
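
    To make the first two remarks concrete, here is a sketch for Python 3 that puts the split-based one-liner (with the if ':' in line filter added) next to the regex version and times both with timeit. The function names and the sample are assumptions for illustration; timings vary by machine and Python version, so no figures are claimed here.

```python
import re
import timeit

PAIR = re.compile(r"^([^:\n]+): *(.*?) *$", re.MULTILINE)


def parse_split(text):
    """Split-based parser; the filter skips lines that contain no colon."""
    return dict(
        map(str.strip, line.split(":", 1))
        for line in text.splitlines()
        if ":" in line
    )


def parse_regex(text):
    """Regex-based parser; lines without a colon simply do not match."""
    return dict(PAIR.findall(text))


# An empty line in the middle no longer breaks the split-based version.
sample = "id: 1234\n\nname: Peter\nmsg2: tee:hee\n"

if __name__ == "__main__":
    for fn in (parse_split, parse_regex):
        seconds = timeit.timeit(lambda: fn(sample), number=10000)
        print(fn.__name__, seconds)
```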
  • 2021-01-20 17:07

    That looks like YAML. Have you tried PyYAML?

    >>> import yaml
    >>> s = """id: 123414
    ... name: Peter
    ... message: bla bla
    ...   bla bla"""
    >>> yaml.safe_load(s)
    {'message': 'bla bla bla bla', 'id': 123414, 'name': 'Peter'}
    