Processing a repeatedly structured text file with Python

有刺的猬 2020-12-10 09:45

I have a big text file structured in blocks like:

Student = {
        PInfo = {
                ID   = 0001;
            Name.First = "Joe";
            Na         


        
3 Answers
  • 2020-12-10 10:02

    To parse the file you could define a grammar that describes your input format and use it to generate a parser.

    There are many parser generators for Python. For example, you could use Grako, which takes a grammar written in a variation of EBNF as input and outputs a memoizing PEG parser in Python.

    To install Grako, run pip install grako.

    Here's a grammar for your format using Grako's flavor of EBNF syntax:

    (* a file is zero or more records *)
    file = { record }* $;
    record = name '=' value ';' ;
    name = /[A-Z][a-zA-Z0-9.]*/ ;
    value = object | integer | string ;
    (* an object contains one or more records *)
    object = '{' { record }+ '}' ;
    integer = /[0-9]+/ ;
    string = '"' /[^"]*/ '"';
    

    To generate the parser, save the grammar to a file, e.g. Structured.ebnf, and run:

    $ grako -o structured_parser.py Structured.ebnf
    

    This creates a structured_parser module that can be used to extract the student information from the input:

    #!/usr/bin/env python
    from structured_parser import StructuredParser
    
    class Semantics(object):
        def record(self, ast):
            # record = name '=' value ';' ;
            # value = object | integer | string ;
            return ast[0], ast[2] # name, value
        def object(self, ast):
            # object = '{' { record }+ '}' ;
            return dict(ast[1])
        def integer(self, ast):
            # integer = /[0-9]+/ ;
            return int(ast)
        def string(self, ast):
            # string = '"' /[^"]*/ '"';
            return ast[1]
    
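    # Read the whole input file into memory.
    with open('input.txt') as f:
        text = f.read()
    parser = StructuredParser()
    # Parse from the top-level 'file' rule; Semantics converts nodes as they are built.
    ast = parser.parse(text, rule_name='file', semantics=Semantics())
    # Keep only the top-level Student records.
    students = [value for name, value in ast if name == 'Student']
    # Map "First Last" to the fields of interest.
    d = {'{0[Name.First]} {0[Name.Last]}'.format(s['PInfo']):
         dict(School=s['School'], Zip=s['Address']['Zip'])
         for s in students}
    from pprint import pprint
    pprint(d)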
    

    Output

    {'Joe Burger': {'School': u'West High', 'Zip': 12345},
     'John Smith': {'School': u'East High', 'Zip': 12346}}
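
    If installing Grako is not an option, the same grammar can be followed by hand. The block below is a rough, standard-library-only sketch of a recursive-descent parser with one method per grammar rule; it is not what Grako generates. The Parser class, the tokenize helper and the SAMPLE text are all hypothetical: the question's input is truncated, so the sample is only a guess based on the field names and output shown in this answer.

```python
import re

# Hypothetical sample input (the real file is not shown in full).
SAMPLE = '''
Student = {
    PInfo = { ID = 0001; Name.First = "Joe"; Name.Last = "Burger"; };
    School = "West High";
    Address = { Zip = 12345; };
};
Student = {
    PInfo = { ID = 0002; Name.First = "John"; Name.Last = "Smith"; };
    School = "East High";
    Address = { Zip = 12346; };
};
'''

# One token per match: a name, an integer, a quoted string, or punctuation.
TOKEN = re.compile(
    r'\s*(?:(?P<name>[A-Z][a-zA-Z0-9.]*)'
    r'|(?P<integer>[0-9]+)'
    r'|"(?P<string>[^"]*)"'
    r'|(?P<punct>[={};]))')


def tokenize(text):
    """Yield (kind, value) pairs; raise SyntaxError on unknown input."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError('unexpected input at offset %d' % pos)
        pos = m.end()
        yield m.lastgroup, m.group(m.lastgroup)


class Parser:
    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)  # end of input

    def expect(self, kind, value=None):
        tok_kind, tok_value = self.peek()
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError('expected %s %r, got %r' % (kind, value, tok_value))
        self.pos += 1
        return tok_value

    def file(self):
        # file = { record }* $ ;
        records = []
        while self.peek()[0] is not None:
            records.append(self.record())
        return records

    def record(self):
        # record = name '=' value ';' ;
        name = self.expect('name')
        self.expect('punct', '=')
        value = self.value()
        self.expect('punct', ';')
        return name, value

    def value(self):
        # value = object | integer | string ;
        kind, value = self.peek()
        if (kind, value) == ('punct', '{'):
            return self.object()
        if kind == 'integer':
            self.pos += 1
            return int(value)
        return self.expect('string')

    def object(self):
        # object = '{' { record }+ '}' ;
        self.expect('punct', '{')
        records = [self.record()]
        while self.peek() != ('punct', '}'):
            records.append(self.record())
        self.expect('punct', '}')
        return dict(records)


records = Parser(SAMPLE).file()
students = [value for name, value in records if name == 'Student']
summary = {
    '%s %s' % (s['PInfo']['Name.First'], s['PInfo']['Name.Last']):
        dict(School=s['School'], Zip=s['Address']['Zip'])
    for s in students
}
```

    As in the Grako version, top-level records stay a list of (name, value) pairs so that repeated Student blocks are not collapsed, while nested objects become dictionaries.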
    
  • 2020-12-10 10:08

    For this kind of thing, I use Marpa::R2, a Perl interface to Marpa, a general BNF parser. It lets you describe the text as a set of grammar rules and parse it into a tree of arrays (a parse tree). You can then traverse the tree to save the results as a hash of hashes (a hash is Perl's equivalent of a Python dictionary) or use the tree as is.

    I cooked a working example using your input: parser, result tree.

    Hope this helps.

    P.S. Example of ast_traverse(): Parse values from a block of text based on specific keys

  • 2020-12-10 10:25

    It's not JSON, but it is similarly structured, so you should be able to reformat it into JSON:

    1. "=" -> ":"
    2. quote all keys with '"'
    3. ";" -> ","
    4. remove all "," which are followed by a "}"
    5. put it in curly braces
    6. parse it with json.loads
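
    The steps above can be sketched roughly as follows. This is only one possible reading of them, under several assumptions: string values never contain "=", ";" or ":", keys look like Name.First, and the SAMPLE text is hypothetical because the question's input is truncated. Two details go beyond the list: JSON rejects integers with leading zeros such as 0001, so they are stripped here, and repeated keys such as Student would be collapsed by a plain json.loads, so a small object_pairs_hook (keep_duplicates, a made-up helper) keeps them as pairs.

```python
import json
import re

# Hypothetical sample input (the real file is not shown in full).
SAMPLE = '''
Student = {
    PInfo = {
        ID = 0001;
        Name.First = "Joe";
        Name.Last = "Burger";
    };
    School = "West High";
    Address = {
        Zip = 12345;
    };
};
Student = {
    PInfo = {
        ID = 0002;
        Name.First = "John";
        Name.Last = "Smith";
    };
    School = "East High";
    Address = {
        Zip = 12346;
    };
};
'''


def to_json_text(text):
    # Steps 1 and 2: turn `Key =` into `"Key":`.
    text = re.sub(r'([A-Za-z][A-Za-z0-9.]*)\s*=', r'"\1":', text)
    # Extra step: strip leading zeros, which JSON does not allow in numbers.
    text = re.sub(r':\s*0+(\d)', r': \1', text)
    # Step 3: ";" -> ",".
    text = text.replace(';', ',')
    # Step 5: wrap everything in curly braces.
    text = '{' + text + '}'
    # Step 4: drop every "," that is directly followed by a "}".
    return re.sub(r',\s*}', '}', text)


def keep_duplicates(pairs):
    # Return a dict when keys are unique, otherwise keep the raw pairs
    # so that repeated keys (e.g. several Student blocks) survive.
    keys = [key for key, _ in pairs]
    return dict(pairs) if len(set(keys)) == len(keys) else pairs


def load_records(text):
    # Step 6: parse with json.loads; always return a list of (key, value).
    parsed = json.loads(to_json_text(text), object_pairs_hook=keep_duplicates)
    return list(parsed.items()) if isinstance(parsed, dict) else parsed


records = load_records(SAMPLE)
students = [value for key, value in records if key == 'Student']
```

    Because this works on raw text with regular expressions, it is fragile compared with a real parser; it is best kept for files whose string values are known to be simple.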