Extracting info from large structured text files

鱼传尺愫 2021-01-15 18:22

I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern "No.999999999 dd/mm/yyyy ZZZ".
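
For context, a minimal sketch of splitting such a file into blank-line-separated groups (the file name is an assumption taken from the answers below):

    import itertools

    def blocks(path):
        """Yield each blank-line-separated group as a list of lines."""
        with open(path) as f:
            for is_blank, lines in itertools.groupby(f, key=lambda l: not l.strip()):
                if not is_blank:
                    yield list(lines)

    for group in blocks('rm1972.txt'):
        header = group[0]  # expected to match the "No.999999999 dd/mm/yyyy ZZZ" pattern
        print(header.rstrip())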

5 Answers
  • 2021-01-15 18:59

    Another version with only one combined regular expression:

    #!/usr/bin/python
    
    import re
    import pprint
    import sys
    
    class Despacho(object):
        """
        Class to parse each line, applying the regexp and storing the results
        for future use
        """
        # all patterns combined into a single regexp with named groups
        regexp = re.compile(
            r'No.(?P<processo>[\d]{9})  (?P<data>[\d]{2}/[\d]{2}/[\d]{4})  (?P<despacho>.*)'
            r'|Tit.(?P<titular>.*)'
            r'|Procurador: (?P<procurador>.*)'
            r'|C.N.P.J./C.I.C./N INPI :(?P<documento>.*)'
            r'|Apres.: (?P<apresentacao>.*) ; Nat.: (?P<natureza>.*)'
            r'|Marca: (?P<marca>.*)'
            r'|Clas.Prod/Serv: (?P<classe>.*)'
            r'|\*(?P<complemento>.*)')
    
        simplefields = ('processo', 'data', 'despacho', 'titular', 'procurador',
                        'documento', 'apresentacao', 'natureza', 'marca', 'classe')
    
        def __init__(self):
            """
            'complemento' is the only field that can be multiple in a single
            registry
            """
            self.__dict__ = dict.fromkeys(self.simplefields)
            self.complemento = []
    
        def parse(self, line):
            m = self.regexp.match(line)
            if m:
                gd = dict((k, v) for k, v in m.groupdict().items() if v)
                if 'complemento' in gd:
                    self.complemento.append(gd['complemento'])
                else:
                    self.__dict__.update(gd)
    
        def __repr__(self):
            # defines object printed representation
            return pprint.pformat(self.__dict__)
    
    def process(rpi):
        """
        read data and process each group
        """
        d = None

        for line in rpi:
            if line.startswith('No.'):
                if d:
                    yield d
                d = Despacho()
            if d: # skip anything before the first 'No.' header
                d.parse(line)
        if d:
            yield d
    
    def main():
        arquivo = open('rm1972.txt') # file to process
        for desp in process(arquivo):
            print(desp) # can print directly here.
            print('-' * 20)
    
    if __name__ == '__main__':
        main()
    
  • 2021-01-15 19:06

    It looks good overall, but why do you have the line:

    rpi = (line for line in rpi)
    

    You can already iterate over the file object without this intermediate step.
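
    To illustrate (file name assumed from the other answers), a file object is already an iterator over its lines, so the wrapper changes nothing:

    rpi = open('rm1972.txt')        # a file object already yields lines when iterated
    # rpi = (line for line in rpi)  # redundant: wraps one iterator in another
    for line in rpi:
        print(line.rstrip())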

  • 2021-01-15 19:12

    It would be easier to help if you had a specific concern. Performance will depend greatly on the efficiency of the particular regex engine you are using. 100K lines in a single file doesn't sound that big, but again it all depends on your environment.

    I use Expresso in my .NET development to test expressions for accuracy and performance. A Google search turned up Kodos, a GUI Python regex authoring tool.
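
    If raw matching speed is the worry, the standard timeit module gives a quick number without any extra tools; a rough sketch (the pattern and sample line are just examples):

    import re
    import timeit

    pattern = re.compile(r'No\.(\d{9})  (\d{2}/\d{2}/\d{4})  (.*)')
    sample = 'No.123456789  01/02/1972  DESPACHO DE EXEMPLO\n'

    # time 100,000 match() calls against the precompiled pattern,
    # roughly one pass over a file of that size
    elapsed = timeit.timeit(lambda: pattern.match(sample), number=100000)
    print('%.3f s for 100k match() calls' % elapsed)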

  • 2021-01-15 19:19

    That is pretty good. Below are some suggestions; let me know if you like them:

    import re
    import pprint
    import sys
    
    class Despacho(object):
        """
        Class to parse each line, applying the regexp and storing the results
        for future use
        """
        #used a dict with the keys instead of functions.
        regexp = {
            ('processo', 
             'data', 
             'despacho'): re.compile(r'No.([\d]{9})  ([\d]{2}/[\d]{2}/[\d]{4})  (.*)'),
            ('titular',): re.compile(r'Tit.(.*)'),
            ('procurador',): re.compile(r'Procurador: (.*)'),
            ('documento',): re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'),
            ('apresentacao',
             'natureza'): re.compile(r'Apres.: (.*) ; Nat.: (.*)'),
            ('marca',): re.compile(r'Marca: (.*)'),
            ('classe',): re.compile(r'Clas.Prod/Serv: (.*)'),
            ('complemento',): re.compile(r'\*(.*)'),
        }
    
        def __init__(self):
            """
            'complemento' is the only field that can be multiple in a single registry
            """
            self.complemento = []
    
    
        def read(self, line):
            for attrs, pattern in Despacho.regexp.items():
                m = pattern.match(line)
                if m:
                    for groupn, attr in enumerate(attrs):
                        # special case complemento:
                        if attr == 'complemento':
                            self.complemento.append(m.group(groupn + 1))
                        else:
                            # set the attribute on the object
                            setattr(self, attr, m.group(groupn + 1))
    
        def __repr__(self):
            # defines object printed representation
            d = {}
            for attrs in self.regexp:
                for attr in attrs:
                    d[attr] = getattr(self, attr, None)
            return pprint.pformat(d)
    
    def process(rpi):
        """
        read data and process each group
        """
        #Useless line, since you're doing a for anyway
        #rpi = (line for line in rpi)
        d = None
        group = False

        for line in rpi:
            if line.startswith('No.'):
                group = True
                d = Despacho()

            if not line.strip() and group: # empty line - end of block
                yield d
                group = False

            if d: # ignore anything before the first 'No.' header
                d.read(line)

        if group: # last block if the file does not end with a blank line
            yield d
    
    def main():
        arquivo = open('rm1972.txt') # file to process
        for desp in process(arquivo):
            print(desp) # can print directly here.
            print('-' * 20)
        return 0
    
    if __name__ == '__main__':
        main()
    
  • 2021-01-15 19:22

    I wouldn't use a regex here. If you know that your lines start with fixed strings, why not check those strings and write the logic around them?

    record = {}
    curr_key = None

    for line in open(filename): # 'filename' is the path to the report file
        if line[:3] == 'No.':
            curr_key = 'No'
            record['No'] = line[4:]
        # ... similar checks for the other fixed prefixes ...
        elif line.strip() == '':
            # blank line: store the finished record somewhere and start a new one
            record = {}
        else:
            # the record overflows to the next line: append it to the last key
            record[curr_key] = record[curr_key] + "\n" + line
    

    Consider the above as a sketch rather than finished code.
