Python: How to loop through blocks of lines

青春惊慌失措 2020-11-29 07:42

How to go through blocks of lines separated by an empty line? The file looks like the following:

ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
A         


        
10 Answers
  • 2020-11-29 08:17

    import itertools

    # Assuming input in file input.txt
    with open('input.txt') as f:
        data = f.readlines()

    # Group consecutive lines by "is not blank" and keep only the non-blank groups
    records = (lines for valid, lines in itertools.groupby(data, lambda l: l.strip() != '') if valid)
    # Skip the ID line of each record and keep the value after the first colon
    output = [tuple(field.split(':', 1)[1].strip() for field in itertools.islice(record, 1, None)) for record in records]

    # To get a generator instead, build output this way rather than as a list
    # (records is itself a generator, so it can only be consumed once):
    # output = (tuple(field.split(':', 1)[1].strip() for field in itertools.islice(record, 1, None)) for record in records)

    # output = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
    # You can iterate and change the order of elements in the way you want:
    # [(elem[1], elem[0], elem[2]) for elem in output] as required in your output
    
  • 2020-11-29 08:19

    Use a dict, namedtuple, or custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.
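
    A minimal sketch of the dict variant of this idea, assuming every non-blank line has the form "Key: value". The function name parse_records and the sample text are illustrative, not from the question:

```python
import io

def parse_records(lines):
    """Collect 'Key: value' lines into one dict per blank-line-separated block."""
    records = []
    current = {}
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: the current record (if any) is complete.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain ':'.
        key, _, value = line.partition(':')
        current[key.strip()] = value.strip()
    # End of input: flush the last record, which has no trailing blank line.
    if current:
        records.append(current)
    return records

# Hypothetical sample input, standing in for an open file object.
sample = io.StringIO(
    "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n"
    "\n"
    "ID: 2\nName: H\nFamilyN: F\nAge: 23\n"
)
records = parse_records(sample)
for record in records:
    print(record['FamilyN'], record['Name'], record['Age'])
```

    Since an open file iterates line by line, passing the file object itself to parse_records should avoid loading the whole file at once.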

  • 2020-11-29 08:20

    Use a generator.

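    def blocks(iterable):
        # start_pattern() is a placeholder predicate for the line that ends one
        # block and begins the next (for this file, a blank line)
        accumulator = []
        for line in iterable:
            if start_pattern(line):
                if accumulator:
                    yield accumulator
                    accumulator = []
            # elif other significant patterns
            else:
                accumulator.append(line)
        # Emit the final block, which has no separator after it
        if accumulator:
            yield accumulator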
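
    For the file in the question, the generator above might be filled in as follows. This is only a sketch: the predicate is_separator and the is_boundary parameter are assumptions added for illustration, and the sample lines stand in for an open file.

```python
def is_separator(line):
    # Hypothetical predicate: empty or whitespace-only lines separate blocks.
    return not line.strip()

def blocks(iterable, is_boundary=is_separator):
    """Yield lists of consecutive lines, split wherever is_boundary(line) is true."""
    accumulator = []
    for line in iterable:
        if is_boundary(line):
            # Several separators in a row do not produce empty blocks.
            if accumulator:
                yield accumulator
                accumulator = []
        else:
            accumulator.append(line.rstrip('\n'))
    # Emit the final block, which has no separator after it.
    if accumulator:
        yield accumulator

# Sample lines standing in for an open file object.
sample_lines = ["ID: 1\n", "Name: X\n", "\n", "\n", "ID: 2\n", "Name: H\n"]
result = list(blocks(sample_lines))
print(result)  # [['ID: 1', 'Name: X'], ['ID: 2', 'Name: H']]
```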
    
  • 2020-11-29 08:23

    If your file is too large to read into memory all at once, you can still use a regex-based solution by memory-mapping the file with the mmap module:

    import sys
    import re
    import mmap

    # mmap exposes bytes, so the pattern must be a bytes pattern
    block_expr = re.compile(rb'ID:.*?\nAge: \d+', re.DOTALL)

    filepath = sys.argv[1]
    with open(filepath, 'rb') as fp:
        # A length of 0 maps the whole file
        contents = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)

        for block_match in block_expr.finditer(contents):
            print(block_match.group().decode())
    

    The mmap trick provides a "pretend string" so that regular expressions can work on the file without reading it all into one large string. The finditer() method of the compiled regular expression yields matches one at a time, without building a list of all matches at once (which findall() does).

    I do think this solution is overkill for this use case, but it's a nice trick to know.

  • 2020-11-29 08:26

    Here's another way, using itertools.groupby. The function groupby iterates through the lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns either True or False (called the key), and itertools.groupby then groups all consecutive lines that yielded the same key.

    This is a very convenient way to collect lines into groups.

    import itertools

    def isa_group_separator(line):
        return line == '\n'

    with open('data_file') as f:
        for key, group in itertools.groupby(f, isa_group_separator):
            # print(key, list(group))  # uncomment to see what itertools.groupby does.
            if not key:
                data = {}
                for item in group:
                    # Split on the first colon only, so values may contain ':'
                    field, value = item.split(':', 1)
                    value = value.strip()
                    data[field] = value
                print('{FamilyN} {Name} {Age}'.format(**data))
    
    # Y X 20
    # F H 23
    # Y S 13
    # Z M 25
    
  • 2020-11-29 08:26
    import re
    # subject is assumed to hold the full text of the file, e.g. from f.read()
    result = re.findall(
        r"""(?mx)           # multiline, verbose regex
        ^ID:.*\s*           # Match ID: and anything else on that line 
        Name:\s*(.*)\s*     # Match name, capture all characters on this line
        FamilyN:\s*(.*)\s*  # etc. for family name
        Age:\s*(.*)$        # and age""", 
        subject)
    

    Result will then be

    [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
    

    which can be trivially changed into whatever string representation you want.
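
    As one possible illustration of that last step, the tuples can be unpacked and reordered in a list comprehension. The "FamilyN Name Age" layout is an assumption about the desired output, and result is written out literally here so the snippet runs on its own:

```python
# The list of (name, family name, age) tuples produced by re.findall above.
result = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

# Reorder each tuple to "FamilyN Name Age" and join the fields with spaces.
formatted = ['{} {} {}'.format(family, name, age) for name, family, age in result]

for line in formatted:
    print(line)
```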
