How do I go through blocks of lines separated by an empty line? The file looks like the following:
ID: 1
Name: X
FamilyN: Y
Age: 20
ID: 2
Name: H
FamilyN: F
Age: 23
import itertools

# Assuming the input is in input.txt
data = open('input.txt').readlines()

records = (lines for valid, lines in itertools.groupby(data, lambda l: l != '\n') if valid)
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None))
          for record in records]
# output == [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

# For a generator instead of a list, use a generator expression (note that
# records can only be consumed once, so build either the list or the generator):
# output = (tuple(field.split(':')[1].strip()
#                 for field in itertools.islice(record, 1, None))
#           for record in records)

# You can then iterate and reorder the elements any way you want, e.g.
# [(elem[1], elem[0], elem[2]) for elem in output] as required in your output.
Use a dict, namedtuple, or custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.
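A minimal sketch of that approach using a plain dict (the helper name parse_records and the inline sample data are illustrative, not from the original answer):

```python
def parse_records(lines):
    records = []
    current = {}
    for line in lines:
        if line.strip() == '':           # a blank line ends the current record
            if current:
                records.append(current)
                current = {}
        else:
            key, _, value = line.partition(':')
            current[key.strip()] = value.strip()
    if current:                          # don't lose the last record at EOF
        records.append(current)
    return records

lines = ['ID: 1\n', 'Name: X\n', 'FamilyN: Y\n', 'Age: 20\n', '\n',
         'ID: 2\n', 'Name: H\n', 'FamilyN: F\n', 'Age: 23\n']
print(parse_records(lines))
# → [{'ID': '1', 'Name': 'X', 'FamilyN': 'Y', 'Age': '20'},
#    {'ID': '2', 'Name': 'H', 'FamilyN': 'F', 'Age': '23'}]
```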
Use a generator.
def blocks(iterable):
    accumulator = []
    for line in iterable:
        if start_pattern(line):  # e.g. a blank-line separator check for this data
            if accumulator:
                yield accumulator
            accumulator = []
        # elif other significant patterns
        else:
            accumulator.append(line)
    if accumulator:
        yield accumulator
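Concretely, with a blank-line start_pattern, the generator above can be exercised like this (condensed here so the snippet is self-contained; the inline sample data is illustrative):

```python
def start_pattern(line):
    # Assumption: blank lines separate the blocks in this data
    return line.strip() == ''

def blocks(iterable):
    accumulator = []
    for line in iterable:
        if start_pattern(line):
            if accumulator:
                yield accumulator
            accumulator = []
        else:
            accumulator.append(line)
    if accumulator:
        yield accumulator

lines = ['ID: 1\n', 'Name: X\n', 'Age: 20\n', '\n',
         'ID: 2\n', 'Name: H\n', 'Age: 23\n']
print(list(blocks(lines)))
# → [['ID: 1\n', 'Name: X\n', 'Age: 20\n'],
#    ['ID: 2\n', 'Name: H\n', 'Age: 23\n']]
```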
If your file is too large to read into memory all at once, you can still use a regular-expression-based solution by using a memory-mapped file, with the mmap module:
import sys
import re
import os
import mmap

# The pattern must be bytes, since mmap exposes the file as a bytes-like object
block_expr = re.compile(rb'ID:.*?\nAge: \d+', re.DOTALL)
filepath = sys.argv[1]
with open(filepath, 'rb') as fp:
    contents = mmap.mmap(fp.fileno(), os.stat(filepath).st_size, access=mmap.ACCESS_READ)
    for block_match in block_expr.finditer(contents):
        print(block_match.group().decode())
The mmap trick provides a "pretend string" so regular expressions can work on the file without reading it all into one large string. And the finditer() method of the compiled pattern yields matches one at a time, without building a list of all matches at once (which findall() does).
I do think this solution is overkill for this use case, however (still: it's a nice trick to know...).
Here's another way, using itertools.groupby.
The function groupby iterates through the lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns either True or False (called the key), and itertools.groupby then groups all the consecutive lines that yielded the same True or False result. This is a very convenient way to collect lines into groups.
import itertools

def isa_group_separator(line):
    return line == '\n'

with open('data_file') as f:
    for key, group in itertools.groupby(f, isa_group_separator):
        # print(key, list(group))  # uncomment to see what itertools.groupby does
        if not key:
            data = {}
            for item in group:
                field, value = item.split(':')
                data[field] = value.strip()
            print('{FamilyN} {Name} {Age}'.format(**data))

# Y X 20
# F H 23
# Y S 13
# Z M 25
import re

# subject holds the whole file as one string, e.g.:
# subject = open('data_file').read()
result = re.findall(
    r"""(?mx)           # multiline, verbose regex
    ^ID:.*\s*           # match ID: and anything else on that line
    Name:\s*(.*)\s*     # match name, capture all characters on this line
    FamilyN:\s*(.*)\s*  # etc. for family name
    Age:\s*(.*)$        # and age""",
    subject)
Result will then be
[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
which can be trivially changed into whatever string representation you want.
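For example, the (Name, FamilyN, Age) tuples can be reordered into the "FamilyN Name Age" strings the question asks for:

```python
result = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
# Reorder each tuple so the family name comes first
formatted = ['{} {} {}'.format(family, name, age) for name, family, age in result]
print(formatted)
# → ['Y X 20', 'F H 23', 'Y S 13', 'Z M 25']
```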