I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping ti
Why roll your own if there is heapq.merge()
in the standard library? Unfortunately it doesn't provide a key argument -- you have to do the decorate - merge - undecorate dance yourself:
from itertools import imap
from operator import itemgetter
import heapq
def extract_timestamp(line):
"""Extract timestamp and convert to a form that gives the
expected result in a comparison
"""
return line.split()[1] # for example
with open("log1.txt") as f1, open("log2.txt") as f2:
sources = [f1, f2]
with open("merged.txt", "w") as dest:
decorated = [
((extract_timestamp(line), line) for line in f)
for f in sources]
merged = heapq.merge(*decorated)
undecorated = imap(itemgetter(-1), merged)
dest.writelines(undecorated)
Every step in the above is "lazy". As I avoid file.readlines()
the lines in the files are read as needed. Likewise the decoration process which uses generator expressions rather than list-comps. heapq.merge()
is lazy, too -- it needs one item per input iterator simultaneously to do the necessary comparisons. Finally I'm using itertools.imap()
, the lazy variant of the map() built-in to undecorate.
(In Python 3 map() has become lazy, so you can use that)