I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping ti
Why roll your own if there is heapq.merge()
in the standard library? Unfortunately it doesn't provide a key argument -- you have to do the decorate - merge - undecorate dance yourself:
from itertools import imap
from operator import itemgetter
import heapq
def extract_timestamp(line):
"""Extract timestamp and convert to a form that gives the
expected result in a comparison
"""
return line.split()[1] # for example
with open("log1.txt") as f1, open("log2.txt") as f2:
sources = [f1, f2]
with open("merged.txt", "w") as dest:
decorated = [
((extract_timestamp(line), line) for line in f)
for f in sources]
merged = heapq.merge(*decorated)
undecorated = imap(itemgetter(-1), merged)
dest.writelines(undecorated)
Every step in the above is "lazy". As I avoid file.readlines()
the lines in the files are read as needed. Likewise the decoration process which uses generator expressions rather than list-comps. heapq.merge()
is lazy, too -- it needs one item per input iterator simultaneously to do the necessary comparisons. Finally I'm using itertools.imap()
, the lazy variant of the map() built-in to undecorate.
(In Python 3 map() has become lazy, so you can use that)
You want to implement a file-based merge sort. Read a line from both files, output the older line, then read another line from that file. Once one of the files is exhausted, output all the remaining lines from the other file.