Merging pre-sorted files without reading everything into memory

遥遥无期 2021-01-03 14:25

I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping ti

2 answers
  •  囚心锁ツ
    2021-01-03 14:55

    Why roll your own when the standard library has heapq.merge()? Unfortunately, in Python 2 (and in Python 3 before 3.5) it doesn't accept a key argument, so you have to do the decorate-merge-undecorate dance yourself:

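    try:
        from itertools import imap  # Python 2: lazy counterpart of map()
    except ImportError:
        imap = map  # Python 3: the built-in map() is already lazy
    from operator import itemgetter
    import heapq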
    
    def extract_timestamp(line):
        """Extract timestamp and convert to a form that gives the
        expected result in a comparison
        """
        return line.split()[1] # for example
    
    with open("log1.txt") as f1, open("log2.txt") as f2:
        sources = [f1, f2]
        with open("merged.txt", "w") as dest:
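            # Decorate: one lazy generator of (timestamp, line) pairs per file.
            decorated = [
                ((extract_timestamp(line), line) for line in f)
                for f in sources]
            # Merge: pairs compare by timestamp first, then by line text on ties.
            merged = heapq.merge(*decorated)
            # Undecorate: drop the timestamp and keep only the original line.
            undecorated = imap(itemgetter(-1), merged)
            dest.writelines(undecorated)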
    

    Every step above is "lazy". Because I avoid file.readlines(), lines are read from the files only as needed. The decoration step is lazy as well, since it uses generator expressions rather than list comprehensions. heapq.merge() is lazy, too: it only needs to hold one item per input iterator at a time to do the necessary comparisons. Finally, I'm using itertools.imap(), the lazy variant of the map() built-in, to undecorate.

    (In Python 3, map() is lazy itself, so you can use it in place of imap().)
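
On Python 3.5 or later, heapq.merge() accepts a key argument, so the decoration step can be dropped entirely. The following is only a minimal sketch, assuming the same example extract_timestamp() as above and input files whose lines all end with a newline. merge_logs() and its parameter names are hypothetical, introduced here for illustration.

```python
import heapq
from contextlib import ExitStack


def extract_timestamp(line):
    """Return the sort key for a log line.

    Assumption carried over from the example above: the timestamp is the
    second whitespace-separated field and compares correctly as a string.
    """
    return line.split()[1]


def merge_logs(source_paths, dest_path):
    """Lazily merge pre-sorted log files into dest_path (hypothetical helper)."""
    with ExitStack() as stack:
        # Open every input file; ExitStack closes them all on exit.
        sources = [stack.enter_context(open(path)) for path in source_paths]
        dest = stack.enter_context(open(dest_path, "w"))
        # File objects iterate line by line, and heapq.merge() keeps only
        # one pending line per input file in memory.
        dest.writelines(heapq.merge(*sources, key=extract_timestamp))
```

Calling merge_logs(["log1.txt", "log2.txt"], "merged.txt") would merge the two files from the example above. ExitStack is used only so that the number of input files does not have to be hard-coded in a with statement.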
