Merging pre-sorted files without reading everything into memory

遥遥无期 2021-01-03 14:25

I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping ti

2 answers
  • 2021-01-03 14:55

    Why roll your own when the standard library has heapq.merge()? Unfortunately, before Python 3.5 it doesn't accept a key argument, so you have to do the decorate-merge-undecorate dance yourself:

    import heapq
    from operator import itemgetter
    
    try:
        from itertools import imap  # Python 2: lazy map
    except ImportError:
        imap = map  # Python 3: the built-in map() is already lazy
    
    def extract_timestamp(line):
        """Extract timestamp and convert to a form that gives the
        expected result in a comparison
        """
        return line.split()[1] # for example
    
    with open("log1.txt") as f1, open("log2.txt") as f2:
        sources = [f1, f2]
        with open("merged.txt", "w") as dest:
            # Decorate: one lazy generator of (timestamp, line) pairs per file
            decorated = [
                ((extract_timestamp(line), line) for line in f)
                for f in sources]
            # Merge: tuples compare by timestamp first
            merged = heapq.merge(*decorated)
            # Undecorate: keep only the line from each pair
            undecorated = imap(itemgetter(-1), merged)
            dest.writelines(undecorated)
    

    Every step in the above is "lazy". Because I avoid file.readlines(), the lines in the files are read only as needed. The decoration step is lazy as well, since it uses generator expressions rather than list comprehensions. heapq.merge() is lazy, too: it only needs to hold one item per input iterator at a time to do the necessary comparisons. Finally, I'm using itertools.imap(), the lazy variant of the map() built-in, to undecorate.

    (In Python 3, map() has become lazy, so you can use it in place of imap().)
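
    From Python 3.5 on, heapq.merge() accepts a key argument, so the decorate and undecorate steps can be dropped. Below is a minimal sketch under that assumption; merge_logs is a hypothetical helper name, the sample lines are made up, and the io.StringIO objects only stand in for open log files. The timestamp position (second whitespace-separated field) is the same "for example" assumption as above.

```python
import heapq
import io

def extract_timestamp(line):
    # Assumed format: the timestamp is the second whitespace-separated field
    return line.split()[1]

def merge_logs(sources, dest):
    """Lazily merge already-sorted line iterables into dest by timestamp."""
    # heapq.merge() pulls one line at a time from each source
    dest.writelines(heapq.merge(*sources, key=extract_timestamp))

# In-memory stand-ins for files opened with open(...)
log1 = io.StringIO("a 2021-01-01 first\na 2021-01-03 third\n")
log2 = io.StringIO("b 2021-01-02 second\nb 2021-01-04 fourth\n")
out = io.StringIO()
merge_logs([log1, log2], out)
```

    With real files, pass the open file objects as sources and the output file as dest; nothing is read into memory beyond one pending line per input.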

  • 2021-01-03 14:56

    You want to implement the merge step of a file-based merge sort. Read one line from each file, output the line with the older timestamp, then read the next line from the file it came from. Repeat until one of the files is exhausted, then output all the remaining lines from the other file.
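
    The steps described here might look roughly like the sketch below for two inputs. merge_two and extract_timestamp are hypothetical names, the timestamp position is an assumption, and the io.StringIO objects stand in for open files. It assumes every line contains a timestamp field.

```python
import io

def extract_timestamp(line):
    # Assumed format: the timestamp is the second whitespace-separated field
    return line.split()[1]

def merge_two(f1, f2, dest):
    """Merge two line iterables, each sorted by timestamp, into dest."""
    it1, it2 = iter(f1), iter(f2)
    line1 = next(it1, "")  # "" signals that the input is exhausted
    line2 = next(it2, "")
    while line1 and line2:
        # Output the older line, then advance only the input it came from
        if extract_timestamp(line1) <= extract_timestamp(line2):
            dest.write(line1)
            line1 = next(it1, "")
        else:
            dest.write(line2)
            line2 = next(it2, "")
    # One input is exhausted: copy the pending line and the rest of the other
    if line1:
        dest.write(line1)
        dest.writelines(it1)
    if line2:
        dest.write(line2)
        dest.writelines(it2)

# In-memory stand-ins for files opened with open(...)
first = io.StringIO("a 1 x\na 3 y\n")
second = io.StringIO("b 2 z\n")
result = io.StringIO()
merge_two(first, second, result)
```

    Only one line per input is held at a time. For more than two files, the same idea needs a way to pick the smallest of N pending lines, which is what a heap provides.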
