Find all the numbers in one file that are not in another file in Python

灰色年华 2021-01-04 02:08

There are two files, FileA and FileB, and we need to find all the numbers in FileA that are not in FileB. All the numbers in FileA are sorted and all t

5 Answers
  •  隐瞒了意图╮
    2021-01-04 02:33

    You can combine itertools.groupby and heapq.merge to iterate through FileA and FileB lazily. This works as long as both files are sorted.

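    from heapq import merge
    from itertools import groupby

    FileA = [1, 1, 2, 3, 4, 5]
    FileB = [1, 3, 4, 6]

    # Tag each value with its source so it can be identified after merging.
    gen_a = ((v, 'FileA') for v in FileA)
    gen_b = ((v, 'FileB') for v in FileB)

    # merge() lazily interleaves the two sorted streams; groupby() then
    # collects equal values from both sources into one group.
    merged = merge(gen_a, gen_b, key=lambda item: item[0])
    for value, group in groupby(merged, key=lambda item: item[0]):
        # Skip any value that also appears in FileB.
        if any(source == 'FileB' for _, source in group):
            continue
        print(value)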
    

    Prints:

    2
    5
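
    The list-based snippet above can also be wrapped in a reusable generator. This is only a minimal sketch of the same merge/groupby idea: the name sorted_difference and its parameters are hypothetical, and both inputs are assumed to be sorted in ascending order.

```python
from heapq import merge
from itertools import groupby


def sorted_difference(a, b):
    """Yield each distinct value of `a` that does not occur in `b`.

    Hypothetical helper: assumes both iterables are sorted ascending.
    """
    # Tag every value with the input it came from.
    tagged_a = ((value, 'a') for value in a)
    tagged_b = ((value, 'b') for value in b)
    merged = merge(tagged_a, tagged_b, key=lambda item: item[0])
    for value, group in groupby(merged, key=lambda item: item[0]):
        # Emit the value only if no member of the group came from `b`.
        if all(source == 'a' for _, source in group):
            yield value


print(list(sorted_difference([1, 1, 2, 3, 4, 5], [1, 3, 4, 6])))  # [2, 5]
```

    Because it is a generator, nothing is read from either input until the result is iterated, so memory use stays small even for large inputs.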
    

    EDIT (Reading from files):

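    from heapq import merge
    from itertools import groupby

    with open('3k.txt') as file_a, open('2k.txt') as file_b:
        # Tag each number with its source: 1 for the first file, 2 for the second.
        gen_a = ((int(line.strip()), 1) for line in file_a)
        gen_b = ((int(line.strip()), 2) for line in file_b)

        merged = merge(gen_a, gen_b, key=lambda item: item[0])
        for value, group in groupby(merged, key=lambda item: item[0]):
            # Skip numbers that also occur in the second file.
            if any(source == 2 for _, source in group):
                continue
            print(value)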
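
    As a possible alternative to merge/groupby, the two sorted streams can be walked by hand with a single cursor into the second one. This is a different technique from the one above, sketched under the same assumption that both inputs are sorted ascending; the names only_in_first and parse_ints are hypothetical.

```python
def parse_ints(lines):
    """Yield one int per non-blank line (works on an open file or a list of strings)."""
    for line in lines:
        if line.strip():
            yield int(line)


def only_in_first(a, b):
    """Yield values of sorted iterable `a` that do not occur in sorted iterable `b`.

    Hypothetical helper: duplicates in `a` are reported once.
    """
    iter_b = iter(b)
    current_b = next(iter_b, None)  # None marks the end of `b`
    previous = None
    for value in a:
        if value == previous:
            continue  # duplicate in `a`, already handled
        previous = value
        # Advance `b` until it catches up with the current value from `a`.
        while current_b is not None and current_b < value:
            current_b = next(iter_b, None)
        if current_b != value:
            yield value


print(list(only_in_first([1, 1, 2, 3, 4, 5], [1, 3, 4, 6])))  # [2, 5]
```

    With the files from the snippet above, the call would look like only_in_first(parse_ints(file_a), parse_ints(file_b)) inside a with block that opens both files. This variant avoids building a tuple per line, which may or may not matter in practice.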
    

    Benchmark:

    Generating two test files with values from 0 up to 10_000_000 (every 3rd and every 2nd number):

    seq 0 3 10000000 > 3k.txt
    seq 0 2 10000000 > 2k.txt
    

    The script takes about 10 seconds to complete:

    real    0m10,656s
    user    0m10,557s
    sys     0m0,076s
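
    Where seq is not available, the benchmark setup can be approximated in Python. This is only a sketch: write_sequence, count_only_in_first and run_benchmark are hypothetical names, the files are written to a temporary directory, and the time is measured with time.perf_counter rather than the shell's time command, so the figures will not match the ones above exactly.

```python
import os
import tempfile
import time
from heapq import merge
from itertools import groupby


def write_sequence(path, step, last):
    """Write 0, step, 2*step, ... up to and including `last`, one per line (like `seq 0 step last`)."""
    with open(path, 'w') as handle:
        for number in range(0, last + 1, step):
            handle.write(f"{number}\n")


def count_only_in_first(path_a, path_b):
    """Count the numbers found in path_a but not in path_b (both files sorted)."""
    with open(path_a) as file_a, open(path_b) as file_b:
        gen_a = ((int(line), 1) for line in file_a)
        gen_b = ((int(line), 2) for line in file_b)
        merged = merge(gen_a, gen_b, key=lambda item: item[0])
        # sum() consumes the generators while both files are still open.
        return sum(
            1
            for _, group in groupby(merged, key=lambda item: item[0])
            if all(source == 1 for _, source in group)
        )


def run_benchmark(last):
    """Generate both test files, then time the comparison only."""
    with tempfile.TemporaryDirectory() as folder:
        path_a = os.path.join(folder, '3k.txt')
        path_b = os.path.join(folder, '2k.txt')
        write_sequence(path_a, 3, last)
        write_sequence(path_b, 2, last)
        start = time.perf_counter()
        count = count_only_in_first(path_a, path_b)
        return count, time.perf_counter() - start


# Small run for illustration; pass last=10_000_000 to match the file sizes above.
count, seconds = run_benchmark(last=1000)
print(f"{count} numbers only in the first file, {seconds:.4f}s")
```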
    
