Find all the numbers in one file that are not in another file in Python


灰色年华 2021-01-04 02:08

There are two files, say FileA and FileB, and we need to find all the numbers in FileA that are not in FileB. All the numbers in FileA are sorted and all t

5 answers

隐瞒了意图╮ (original poster)

2021-01-04 02:33

You can combine itertools.groupby and heapq.merge to iterate through FileA and FileB lazily. This works as long as both files are sorted.

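from heapq import merge
from itertools import groupby

# Sample data standing in for the two sorted files
FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

# Tag every value with its source so it can still be identified after merging
gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

# merge() lazily interleaves the two sorted streams; groupby() then collects equal values
merged = merge(gen_a, gen_b, key=lambda k: int(k[0]))
for value, group in groupby(merged, key=lambda k: int(k[0])):
    # Skip any value that occurs at least once in FileB
    if any(source == 'FileB' for _, source in group):
        continue
    print(value)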

Prints:

2
5
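For comparison, the same result can also be produced without tagging and grouping, by walking both sorted inputs with two cursors. This is a different technique from the groupby/merge approach above, shown only as a minimal sketch. It assumes both inputs are sorted in ascending order, and the name sorted_difference is made up for illustration.

```python
def sorted_difference(a, b):
    """Yield each distinct value of sorted iterable `a` that is absent from sorted iterable `b`.

    Hypothetical helper: assumes both inputs are sorted in ascending order.
    """
    sentinel = object()          # marks "no value yet" / "b is exhausted"
    it_b = iter(b)
    cur_b = next(it_b, sentinel)
    last = sentinel              # last value of `a` already handled

    for x in a:
        # Skip duplicates inside `a` so every value is reported at most once
        if last is not sentinel and x == last:
            continue
        last = x

        # Advance the cursor in `b` until it reaches or passes x
        while cur_b is not sentinel and cur_b < x:
            cur_b = next(it_b, sentinel)

        # x is missing from `b` if `b` ran out or its cursor moved past x
        if cur_b is sentinel or cur_b != x:
            yield x


if __name__ == "__main__":
    print(list(sorted_difference([1, 1, 2, 3, 4, 5], [1, 3, 4, 6])))
```

Like the groupby/merge version, this reads each input only once and keeps just one value per input in memory.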

EDIT (Reading from files):

from itertools import groupby
from heapq import merge

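# Stream both sorted files line by line, tagging each number with its source
# (1 = 3k.txt, 2 = 2k.txt); the with block closes both files afterwards
with open('3k.txt') as file_a, open('2k.txt') as file_b:
    gen_a = ((int(line.strip()), 1) for line in file_a)
    gen_b = ((int(line.strip()), 2) for line in file_b)

    for value, group in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
        # Skip numbers that also appear in 2k.txt
        if any(source == 2 for _, source in group):
            continue
        print(value)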
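The file-reading variant can also be wrapped in a reusable generator. The sketch below is one possible way to do that, not the answer's own code: the helper names read_ints and only_in_first are hypothetical, and it assumes each file holds one integer per line in ascending order (blank lines are skipped).

```python
from heapq import merge
from itertools import groupby


def read_ints(handle, tag):
    """Lazily yield (number, tag) pairs from an open text file, skipping blank lines."""
    for line in handle:
        line = line.strip()
        if line:
            yield int(line), tag


def only_in_first(path_a, path_b):
    """Yield each number found in path_a but not in path_b.

    Hypothetical helper: assumes both files are sorted in ascending order.
    """
    with open(path_a) as file_a, open(path_b) as file_b:
        # Interleave the two sorted streams by numeric value
        tagged = merge(read_ints(file_a, 1), read_ints(file_b, 2), key=lambda k: k[0])
        for value, group in groupby(tagged, key=lambda k: k[0]):
            # Keep the value only if every occurrence came from the first file
            if all(tag == 1 for _, tag in group):
                yield value
```

It could then be used as `for n in only_in_first('3k.txt', '2k.txt'): print(n)`, which keeps the lazy, one-pass behaviour of the original script.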

Benchmark:

Generating two test files with numbers from 0 up to 10_000_000 (every third number in 3k.txt, every second number in 2k.txt):

seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt

The script takes about 10 seconds to complete:

real    0m10,656s
user    0m10,557s
sys     0m0,076s
