There are two files, say FileA and FileB, and we need to find all the numbers that are in FileA but not in FileB. All the numbers in FileA are sorted and all t
You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (this works as long as both files are sorted!):
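from itertools import groupby
from heapq import merge

FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

# Tag each value with its source so it can still be identified after merging.
gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

# merge() interleaves the two sorted streams; groupby() collects equal values.
for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    # Skip any value that also occurs in FileB.
    if any(src == 'FileB' for _, src in g):
        continue
    print(v)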
Prints:
2
5
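The same idea can be wrapped in a reusable generator. This is only a minimal sketch: the name `sorted_difference` and the 0/1 source tags are hypothetical choices for illustration, and it assumes both inputs are already sorted in ascending order. As in the loop above, each value is yielded once even if it is repeated in the first input.

```python
from heapq import merge
from itertools import groupby


def sorted_difference(a, b):
    """Yield each distinct value of sorted iterable `a` that is absent from sorted iterable `b`."""
    tagged_a = ((v, 0) for v in a)  # 0 marks values coming from `a`
    tagged_b = ((v, 1) for v in b)  # 1 marks values coming from `b`
    merged = merge(tagged_a, tagged_b, key=lambda t: t[0])
    for value, group in groupby(merged, key=lambda t: t[0]):
        # Keep the value only if no element of its group came from `b`.
        if not any(src == 1 for _, src in group):
            yield value


print(list(sorted_difference([1, 1, 2, 3, 4, 5], [1, 3, 4, 6])))  # [2, 5]
```

Because it is a generator, the result can be consumed lazily, so it should also work with inputs that are themselves generators reading a file line by line.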
EDIT (Reading from files):
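from itertools import groupby
from heapq import merge

# Open both files in a with-block so they are closed when the loop finishes.
with open('3k.txt') as file_a, open('2k.txt') as file_b:
    # Tag each number with its source: 1 for 3k.txt, 2 for 2k.txt.
    gen_a = ((int(line.strip()), 1) for line in file_a)
    gen_b = ((int(line.strip()), 2) for line in file_b)
    for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
        # Skip any number that also occurs in 2k.txt.
        if any(src == 2 for _, src in g):
            continue
        print(v)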
Benchmark:
Generating files with numbers from 0 up to 10_000_000 (every third number in 3k.txt, every second number in 2k.txt):
seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt
The script takes about 10 seconds to complete:
real 0m10,656s
user 0m10,557s
sys 0m0,076s