I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in
A hash-based solution might look like this (in Python pseudocode):
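    from collections import defaultdict
    from hashlib import md5

    hashes = defaultdict(int)  # digest -> number of times it was seen
    nfiles = len(files)        # files: the open input files (text mode)
    for file in files:
        for line in file:
            h = md5(line.encode()).digest()
            hashes[h] += 1

Then loop over the files again, printing the matching lines:

    for file in files:
        file.seek(0)  # rewind before the second pass
        for line in file:
            h = md5(line.encode()).digest()
            if hashes.get(h) == nfiles:
                print(line, end="")
                del hashes[h]  # since we only want each once.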
There are two potential problems.
This is O(lines * cost(md5)).
(If people want a fuller Python implementation, it's pretty easy to write; I don't know Java, though!)
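For what it's worth, a fuller version of the idea could look roughly like the sketch below. It assumes the set of digests fits in memory even though the files themselves do not. The names `line_digest` and `common_lines` are made up for illustration. One deliberate change from the pseudocode: instead of a plain counter it keeps one bit per file, so a line repeated inside a single file is not mistaken for a line present in every file.

```python
import hashlib
from collections import defaultdict


def line_digest(line: bytes) -> bytes:
    """Return the MD5 digest of a line with its trailing newline removed."""
    return hashlib.md5(line.rstrip(b"\r\n")).digest()


def common_lines(paths):
    """Yield each line (as bytes) that occurs in every file in `paths`, once."""
    if not paths:
        return
    all_files = (1 << len(paths)) - 1  # bitmask with one bit set per file

    # Pass 1: for every digest, record which files contain it.
    seen = defaultdict(int)
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            for line in f:
                seen[line_digest(line)] |= 1 << i

    # Pass 2: a common line must appear in the first file, so re-reading
    # that one file is enough to recover the actual text.
    with open(paths[0], "rb") as f:
        for line in f:
            h = line_digest(line)
            if seen.get(h) == all_files:
                del seen[h]  # emit each common line only once
                yield line.rstrip(b"\r\n")
```

Something like `for line in common_lines(["a.txt", "b.txt"]): print(line.decode())` would then print the shared lines (the file names here are placeholders). Lines are matched by digest only, so the result is as trustworthy as the hash; comparing the actual strings would need an extra verification step.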