I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in
Is there any order to the data in the files? The reason I ask is that though a line by line comparison would take an eternity, going through one file line by line whilst doing a binary search in the other would be much quicker. This can only work if the data is sorted in a particular way though.
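To make the binary-search idea concrete, here is a rough sketch. It assumes the file being searched is sorted in plain byte order with one string per line; the function name `contains_line` is made up for illustration. It searches over byte offsets rather than line numbers, so nothing has to be loaded into memory.

```python
import os

def contains_line(path, target):
    """Return True if `target` (bytes, no newline) is a line of the sorted file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # The first line is never reached by the "skip to next line" step below.
        first = f.readline()
        if first and first.rstrip(b"\r\n") == target:
            return True

        def line_after(offset):
            # Return the first full line that starts after `offset` (b"" at EOF).
            f.seek(offset)
            f.readline()  # discard the rest of the line that contains `offset`
            return f.readline()

        # Find the smallest offset whose following line is >= target.
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_after(mid)
            if not line or line.rstrip(b"\r\n") >= target:
                hi = mid
            else:
                lo = mid + 1
        line = line_after(lo)
        return bool(line) and line.rstrip(b"\r\n") == target
```

You would then read the other file line by line and call `contains_line` for each string; each lookup costs roughly log2(file size) seeks.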
On Windows this is pretty simple. Say you have two files, A and B, where A contains the strings you want to search for in B. Open a command prompt and run the following command:
FINDSTR /G:A B > OUTPUT
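This command is pretty fast and can compare the two files very efficiently. The file OUTPUT will contain the lines of B that match a string in A. Note that by default FINDSTR treats the search strings as regular expressions and matches them anywhere in a line; add /L /X if you want literal, whole-line matches only.
If you want the difference instead (lines in B that match nothing in A), then use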
FINDSTR /V /G:A B > OUTPUT
I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
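A minimal sketch of that merge-style pass, assuming both files have already been sorted in the same byte order (for example with an external sort). The name `common_lines` is just for illustration. Only one line from each file is held in memory at a time.

```python
def common_lines(path_a, path_b):
    """Yield each string that appears in both sorted files, once."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.readline(), fb.readline()
        while a and b:
            ka, kb = a.rstrip(b"\r\n"), b.rstrip(b"\r\n")
            if ka < kb:
                a = fa.readline()      # A is behind, advance A
            elif ka > kb:
                b = fb.readline()      # B is behind, advance B
            else:
                yield ka               # same string in both files
                # Skip any repeats of this string in either file.
                while a and a.rstrip(b"\r\n") == ka:
                    a = fa.readline()
                while b and b.rstrip(b"\r\n") == ka:
                    b = fb.readline()
```

After the sort, this is a single linear pass over each file.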
I would load both files into two database tables so that each string in the file became a row in the table and use SQL queries to find duplicate rows using a join.
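As a rough illustration of the database approach, here is a sketch using SQLite through Python's built-in `sqlite3` module. The table names `a` and `b`, the column name `s`, and the function name are all arbitrary choices, and the files are assumed to be UTF-8 text. For files that do not fit in memory you would pass a real file path as `db_path` rather than the in-memory default.

```python
import sqlite3

def common_lines_sql(path_a, path_b, db_path=":memory:"):
    """Yield strings present in both files, using a SQL join."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE a (s TEXT)")
        con.execute("CREATE TABLE b (s TEXT)")
        for table, path in (("a", path_a), ("b", path_b)):
            with open(path, encoding="utf-8") as f:
                # One row per line; the generator streams rows without
                # building the whole file in memory.
                con.executemany(
                    f"INSERT INTO {table} (s) VALUES (?)",
                    ((line.rstrip("\r\n"),) for line in f),
                )
        con.execute("CREATE INDEX b_s ON b (s)")  # speeds up the join lookup
        con.commit()
        query = "SELECT DISTINCT a.s FROM a JOIN b ON a.s = b.s ORDER BY a.s"
        for (s,) in con.execute(query):
            yield s
    finally:
        con.close()
```

Any other database would work the same way; the join is the important part.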
Depending on how similar the entries within one file are, it might be possible to build a trie (not a tree) from it. Using this trie, you can iterate over the other file and check whether each entry is in the trie.
When you have more than 2 files, iterate over one file and build a new trie from the matches. This way the last trie you have will contain all the matches that are contained in all files.
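Here is a small sketch of the trie idea using nested dicts. The helper names (`build_trie`, `in_trie`, `common_trie`, `words`) are invented for this example, and the inputs are any iterables of strings, such as open files with the newlines stripped. A dict-based trie in Python is quite memory-hungry, so treat this as an illustration of the technique rather than something tuned for huge inputs.

```python
END = ""  # marker key for "a string ends here"; never equal to a single character

def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(root, s):
    node = root
    for ch in s:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

def common_trie(first, *others):
    """Build a trie of the strings that occur in every input."""
    trie = build_trie(first)
    for strings in others:
        # Keep only the entries of this input that matched so far.
        trie = build_trie(s for s in strings if in_trie(trie, s))
    return trie

def words(node, prefix=""):
    """Yield every string stored in the trie."""
    for key, child in node.items():
        if key == END:
            yield prefix
        else:
            yield from words(child, prefix + key)
```

For example, `sorted(words(common_trie(["car", "cat", "dog"], ["cat", "cow", "dog"])))` gives the strings shared by both inputs.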
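A hash-based solution might look like this (in Python):
    import hashlib
    from collections import defaultdict

    files = ["a.txt", "b.txt"]  # placeholder paths, substitute your own
    nfiles = len(files)
    hashes = defaultdict(int)   # md5 digest -> number of times seen
    for path in files:
        with open(path, "rb") as f:
            for line in f:
                h = hashlib.md5(line.rstrip(b"\r\n")).digest()
                hashes[h] += 1  # repeats within one file are counted too
Then loop over the files again, printing matching lines:
    for path in files:
        with open(path, "rb") as f:
            for line in f:
                line = line.rstrip(b"\r\n")
                h = hashlib.md5(line).digest()
                if hashes.get(h) == nfiles:
                    print(line.decode(errors="replace"))
                    del hashes[h]  # since we only want each once.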
There are two potential problems.
This is O(lines * cost(md5)).
(If people want a fuller Python implementation, it's pretty easy to write. I don't know Java, though!)