How to find common strings among two very large files?

天涯浪人 2021-02-06 07:08

I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in it) per line. What is the best way to find all of the strings that are common to both files?

8 answers
  • 2021-02-06 07:17

    Is there any order to the data in the files? The reason I ask is that, although a line-by-line comparison would take an eternity, going through one file line by line while doing a binary search in the other would be much quicker. This only works if the data is sorted in a particular way, though.
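
    If the second file is sorted, a rough sketch of that idea (reading one file line by line and binary-searching the other by byte offset) might look like the following; a.txt and b_sorted.txt are placeholder names:

    import os

    def in_sorted_file(f, size, target):
        # Binary search by byte offset in an open binary file whose lines are sorted.
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                    # skip the (possibly partial) line we landed in
            nxt = f.readline().rstrip(b'\n')
            if not nxt or nxt >= target:    # hitting EOF counts as being past the target
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()                    # realign to the next line boundary
        while True:                         # only a line or two left to examine
            line = f.readline()
            if not line:
                return False
            line = line.rstrip(b'\n')
            if line == target:
                return True
            if line > target:
                return False

    with open('b_sorted.txt', 'rb') as sorted_f, open('a.txt', 'rb') as other:
        size = os.path.getsize('b_sorted.txt')
        for raw in other:
            s = raw.rstrip(b'\n')
            if in_sorted_file(sorted_f, size, s):
                print(s.decode())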

  • 2021-02-06 07:25

    To do it on Windows is pretty simple. Let's say you have two files, A and B, where A contains the strings you want to search for in file B. Just open a command prompt and use the following command:

    FINDSTR /G:A B > OUTPUT
    

    This command is pretty fast and compares the two files very efficiently. The file OUTPUT will contain the lines of B that match a string from A (by default FINDSTR treats each line of A as a search pattern and matches substrings; add the /X and /L switches if you need exact, literal whole-line matches).

    If you instead want the complement (the lines of B that do not match any string in A), use

    FINDSTR /V /G:A B > OUTPUT
    
  • 2021-02-06 07:27

    I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
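
    A minimal sketch of the merge step, assuming both files have already been sorted the same way (for example with an external sort under LC_ALL=C, so the ordering agrees with Python's string comparison) into a_sorted.txt and b_sorted.txt (placeholder names):

    with open('a_sorted.txt') as fa, open('b_sorted.txt') as fb:
        a, b = fa.readline(), fb.readline()
        while a and b:                       # stop at the end of either file
            x, y = a.rstrip('\n'), b.rstrip('\n')
            if x == y:
                print(x)                     # common string
                a, b = fa.readline(), fb.readline()
            elif x < y:
                a = fa.readline()
            else:
                b = fb.readline()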

  • 2021-02-06 07:30

    I would load the two files into two database tables, so that each string in a file becomes a row in its table, and then use an SQL join to find the rows that appear in both.
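
    A rough sketch of that approach using SQLite from Python (the database, table, and file names are all placeholders):

    import sqlite3

    con = sqlite3.connect('strings.db')
    con.execute('CREATE TABLE a (s TEXT)')
    con.execute('CREATE TABLE b (s TEXT)')

    # Load each file one line at a time, so neither has to fit in memory.
    with open('a.txt') as f:
        con.executemany('INSERT INTO a VALUES (?)', ((line.rstrip('\n'),) for line in f))
    with open('b.txt') as f:
        con.executemany('INSERT INTO b VALUES (?)', ((line.rstrip('\n'),) for line in f))
    con.commit()

    # An index on one side keeps the join from being quadratic.
    con.execute('CREATE INDEX idx_b ON b (s)')
    for (s,) in con.execute('SELECT DISTINCT a.s FROM a JOIN b ON a.s = b.s'):
        print(s)
    con.close()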

  • 2021-02-06 07:34

    Depending on how similar the entries within one file are, it might be possible to build a trie (not a tree) from it. Using this trie, you can iterate over the other file and check for each entry whether it is in the trie.

    When you have more than two files, iterate over one file and build a new trie from the matches. That way, the last trie you build contains exactly the strings that appear in all of the files.
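
    A rough sketch of that approach, using a nested-dict trie and assuming the first file's trie fits in memory (a.txt and b.txt are placeholder names):

    END = object()   # marker meaning "a complete string ends at this node"

    def build_trie(path):
        root = {}
        with open(path) as f:
            for line in f:
                node = root
                for ch in line.rstrip('\n'):
                    node = node.setdefault(ch, {})
                node[END] = True
        return root

    def in_trie(root, s):
        node = root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return False
        return END in node

    trie = build_trie('a.txt')       # the file whose trie fits in memory
    with open('b.txt') as f:
        for line in f:
            s = line.rstrip('\n')
            if in_trie(trie, s):
                print(s)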

  • 2021-02-06 07:37

    A hash-based solution might look like this (in Python):

    import hashlib
    from collections import defaultdict

    files = ['a.txt', 'b.txt']   # the input files (placeholder names)
    hashes = defaultdict(int)
    for filename in files:
        with open(filename, 'rb') as f:
            for line in f:
                h = hashlib.md5(line).digest()
                hashes[h] += 1   # assumes a line never repeats within a single file


    Then loop over the files again, printing the matching lines:

    for filename in files:
        with open(filename, 'rb') as f:
            for line in f:
                h = hashlib.md5(line).digest()
                if hashes.get(h) == len(files):
                    print(line.decode(), end='')
                    del hashes[h]  # since we only want each line once


    There are two potential problems.

    1. potential hash collisions (the risk can be reduced somewhat by using a stronger hash, but it never goes away entirely)
    2. the dict (associative array) has to hold one entry per unique line across all the files, and that has to fit in memory

    This is O(lines * cost(md5)).

    (If people want a fuller Python implementation, it's pretty easy to write. I don't know Java, though!)
