How to find common strings among two very large files?

天涯浪人 2021-02-06 07:08

I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in

8 answers

南方客 (original poster)

2021-02-06 07:37
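A hash-based solution might look like this (a Python sketch; the file paths are placeholders):
```
import hashlib
from collections import defaultdict

files = ["a.txt", "b.txt"]  # placeholder paths
hashes = defaultdict(int)   # digest -> bitmask of the files containing that line
for i, path in enumerate(files):
    with open(path, "rb") as f:
        for line in f:
            h = hashlib.md5(line.rstrip(b"\r\n")).digest()
            hashes[h] |= 1 << i  # a bitmask, so repeats within one file count once
```
Then loop over the files again, printing the matching lines:
```
in_all = (1 << len(files)) - 1  # mask with one bit set per file
for path in files:
    with open(path, "rb") as f:
        for line in f:
            h = hashlib.md5(line.rstrip(b"\r\n")).digest()
            if hashes.get(h) == in_all:
                print(line.rstrip(b"\r\n").decode())
                del hashes[h]  # since we only want each once
```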
There are two potential problems.
1. Hash collisions are possible (the risk can be reduced somewhat, but not eliminated).
2. It needs to hold a dict (associative array) with one entry per unique line across all files.
This is O(lines * cost(md5)), where lines is the total number of lines in all files.
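If the dict from problem 2 is itself too large for memory, one possible workaround (not part of the approach above, just a sketch of a different technique) is to first split both files into smaller bucket files by hash, so that equal lines always land in the same bucket, and then compare one pair of buckets at a time. Within a bucket the actual lines are compared rather than their digests, which also sidesteps problem 1. The names `bucket_of`, `partition`, `common_lines` and the default of 64 buckets are made up for illustration, and the sketch assumes each individual bucket of the first file fits in memory.

```python
import hashlib
import os
import tempfile


def bucket_of(line: bytes, nbuckets: int) -> int:
    """Map a line to a bucket using the first 4 bytes of its MD5 digest."""
    return int.from_bytes(hashlib.md5(line).digest()[:4], "big") % nbuckets


def partition(path, tmpdir, tag, nbuckets):
    """Split one input file into nbuckets smaller files, streaming line by line."""
    names = [os.path.join(tmpdir, f"{tag}_{b}.part") for b in range(nbuckets)]
    outs = [open(n, "wb") for n in names]
    try:
        with open(path, "rb") as f:
            for raw in f:
                line = raw.rstrip(b"\r\n")
                outs[bucket_of(line, nbuckets)].write(line + b"\n")
    finally:
        for o in outs:
            o.close()
    return names


def common_lines(path_a, path_b, nbuckets=64):
    """Yield each line present in both files once, one bucket in memory at a time."""
    with tempfile.TemporaryDirectory() as tmpdir:
        parts_a = partition(path_a, tmpdir, "a", nbuckets)
        parts_b = partition(path_b, tmpdir, "b", nbuckets)
        for pa, pb in zip(parts_a, parts_b):
            # Only this bucket of the first file is held in memory.
            with open(pa, "rb") as f:
                seen = {raw.rstrip(b"\n") for raw in f}
            with open(pb, "rb") as f:
                for raw in f:
                    line = raw.rstrip(b"\n")
                    if line in seen:
                        seen.discard(line)  # report each common line only once
                        yield line
```

It could be used as `for line in common_lines("a.txt", "b.txt"): print(line.decode())`. The trade-off is extra disk space and one additional pass over each file; the number of buckets would need tuning to the file sizes and available memory.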

(If people want a fuller Python implementation, it's pretty easy to write. I don't know Java, though!)