How to find common strings among two very large files?

天涯浪人 2021-02-06 07:08

I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in

8 answers
  •  南方客 (original poster)
    2021-02-06 07:37

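    A hash-based solution might look like this in Python (files is a list of input file paths):

    import hashlib
    from collections import Counter

    hashes = Counter()  # digest -> number of files containing that line
    for path in files:
        with open(path, "rb") as f:
            # build a set first so duplicates within one file count only once
            hashes.update({hashlib.md5(line.rstrip(b"\r\n")).digest() for line in f})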
    

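    Then loop over the files again, printing the matching lines:

    # reuses hashes, files and hashlib from the first pass
    nfiles = len(files)
    for path in files:
        with open(path, "rb") as f:
            for line in f:
                text = line.rstrip(b"\r\n")
                h = hashlib.md5(text).digest()
                if hashes.get(h) == nfiles:
                    print(text.decode())
                    del hashes[h]  # since we only want each line once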
    

    There are two potential problems.

    1. Hash collisions: two different lines could produce the same digest. The risk can be reduced, but not eliminated.
    2. Memory: the dict (associative array) has to hold one entry per unique line, so its size is |unique lines in all files|.
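
    If the dict itself is too large, one possible workaround (a different technique from the one above, sketched here under assumptions) is to partition both files on disk by hash, so that equal lines always land in the same numbered bucket, and then intersect one bucket pair at a time. Because the real strings are compared, a collision only affects how evenly the buckets fill. The function names and the n_buckets default are hypothetical; the sketch assumes each bucket of the first file fits in memory.

```python
import hashlib
import os
import tempfile


def partition(path, out_dir, prefix, n_buckets):
    """Split `path` into n_buckets files; equal lines always share a bucket."""
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "wb")
            for i in range(n_buckets)]
    try:
        with open(path, "rb") as f:
            for line in f:
                key = line.rstrip(b"\r\n")
                digest = hashlib.md5(key).digest()
                i = int.from_bytes(digest[:4], "big") % n_buckets
                outs[i].write(key + b"\n")
    finally:
        for o in outs:
            o.close()


def common_lines_partitioned(path_a, path_b, n_buckets=64):
    """Yield lines (as bytes) present in both files, each one once."""
    with tempfile.TemporaryDirectory() as tmp:
        partition(path_a, tmp, "a", n_buckets)
        partition(path_b, tmp, "b", n_buckets)
        for i in range(n_buckets):
            # Only one bucket of file A is held in memory at a time.
            with open(os.path.join(tmp, f"a{i}"), "rb") as f:
                in_a = {line.rstrip(b"\n") for line in f}
            with open(os.path.join(tmp, f"b{i}"), "rb") as f:
                for line in f:
                    key = line.rstrip(b"\n")
                    if key in in_a:
                        in_a.discard(key)  # report each common line once
                        yield key
```

    This costs extra disk space (roughly a second copy of both files) and one extra read and write of each file. The output order follows the buckets, not the original files.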

    This is O(lines * cost(md5)), where lines is the total number of lines across all files.

    (If people want a fuller Python implementation, it's pretty easy to write. I don't know Java, though!)
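
    For reference, a fuller version of the two-pass hash approach could look roughly like the sketch below. The name common_lines and the helper _digest are made up for illustration. It assumes UTF-8 text and that the table of digests fits in memory, and it still carries the collision risk noted above. Since a common line must appear in every file, this sketch re-reads only the first file in the second pass.

```python
import hashlib
from collections import Counter


def _digest(line: bytes) -> bytes:
    """MD5 digest of a line with its trailing newline removed."""
    return hashlib.md5(line.rstrip(b"\r\n")).digest()


def common_lines(paths):
    """Yield each line (as str) that occurs in every file in `paths`, once."""
    if not paths:
        return
    counts = Counter()
    # Pass 1: for each digest, count how many files contain it.
    for path in paths:
        with open(path, "rb") as f:
            # A set per file, so repeats inside one file count only once.
            counts.update({_digest(line) for line in f})
    wanted = {h for h, n in counts.items() if n == len(paths)}
    del counts
    # Pass 2: re-read the first file and emit each matching line once.
    with open(paths[0], "rb") as f:
        for line in f:
            h = _digest(line)
            if h in wanted:
                wanted.discard(h)
                yield line.rstrip(b"\r\n").decode("utf-8")
```

    Usage would be something like: for line in common_lines(["a.txt", "b.txt"]): print(line), where the two file names are placeholders.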
