How to find common strings among two very large files?

天涯浪人 2021-02-06 07:08

I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in it) on each line. How can I find the strings that are common to both files?

8 answers
  • 2021-02-06 07:17

    Is there any order to the data in the files? The reason I ask is that although a line-by-line comparison would take an eternity, going through one file line by line whilst doing a binary search in the other would be much quicker. This can only work if the data is sorted in a particular way, though.

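    A rough sketch of this idea in Python; it assumes file B has been sorted in plain byte order (e.g. with LC_ALL=C sort), and the file names are placeholders:

    import os

    def line_at(f, offset):
        # Return the first complete line starting at or after byte 'offset'.
        if offset == 0:
            f.seek(0)
        else:
            f.seek(offset - 1)
            f.readline()              # finish the line containing offset-1
        return f.readline()           # b"" means end of file

    def contains(f, size, target):
        # Binary search over byte offsets of a file whose lines are sorted.
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_at(f, mid)
            if line != b"" and line.rstrip(b"\r\n") < target:
                lo = mid + 1          # first line after mid is still too small
            else:
                hi = mid              # found a line >= target (or hit EOF)
        return line_at(f, lo).rstrip(b"\r\n") == target

    # a.txt and b_sorted.txt are placeholder names; B must be sorted byte-wise.
    with open("b_sorted.txt", "rb") as b, open("a.txt", "rb") as a:
        size = os.path.getsize("b_sorted.txt")
        for raw in a:
            s = raw.rstrip(b"\r\n")
            if s and contains(b, size, s):
                print(s.decode())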
  • 2021-02-06 07:25

    To do it on Windows is pretty simple. Let's say you have two files, A and B; file A contains the strings you want to search for in file B. Just open a command prompt and use the following command:

    FINDSTR /G:A B > OUTPUT
    

    This command is pretty fast and can compare two files very efficiently. The file OUTPUT will contain the strings common to A and B. (By default FINDSTR matches substrings; add the /X switch if you need whole-line matches.)

    If you want the inverse (lines in B that do not match anything in A), then use

    FINDSTR /V /G:A B > OUTPUT
    
  • 2021-02-06 07:27

    I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.

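    A minimal sketch of that merge step in Python, assuming both files have already been sorted with the same byte ordering (file names are placeholders):

    def common_lines(path_a, path_b):
        # Walk two sorted files in lockstep, advancing whichever side is behind.
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.readline(), fb.readline()
            while a and b:
                ka, kb = a.rstrip(b"\r\n"), b.rstrip(b"\r\n")
                if ka == kb:
                    yield ka
                    a, b = fa.readline(), fb.readline()
                elif ka < kb:
                    a = fa.readline()
                else:
                    b = fb.readline()

    for s in common_lines("a_sorted.txt", "b_sorted.txt"):
        print(s.decode())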
  • 2021-02-06 07:30

    I would load both files into two database tables, so that each string in a file becomes a row in its table, and then use a SQL join to find the rows common to both.

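    A rough sketch of this approach using Python's built-in sqlite3 module (file, table, and column names are placeholders):

    import sqlite3

    con = sqlite3.connect("strings.db")          # on-disk database, not in RAM
    con.execute("CREATE TABLE a (s TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE b (s TEXT PRIMARY KEY)")

    def load(table, path):
        # Insert one row per line; OR IGNORE drops duplicate strings.
        with open(path, "r", encoding="utf-8") as f:
            con.executemany(
                "INSERT OR IGNORE INTO %s (s) VALUES (?)" % table,
                ((line.rstrip("\n"),) for line in f))
        con.commit()

    load("a", "file_a.txt")
    load("b", "file_b.txt")

    # The PRIMARY KEY gives each table an index, so the join works from disk
    # rather than needing memory proportional to the file sizes.
    for (s,) in con.execute("SELECT a.s FROM a JOIN b ON a.s = b.s"):
        print(s)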
  • 2021-02-06 07:34

    Depending on how similar the entries within one file are, it might be possible to build a trie (not a tree) from it. Using this trie you can iterate over the other file and check, for each entry, whether it is inside the trie.

    When you have more than two files, iterate over one file and build a new trie from the matches. That way the last trie you have will contain all the strings that appear in every file.

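    A small sketch of the two-file case in Python, using nested dicts as the trie (the file names and the end-of-string marker are invented for the example):

    END = object()   # marker meaning "a complete string ends at this node"

    # Build the trie from the first file; this only saves memory when many
    # strings share prefixes.
    trie = {}
    with open("file_a.txt", "r", encoding="utf-8") as f:
        for line in f:
            node = trie
            for ch in line.rstrip("\n"):
                node = node.setdefault(ch, {})
            node[END] = True

    # Walk the second file and report every string found in the trie.
    with open("file_b.txt", "r", encoding="utf-8") as f:
        for line in f:
            s = line.rstrip("\n")
            node = trie
            for ch in s:
                node = node.get(ch)
                if node is None:
                    break
            else:
                if END in node:
                    print(s)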
  • 2021-02-06 07:37

    A hash-based solution might look like this (in Python):

    import hashlib
    from collections import defaultdict

    hashes = defaultdict(int)
    for name in files:                 # 'files' is the list of input file names
        with open(name, "rb") as f:
            for line in f:
                h = hashlib.md5(line.rstrip(b"\r\n")).digest()
                hashes[h] += 1         # assumes a string appears at most once per file
    

    Then loop over the files again, printing the matching lines:

    for name in files:
        with open(name, "rb") as f:
            for line in f:
                h = hashlib.md5(line.rstrip(b"\r\n")).digest()
                if hashes.get(h) == len(files):
                    print(line.rstrip(b"\r\n").decode())
                    del hashes[h]      # since we only want each string once
    

    There are two potential problems.

    1. Potential hash collisions (the risk can be mitigated somewhat, but it never disappears entirely).
    2. It needs to hold a dict (associative array) with one entry per unique line across all files.

    This is O(lines * cost(md5)).

    (If people want a fuller Python implementation, it's pretty easy to write; I don't know Java, though!)
