How to find common strings among two very large files?

天涯浪人 2021-02-06 07:08

I have two very large files (and neither of them would fit in memory). Each file has one string (which doesn't have spaces in

8 Answers
  • 2021-02-06 07:38

    I'd do it as follows (for any number of files):

    • Sort just one file (#1).
    • Walk through each line of the next file (#2) and do a binary search for it in the sorted #1 file (based on the number of lines).
    • If you find the string, write it to a temp file (#temp1).
    • After you finish with #2, sort #temp1, then go to #3 and run the same search, this time against #temp1 instead of #1. This pass should take much less time than the first, since #temp1 holds only the lines common to #1 and #2.
    • Repeat this process with new temporary files, deleting the previous #temp files. Each iteration should take less and less time as the number of common lines shrinks.
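
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: file #1 has already been sorted in plain byte order by an external tool (for example `LC_ALL=C sort`), lines end with `\n`, and the binary search runs over byte offsets rather than line numbers, so no line index has to be held in memory. The names `line_at`, `contains` and `keep_common` are made up for this sketch.

```python
import os

def line_at(f, pos):
    """Return the first full line starting at or after byte offset pos (b"" at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # discard the rest of the line that covers byte pos - 1
    return f.readline()

def contains(f, size, target):
    """Binary-search a byte-sorted file for a line equal to target (bytes, no newline)."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at(f, mid)
        if line and line.rstrip(b"\n") < target:
            lo = mid + 1  # the first line >= target starts further right
        else:
            hi = mid      # reached EOF or a line >= target: look further left
    found = line_at(f, lo)
    return bool(found) and found.rstrip(b"\n") == target

def keep_common(sorted_path, other_path, out_path):
    """Write every line of other_path that also occurs in sorted_path to out_path."""
    size = os.path.getsize(sorted_path)
    with open(sorted_path, "rb") as ref, \
         open(other_path, "rb") as other, \
         open(out_path, "wb") as out:
        for raw in other:  # streams file #2 one line at a time
            s = raw.rstrip(b"\n")
            if contains(ref, size, s):
                out.write(s + b"\n")
```

For a third file, the temp output would be sorted (again externally) and passed as `sorted_path` in the next call. Each lookup costs a logarithmic number of disk seeks, so this approach pays off mainly when the file being scanned is much smaller than the sorted one.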
  • 2021-02-06 07:43

    You haven't said what platform you're working on, so I assume it's Windows. In the unlikely event that you're on a Unix platform, standard tools will do it for you.

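    # Deduplicate each file first, so a repeated line can only come from different files.
    sort file1 | uniq > output
    sort file2 | uniq >> output
    sort file3 | uniq >> output
    ...
    # Print every line that now occurs more than once, i.e. in at least two of the files.
    sort output | uniq -d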
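
For readers without those tools, the same idea (deduplicate each sorted file, merge, report repeated lines) might look like the following in Python. It is a sketch only, and it assumes each input file has already been sorted in code-point order (for UTF-8 text, `LC_ALL=C sort` produces that order). The function names and the `min_files` parameter are hypothetical. By default it keeps lines present in every file; `min_files=2` matches what `uniq -d` reports.

```python
import heapq
from itertools import groupby

def dedupe_sorted(lines):
    """Yield each distinct line of a sorted iterable once (like `uniq`)."""
    for line, _ in groupby(lines):
        yield line

def common_lines(sorted_paths, min_files=None):
    """Yield lines that occur in at least min_files of the pre-sorted files."""
    if min_files is None:
        min_files = len(sorted_paths)  # default: present in every file
    files = [open(p, "r", encoding="utf-8") for p in sorted_paths]
    try:
        streams = [dedupe_sorted(line.rstrip("\n") for line in f) for f in files]
        # heapq.merge lazily merges the sorted streams, so memory use stays small.
        for line, group in groupby(heapq.merge(*streams)):
            if sum(1 for _ in group) >= min_files:
                yield line
    finally:
        for f in files:
            f.close()
```

A call such as `for s in common_lines(["file1.sorted", "file2.sorted"]): print(s)` would stream the common strings without loading either file fully (the file names here are placeholders).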
    