grep -F -f file1 file2
file1 is 90 MB (2.5 million lines, one word per line)
file2 is 45 GB
That command doesn't actually produce a result.
I don't think there is an easy solution.
Imagine you wrote your own program to do this: you would end up with a nested loop, where the outer loop iterates over the lines of file2 and the inner loop iterates over file1 (or vice versa). The number of comparisons grows with size(file1) * size(file2), which is a very large number when both files are large. Making one file smaller with head apparently resolves this issue, at the cost of no longer giving the correct result.
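To get a feel for the scale (the line length of file2 is an assumption here): with 2.5 million pattern words and, at roughly 18 bytes per line, around 2.5 billion lines in a 45 GB file2, the naive approach ends up doing on the order of 2.5*10^6 x 2.5*10^9, i.e. about 6*10^15 comparisons.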
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and, for each word, can decide whether it is in the pattern file without fully traversing the pattern file, you are much better off. This assumes a word-by-word comparison. If the pattern file contains not only full words but also substrings, this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
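As a rough sketch of that word-by-word idea (it assumes whitespace-separated words and that you want the matching lines of file2; adjust the field handling for your real data): load the 2.5 million patterns into an in-memory lookup table once, then stream file2 and do one hash lookup per word instead of rescanning file1.

# Read file1 into a lookup table, then scan file2 word by word;
# prints each line of file2 that contains at least one pattern word.
awk 'NR==FNR { seen[$0]; next }
     { for (i = 1; i <= NF; i++) if ($i in seen) { print; next } }' file1 file2

The pattern table is only about 90 MB of words, so it fits in memory comfortably; file2 is read once, sequentially.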
Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem by itself, because SQL suffers from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
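If you do go the database route, the benefit comes from the index, not from SQL as such. A minimal sqlite3 sketch (the database and file names are made up, file2.words is assumed to be file2 already flattened to one word per line, and bulk-loading 45 GB this way is slow; it only illustrates the indexed lookup):

sqlite3 words.db <<'EOF'
CREATE TABLE patterns(word TEXT);
CREATE TABLE data(word TEXT);
.import file1 patterns
.import file2.words data
-- The index is what avoids the quadratic scan.
CREATE INDEX patterns_word ON patterns(word);
SELECT DISTINCT d.word FROM data d JOIN patterns p ON d.word = p.word;
EOF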
Your best bet is probably taking a step back and rethinking your problem.