Grep multiple strings on large files

Posted by 你说的曾经没有我的故事 on 2020-01-07 05:41:20

Question


I have a large number of large log files (each log file is around 200 MB, and I have 200 GB of data in total).

Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 to a new file.

First I used grep with 1 parameter. Setting LC_ALL=C made it a little faster, and switching to fgrep made it slightly faster again. Then I used parallel:

parallel -j 2 --pipe --block 20M

and finally I was able to extract 1 parameter from each 200 MB file in 5 seconds.

BUT when I combine multiple parameters with pipes (|) in one grep

parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt

then the time for the grep operation increased linearly (it now takes several minutes to grep 1 file). (Note that I had to use egrep for the multiple pipes; for some reason plain grep didn't like them.)
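A small sketch of why plain grep rejects the pipes: without -E, grep uses basic regular expressions, in which an unescaped `|` is a literal character rather than alternation. The sample lines and parameter names below are made up for illustration.

```shell
# Made-up sample lines standing in for a real log.
sample=$(printf '%s\n' 'param1=10' 'param2=20' 'param3=30')

# Basic regex: "|" is literal, so this looks for the text "param1|param3".
# grep exits non-zero when nothing matches, hence the "|| true".
bre_count=$(printf '%s\n' "$sample" | grep -c 'param1|param3' || true)

# Extended regex (-E, same as egrep): "|" means "or".
ere_count=$(printf '%s\n' "$sample" | grep -c -E 'param1|param3')

echo "basic: $bre_count, extended: $ere_count"
```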

Is there a faster/better way to solve this problem?

Note that I don't need regex, because the patterns I am looking for are fixed. I just want to extract the lines that include a particular string.


Answer 1:


In response to the comments above, I ran another test. I took the output file of an md5deep -rZ command (size: 319 MB) and randomly selected 100 MD5 checksums from it (each 32 characters long).

The command

time egrep '100|fixed|strings' md5 >/dev/null

takes

real    0m16.888s
user    0m16.714s
sys     0m0.172s

For

time fgrep -f 100_lines_patt_file md5 >/dev/null

the time is

real    0m1.379s
user    0m1.220s
sys     0m0.158s

That is roughly 12 times faster than egrep (16.9 s vs. 1.4 s real time).
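As a rough illustration of the pattern-file approach measured above, here is a minimal sketch. The file names, log lines, and parameter names are hypothetical stand-ins, not the original data.

```shell
# Hypothetical scratch files; replace with your own log and pattern list.
workdir=$(mktemp -d)
log="$workdir/log.txt"
patterns="$workdir/patterns.txt"

# A tiny stand-in for a real log file.
printf '%s\n' \
  '2020-01-07 05:40 param1=10' \
  '2020-01-07 05:40 param2=20' \
  '2020-01-07 05:40 param3=30' > "$log"

# One fixed string per line; nothing in it is treated as a regex.
printf '%s\n' 'param1=' 'param3=' > "$patterns"

# -F: fixed strings (same as fgrep), -f: read the patterns from a file.
# LC_ALL=C skips multibyte locale handling.
matches=$(LC_ALL=C grep -F -f "$patterns" "$log")
printf '%s\n' "$matches"
```

Note that a fixed string such as `param1` would also match `param10`, so it helps to include a delimiter in each pattern (as with the `=` here) or to add -w for whole-word matching.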

So if you see only a 0.3 s improvement between egrep and fgrep, IMHO that means:

  • your I/O is too slow

The run time of egrep is then limited by I/O rather than by the processor or memory, which (IMHO) is why you don't get any speedup from fgrep.




Answer 2:


Interestingly, compressing the log files into .gz format and using zgrep -E reduced the time dramatically. It also didn't matter whether I searched for 1 pattern or multiple patterns in a single zgrep command; either way it took about 1 second per 200 MB file.
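A minimal sketch of that workflow, using made-up file and parameter names. The cause of the speedup was not measured here; one plausible assumption is that much less data has to be read from disk.

```shell
# Hypothetical scratch file standing in for a real 200 MB log.
workdir=$(mktemp -d)
log="$workdir/log.txt"

printf '%s\n' \
  '2020-01-07 05:40 param1=10' \
  '2020-01-07 05:40 param2=20' \
  '2020-01-07 05:40 param3=30' > "$log"

# Compress to log.txt.gz; writing via -c leaves the original file in place.
gzip -c "$log" > "$log.gz"

# zgrep decompresses on the fly and passes -E on to grep.
zmatches=$(zgrep -E 'param1=|param3=' "$log.gz")
printf '%s\n' "$zmatches"
```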



Source: https://stackoverflow.com/questions/17475791/grep-multiple-strings-on-large-files
