Question
I have a large number of large log files (each log file is around 200 MB, and I have 200 GB of data in total).
Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 into a new file.
First I used grep with 1 parameter; setting LC_ALL=C made it a little faster, and switching to fgrep was slightly faster again. Then I used parallel:
parallel -j 2 --pipe --block 20M
and finally, for every 200 MB file, I was able to extract 1 parameter in 5 seconds.
But when I put multiple parameters into one grep:
parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt
the time for the grep operation increased linearly (it now takes several minutes to grep 1 file). (Note that I had to use egrep for the | alternation; plain grep did not accept it.)
Is there a faster/better way to solve this problem?
Note that I don't need regex, because the patterns I am looking for are fixed. I just want to extract the lines that include a particular string.
Answer 1:
In response to the comments above, I ran another test. I took the output file of the md5deep -rZ command (size: 319 MB) and randomly selected 100 md5 checksums (each 32 characters long).
For
time egrep '100|fixed|strings' md5 >/dev/null
the time is
real 0m16.888s
user 0m16.714s
sys 0m0.172s
For
time fgrep -f 100_lines_patt_file md5 >/dev/null
the time is
real 0m1.379s
user 0m1.220s
sys 0m0.158s
That is roughly 12 times faster than egrep (16.9 s versus 1.4 s).
So if you get only a 0.3 second improvement between egrep and fgrep, IMHO that means:
- your IO is too slow
The egrep run time is limited not by processor or memory but by IO, and (IMHO) that is why you don't get any speed improvement with fgrep.
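The pattern-file approach from this answer can be sketched on a tiny made-up log as follows. The parameter names (cpu_load, disk_io and so on), the file names, and the temporary directory are hypothetical stand-ins, not taken from the question. grep -F is the same thing as fgrep, and -f reads one fixed string per line from a file, so 100 parameters would simply be a 100-line patterns file.

```shell
#!/bin/sh
# Hypothetical sample data: names and values are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/log.txt" <<'EOF'
2013-07-04 12:00 cpu_load 0.42
2013-07-04 12:00 mem_free 1024
2013-07-04 12:00 disk_io 17
2013-07-04 12:10 cpu_load 0.40
2013-07-04 12:10 net_rx 300
EOF

# One fixed string per line; the real file would list all 100 parameters.
printf '%s\n' cpu_load disk_io > "$workdir/patterns.txt"

# -F: treat every pattern as a literal string (no regex engine involved).
# -f: read the patterns from a file instead of the command line.
# LC_ALL=C: byte-wise matching, skipping multibyte locale handling.
LC_ALL=C grep -F -f "$workdir/patterns.txt" "$workdir/log.txt" \
    > "$workdir/extracted.txt"

# Show the lines that were extracted.
cat "$workdir/extracted.txt"
```

To check whether a real run is IO-bound, compare the real and user figures reported by time: in the measurements above they are almost equal, which points to CPU-bound work, while a real time far above user plus sys would suggest the process is mostly waiting on the disk.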
Answer 2:
Interestingly, compressing the log files into .gz format and using zgrep -E reduced the time dramatically. It also didn't matter whether I searched for 1 pattern or multiple patterns in a single zgrep command; either way it took about 1 second per 200 MB file.
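A minimal sketch of that workflow, assuming gzip and zgrep are installed; the log content and file names are invented for illustration. One plausible reading of the speedup, not confirmed in the thread, is that a compressed file means far fewer bytes are read from disk, which would fit the IO-bound diagnosis in the first answer.

```shell
#!/bin/sh
# Hypothetical sample log: three parameters written at one timestamp.
workdir=$(mktemp -d)
printf '%s\n' \
    '2013-07-04 12:00 param1 10' \
    '2013-07-04 12:00 param2 20' \
    '2013-07-04 12:00 param3 30' > "$workdir/log.txt"

# Compress in place: this creates log.txt.gz and removes log.txt.
gzip "$workdir/log.txt"

# zgrep decompresses on the fly and passes -E through to grep,
# so several alternatives can be matched in a single pass.
zgrep -E 'param1|param3' "$workdir/log.txt.gz" > "$workdir/extracted.txt"

# Show the lines that were extracted.
cat "$workdir/extracted.txt"
```

Whether this helps on a given machine depends on how the cost of decompression compares with the cost of reading the uncompressed file, so it is worth timing both variants on a real log before converting everything.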
Source: https://stackoverflow.com/questions/17475791/grep-multiple-strings-on-large-files