Performance issue with parsing large log files (~5gb) using awk, grep, sed

孤城傲影 2021-02-06 04:47

I am currently dealing with log files of approximately 5 GB each. I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible. While searchi

4 answers
  •  再見小時候
    2021-02-06 05:06

    You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:

    time fgrep '2064351200' example.log >/dev/null
    time egrep '2064351200' example.log >/dev/null
    time sed -e '/2064351200/!d' example.log >/dev/null
    time awk '/2064351200/ {print}' example.log >/dev/null
    

    Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.
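    As a minimal sketch of the "run them several times" advice, the loop below repeats each candidate and then checks that all of them select the same lines, so the timings compare like with like. The generated sample log, the variable names, and the run count are assumptions for illustration; point log at your real file instead. It uses grep -F and grep -E, the modern spellings of fgrep and egrep, and relies on the bash time keyword.

```shell
#!/usr/bin/env bash
# Hypothetical sample data so the sketch is self-contained.
log=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 50000; i++) printf "req %d action: example\n", 2064351000 + i % 400 }' >"$log"

pattern=2064351200
runs=3

# Repeat every candidate; the first pass may include cold-cache disk reads.
for run in $(seq 1 "$runs"); do
    echo "== run $run ==" >&2
    time grep -F "$pattern" "$log" >/dev/null
    time grep -E "$pattern" "$log" >/dev/null
    time sed -n "/$pattern/p" "$log" >/dev/null
    time awk "/$pattern/" "$log" >/dev/null
done

# Sanity check: a timing comparison only means something if every tool
# selects the same number of lines.
grep_count=$(grep -F -c "$pattern" "$log")
sed_count=$(sed -n "/$pattern/p" "$log" | grep -c '')
awk_count=$(awk "/$pattern/ { n++ } END { print n + 0 }" "$log")
echo "matches: grep=$grep_count sed=$sed_count awk=$awk_count"
```

    Compare the later runs rather than the first one if you want to see CPU cost without the disk read mixed in.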

    As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but it then requires more CPU time to process because it has to be decompressed first. Try something like this:

    compress -c example.log >example.log.Z
    time zgrep '2064351200' example.log.Z >/dev/null
    gzip -c example.log >example.log.gz
    time zgrep '2064351200' example.log.gz >/dev/null
    bzip2 -k example.log
    time bzgrep '2064351200' example.log.bz2 >/dev/null
    

    I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
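    To try different gzip levels, something like the following sketch may help. It compresses the same file at levels 1, 6, and 9, reports each archive's size, and times a zgrep over each. The sample log and the choice of levels are assumptions; substitute your own file and whatever levels you care about.

```shell
#!/usr/bin/env bash
# Hypothetical sample data so the sketch is self-contained.
log=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 50000; i++) printf "req %d action: example\n", 2064351000 + i % 400 }' >"$log"

pattern=2064351200

for level in 1 6 9; do
    # Compress to a separate file per level, leaving the original in place.
    gzip "-$level" -c "$log" >"$log.$level.gz"
    size=$(wc -c <"$log.$level.gz")
    echo "gzip -$level: $size bytes" >&2
    # Time the search over the compressed copy.
    time zgrep "$pattern" "$log.$level.gz" >/dev/null
done
```

    On a real multi-gigabyte log, run the loop more than once before drawing conclusions, for the same caching reasons as above.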

    Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.
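    A rough sketch of that preselection idea: filter once into a small compressed file, then run the repeated searches against it. The sample data, the 'action:' filter, and the file name actions.log.gz are all hypothetical; the right filter depends on which lines your later searches can possibly need.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
log="$workdir/example.log"

# Hypothetical sample: three lines in four are heartbeat noise.
awk 'BEGIN {
    for (i = 1; i <= 40000; i++) {
        if (i % 4 == 0) printf "req %d action: example\n", 2064351000 + i % 400
        else            printf "req %d heartbeat\n", 2064351000 + i % 400
    }
}' >"$log"

# One-off cost: keep only the lines later searches might care about.
grep -F 'action:' "$log" | gzip -c >"$workdir/actions.log.gz"

# Later searches read the small file instead of the full log.
full_hits=$(grep -F '2064351200' "$log" | grep -F -c 'action: example')
subset_hits=$(gzip -cd "$workdir/actions.log.gz" | grep -F -c '2064351200')
echo "full log: $full_hits hits, preselected subset: $subset_hits hits"
```

    The two counts should agree; if they do not, the preselection filter is throwing away lines you need.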

    A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave, fgrep '2064351200' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.
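    To check the piping question on your own machine, a sketch along these lines times the two-stage pipe against a single awk command and confirms that both produce the same output. The sample data and output file names are assumptions; with the bash time keyword, the first timing covers the whole pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical sample data so the sketch is self-contained.
log=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 40000; i++)
        printf "req %d action: %s\n", 2064351000 + i % 400, (i % 3 ? "other" : "example")
}' >"$log"

# Two-stage pipe: the first grep discards most lines, the second sees only the rest.
time grep -F '2064351200' "$log" | grep -F 'action: example' >"$log.piped"

# Single command doing both tests on every line.
time awk '/2064351200/ && /action: example/' "$log" >"$log.single"

# The comparison is only fair if both approaches select the same lines.
if cmp -s "$log.piped" "$log.single"; then
    echo "both approaches selected the same lines"
else
    echo "outputs differ" >&2
fi
```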

    tl;dr TEST ALL THE THINGS!

    EDIT: if the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocess approach: compress (and maybe prescan) the log, then when scanning use the compressed (and/or prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:

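    # Precompress:
    gzip -v -c example.log >example.log.gz
    # Despite its name, this holds the UNCOMPRESSED byte count that the archive covers.
    # Count the decompressed bytes directly: gzip -l reads a 32-bit size field, so
    # older versions misreport files of 4 GB or more.
    compressedsize=$(gzip -cd example.log.gz | wc -c)
    
    # Search the compressed file + recent additions
    # (tail -c +N starts AT byte N, hence the +1 to avoid repeating the last byte):
    { gzip -cdfq example.log.gz; tail -c +$((compressedsize + 1)) example.log; } | egrep '2064351200'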
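    Before relying on the compressed-plus-tail approach, it may be worth checking that the stitched stream really equals the live log. The sketch below simulates a log that grows after the snapshot and compares the two. The file names, the variable covered, and the simulated entries are assumptions; note that tail -c +N starts output at byte N, so the sketch adds 1 to skip exactly the bytes already in the archive.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
log="$workdir/example.log"

# Hypothetical "static" part of the log.
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "old entry %d\n", i }' >"$log"

# Snapshot: compress, then record how many uncompressed bytes the archive covers.
gzip -c "$log" >"$log.gz"
covered=$(gzip -cd "$log.gz" | wc -c)

# Simulate the log growing after the snapshot was taken.
awk 'BEGIN { for (i = 1; i <= 50; i++) printf "new entry %d\n", i }' >>"$log"

# Stitch the archive and the newer bytes back together.
{ gzip -cd "$log.gz"; tail -c +$((covered + 1)) "$log"; } >"$workdir/stitched.log"

if cmp -s "$workdir/stitched.log" "$log"; then
    echo "stitched stream matches the live log"
else
    echo "mismatch: check the byte offset" >&2
fi
```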
    

    If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:

    # Prescan for a particular request (repeat for each request you'll be working with):
    gzip -cdfq example.log.gz | egrep '2064351200' > prescan-2064351200.log
    
    # Search the prescanned file + recent additions:
    { cat prescan-2064351200.log; tail -c +$((compressedsize + 1)) example.log | egrep '2064351200'; } | egrep 'action: example'
    
