Grepping a huge file (80GB): any way to speed it up?

既然无缘 2020-11-29 15:52
 grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

This has been running for an hour on a fairly powerful linux server which is otherwise not

5 answers
  • 2020-11-29 15:52

    Some trivial improvements:

    • Remove the -i option if you can; case-insensitive matching is quite slow.

    • Replace the . with \.

      An unescaped dot is the regex metacharacter that matches any character, which is also slow.
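
    As a rough illustration of the dot issue, here is a small sketch on a throwaway sample file (the file contents and the variable names are invented for this demo, they are not from the question). The unescaped pattern also counts a line where another character stands in for the dot, while the escaped pattern and the fixed-string -F form count only the literal text.

```shell
# Build a tiny sample file; its contents are made up for this demo.
sample=$(mktemp)
printf '%s\n' 'INSERT INTO db_pd.Clients VALUES (1);' \
              'INSERT INTO db_pdXClients VALUES (2);' > "$sample"

# Unescaped dot: matches any character, so both lines are counted.
loose=$(grep -c 'db_pd.Clients' "$sample")

# Escaped dot: only the literal "db_pd.Clients" is counted.
strict=$(grep -c 'db_pd\.Clients' "$sample")

# -F treats the whole pattern as a fixed string, so no escaping is needed.
fixed=$(grep -c -F 'db_pd.Clients' "$sample")

echo "loose=$loose strict=$strict fixed=$fixed"
rm -f "$sample"
```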

  • 2020-11-29 15:55

    If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

    < eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
    

    Depending on your disks and CPUs it may be faster to read larger blocks:

    < eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
    

    It's not entirely clear from your question, but other options for grep include:

    • Dropping the -i flag.
    • Using the -F flag for a fixed string
    • Disabling NLS with LANG=C
    • Setting a max number of matches with the -m flag.
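
    The grep options listed above can be combined. The following is a minimal sketch on an invented sample file, not a benchmark. It uses LC_ALL=C rather than LANG=C because LC_ALL takes precedence over LANG if both are set, and the limit of 2 for -m is an arbitrary value chosen for the demo.

```shell
# Sample data invented for the demo: three matching lines among filler.
sample=$(mktemp)
printf '%s\n' 'filler' 'db_pd.Clients a' 'filler' \
              'db_pd.Clients b' 'db_pd.Clients c' > "$sample"

# LC_ALL=C : byte-wise matching in the C locale
# -F       : fixed string, no regex engine involved
# -m 2     : stop reading the file after two matching lines
first_two=$(LC_ALL=C grep -F -m 2 'db_pd.Clients' "$sample")

# Count the lines grep returned (tr strips padding some wc versions add).
count=$(printf '%s\n' "$first_two" | wc -l | tr -d ' ')

echo "$first_two"
rm -f "$sample"
```

    The -m flag only helps when you know how many matches you need, since grep stops scanning as soon as the limit is reached.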
  • 2020-11-29 16:02
    < eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  
    

    If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n values shown seemed to work best for my use case. The -F option to grep also made a big difference.

  • 2020-11-29 16:15

    Here are a few options:

    1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

    2) Use fgrep because you're searching for a fixed string, not a regular expression.

    3) Remove the -i option, if you don't need it.

    So your command becomes:

    LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
    

    It will also be faster if you copy your file to a RAM disk.

  • 2020-11-29 16:15

    Two lines of attack:

    • Are you sure you need the -i, or is there a way to get rid of it?
    • Do you have more cores to play with? grep is single-threaded, so you might want to start several instances at different offsets in the file.
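
    One way to start several greps at different offsets is to cut the file into byte ranges with tail and head and run one grep per range in the background. This is only a sketch under stated assumptions: the sample file, the number of workers (njobs) and the output directory are invented for the demo. Byte ranges ignore line boundaries, so a line that straddles two chunks is split, and -A/-B context near a boundary would be incomplete. GNU parallel with --pipe splits on newlines instead, which avoids the first problem.

```shell
# Sample file invented for the demo: 1000 lines, every 100th one matches.
sample=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 1000; i++) {
    if (i % 100 == 0) { print "db_pd.Clients row " i }
    else              { print "filler row " i }
  }
}' > "$sample"

size=$(wc -c < "$sample" | tr -d ' ')
njobs=4                                    # hypothetical worker count
chunk=$(( (size + njobs - 1) / njobs ))    # bytes per worker, rounded up
outdir=$(mktemp -d)

i=0
while [ "$i" -lt "$njobs" ]; do
  start=$(( i * chunk + 1 ))               # tail -c +N counts from byte 1
  # Each worker reads its own byte range and greps it in the background.
  tail -c "+$start" "$sample" | head -c "$chunk" \
    | LC_ALL=C grep -F 'db_pd.Clients' > "$outdir/part$i" &
  i=$(( i + 1 ))
done
wait                                       # let all workers finish

# Compare the combined result with a plain single-pass grep.
total=$(cat "$outdir"/part* | wc -l | tr -d ' ')
single=$(LC_ALL=C grep -c -F 'db_pd.Clients' "$sample")
echo "parallel=$total single=$single"

rm -rf "$outdir" "$sample"
```

    On a real 80GB file this only pays off if the disk can feed several readers at once; otherwise the workers just compete for I/O.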