Faster grep function for big (27GB) files

闹比i · 2021-02-01 10:05

I have a file (5MB) containing specific strings, and I need to grep those same strings (along with other information) out of a big file (27GB). To speed up the analysis, I split the 27GB file into smaller files.

4 Answers
  •  小鲜肉 · 2021-02-01 10:36

    A few things you can try:

    1) You are reading input.sam multiple times. It only needs to be read once, before your first loop starts. Save the IDs to a temporary file that grep will read.

    2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.

    3) Use fgrep (equivalent to grep -F) because you're searching for fixed strings, not regular expressions.

    4) Use -f to make grep read patterns from a file, rather than using a loop.

    5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.

    After making those changes, this is what your script would become:

    awk '{print $1}' input.sam > idsFile.txt
    for z in {a..z}
    do
     for x in {a..z}
     do
      for y in {a..z}
      do
        LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
      done >> output.txt
     done
    done

    Also, check out GNU Parallel, which is designed to help you run jobs in parallel; a minimal sketch follows.
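
    For example, a minimal sketch with GNU Parallel (assuming parallel is installed and the chunk files are named sample_aaa through sample_zzz, as in the loops above) could look like this:

    # Build the pattern file once, then run one fgrep job per chunk file.
    # GNU Parallel buffers each job's output and prints it as a unit, so
    # lines from different chunks are not interleaved in output.txt.
    awk '{print $1}' input.sam > idsFile.txt
    parallel --keep-order \
      "LC_ALL=C fgrep -f idsFile.txt {} | awk '{print \$1,\$10,\$11}'" \
      ::: sample_??? > output.txt

    Here {} is replaced by each chunk filename, and --keep-order makes the results appear in the same order as the input files, so the output should match what the sequential loops would produce.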
