Question
I am trying to process a 1.5 GB file line by line with a bash loop. I used cut
for its (relative) simplicity and ended up with:
while read line
do
echo "$(echo $line | cut -d' ' -f 2-3)" "$(echo $line | cut -d'"' -f 20)"
done < TEST.log > IDS.log
This is very slow, processing only about 2 KB/sec, and I need it to run a lot faster.
Also, what is the bottleneck here?
Answer 1:
The bottleneck is likely that you spawn several processes for every line of data. As for a replacement, this awk should be equivalent:
awk '{ split($0, a, "\""); print $2, $3, a[20] }' TEST.log > IDS.log
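As a sanity check, the same split can be exercised on a scaled-down, made-up sample line, where the 4th quote-delimited field stands in for your 20th:

# Made-up sample line: space fields 2-3 are /index and 200, and the
# 4th quote-delimited field plays the role of your 20th.
printf '%s\n' 'GET /index 200 "agent one" - "agent two"' \
  | awk '{ split($0, a, "\""); print $2, $3, a[4] }'
# prints: /index 200 agent two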
Answer 2:
Perl is usually very fast:
perl -nE 'say join " ", (split " ")[1,2], (split /"/)[19]' TEST.log > IDS.log
Note that Perl list slices are 0-indexed, which is why cut's fields 2-3 and 20 appear here as [1,2] and [19].
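On the same made-up sample line as above, the 0-based slices look like this (the 4th quote-delimited field is index 3):

# Same made-up line; (split " ")[1,2] are space fields 2-3,
# (split /"/)[3] is the 4th quote-delimited field (0-based).
printf '%s\n' 'GET /index 200 "agent one" - "agent two"' \
  | perl -nE 'say join " ", (split " ")[1,2], (split /"/)[3]'
# prints: /index 200 agent two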
Answer 3:
The biggest bottleneck here is spinning off the subprocesses for your pipelines. You can get a substantial (read: orders-of-magnitude) performance improvement just by getting rid of the command substitutions and pipelines.
while IFS=$'\x01' read -r ss1 ss2 ss3 _ <&3 && \
      IFS='"' read -r -a quote_separated_fields; do
  printf '%s\n' "${ss2} ${ss3} ${quote_separated_fields[19]}"
done < TEST.log 3< <(tr ' ' $'\x01' <TEST.log) > IDS.log
How does this work?

tr ' ' $'\x01' changes spaces in the input to a low-ASCII character to avoid special-case handling (read would otherwise coalesce runs of whitespace into a single separator). Putting this after 3< <(...) puts the output of this operation on file descriptor 3.

IFS=$'\x01' read -r ss1 ss2 ss3 _ <&3 splits a line on those characters, putting the first field into ss1 (which we don't care about), the second into ss2, the third into ss3, and the remainder of the line into _. The <&3 causes this line to read from file descriptor 3.

IFS='"' read -r -a quote_separated_fields splits the input on stdin (FD 0) on " characters into an array called quote_separated_fields. Since bash arrays are 0-indexed, the 20th quote-delimited field is ${quote_separated_fields[19]}.
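Here is a minimal sketch of just the two-stream mechanism, assuming a hypothetical sample.txt: the loop reads the raw file on stdin and a tr-transformed copy on FD 3, consuming one line from each per iteration, so the two reads stay in lockstep:

# sample.txt is a hypothetical input file; each iteration reads one
# line from stdin and the matching line from FD 3.
while read -r raw && read -r mangled <&3; do
  printf 'stdin: %s | fd3: %s\n' "$raw" "$mangled"
done < sample.txt 3< <(tr ' ' '_' < sample.txt)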
Source: https://stackoverflow.com/questions/28696604/bash-while-loop-cut-slow