Question
I am trying to process a 1.5 GB file line by line with a bash loop. I used cut
for its (relative) simplicity and ended up with:
while read line
do
echo "$(echo $line | cut -d' ' -f 2-3)" "$(echo $line | cut -d'"' -f 20)"
done < TEST.log > IDS.log
This is very slow, processing only about 2 KB/sec, and I need it to run a lot faster.
Also, what is the bottleneck here?
Answer 1:
The bottleneck is likely that you spawn several processes for every line of data. As for a replacement, this awk should be equivalent:
awk '{ split($0, a, "\""); print $2, $3, a[20] }' TEST.log > IDS.log
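As a sanity check, the same split can be exercised on a scaled-down, made-up sample line, where the 4th quote-delimited field stands in for your 20th:

# Made-up sample line: space fields 2-3 are /index and 200, and the
# 4th quote-delimited field plays the role of your 20th.
printf '%s\n' 'GET /index 200 "agent one" - "agent two"' \
  | awk '{ split($0, a, "\""); print $2, $3, a[4] }'
# prints: /index 200 agent two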
Answer 2:
Perl is usually very fast:
perl -nE 'say join " ", (split " ")[1,2], (split /"/)[19]' TEST.log > IDS.log
Note that Perl list slices are 0-indexed, which is why cut's fields 2-3 and 20 appear here as [1,2] and [19].
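On the same made-up sample line as above, the 0-based slices look like this (the 4th quote-delimited field is index 3):

# Same made-up line; (split " ")[1,2] are space fields 2-3,
# (split /"/)[3] is the 4th quote-delimited field (0-based).
printf '%s\n' 'GET /index 200 "agent one" - "agent two"' \
  | perl -nE 'say join " ", (split " ")[1,2], (split /"/)[3]'
# prints: /index 200 agent two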
Answer 3:
The biggest bottleneck here is spinning off the subprocesses for your pipelines. You can get a substantial (read: orders-of-magnitude) performance improvement just by getting rid of the command substitutions and pipelines.
while IFS=$'\x01' read -r ss1 ss2 ss3 _ <&3 && \
      IFS='"' read -r -a quote_separated_fields; do
  printf '%s\n' "${ss2} ${ss3} ${quote_separated_fields[19]}"
done < TEST.log 3< <(tr ' ' $'\x01' <TEST.log) > IDS.log
How does this work?

tr ' ' $'\x01' changes spaces in the input to a low-ASCII character to avoid special-case handling (read would otherwise coalesce runs of whitespace into a single separator). Putting this after 3< <(...) puts the output of this operation on file descriptor 3.

IFS=$'\x01' read -r ss1 ss2 ss3 _ <&3 splits a line on those characters, putting the first field into ss1 (which we don't care about), the second into ss2, the third into ss3, and the remainder of the line into _. The <&3 causes this line to read from file descriptor 3.

IFS='"' read -r -a quote_separated_fields splits the input on stdin (FD 0) on " characters into an array called quote_separated_fields. Since bash arrays are 0-indexed, the 20th quote-delimited field is ${quote_separated_fields[19]}.
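Here is a minimal sketch of just the two-stream mechanism, assuming a hypothetical sample.txt: the loop reads the raw file on stdin and a tr-transformed copy on FD 3, consuming one line from each per iteration, so the two reads stay in lockstep:

# sample.txt is a hypothetical input file; each iteration reads one
# line from stdin and the matching line from FD 3.
while read -r raw && read -r mangled <&3; do
  printf 'stdin: %s | fd3: %s\n' "$raw" "$mangled"
done < sample.txt 3< <(tr ' ' '_' < sample.txt)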
Source: https://stackoverflow.com/questions/28696604/bash-while-loop-cut-slow