Question:
So I have the following sed one-liner:
sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt
I have many lines that start with either:
S|
T|
#D=
##
H|
Q|
The idea is to drop the lines starting with one of the first four, and to replace H|
(at the beginning of lines) with ,H|
and Q|
(at the beginning of lines) with ,,Q|
But now I would need to:
- use the fastest way possible (internet suggests (m)awk is faster than sed)
- read from a .txt.gz file and save the result in a .txt.gz file, avoiding, if possible, the intermediate un-zip/re-zip
There are in fact several hundred .txt.gz files, each about ~1GB, to process in this way (all in the same folder). Is there a CLI way to run the code in parallel on all of them (so each core gets assigned a subset of the files in the directory)?
I use Linux (Ubuntu).
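For concreteness, here is a minimal run of that one-liner on a few invented sample lines (the data and filenames are made up for illustration; note the GNU sed one-line form of i\):

```shell
# Invented sample; the first line is kept so the "1 i\,,," header shows up.
printf 'H|head\nS|skip\nT|skip\nQ|data\nplain\n' > sample_1.txt
sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' \
    -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt
# sample_2.txt now contains: ",,,", ",H|head", ",,Q|data", "plain"
```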
Answer 1:
Untested, but likely pretty close to this with GNU Parallel.
First make output directory so as not to overwrite any valuable data:
mkdir -p output
Now declare a function that does one file and export it to subprocesses so jobs started by GNU Parallel can find it:
doit(){
  echo "Processing $1"
  zcat "$1" | awk '
    /^[ST]\|/ || /^#D=/ || /^##/ {next}  # drop lines starting S|, T|, #D= or ##
    /^H\|/ {printf ","}                  # prefix "H|" lines with "," (no newline)
    /^Q\|/ {printf ",,"}                 # prefix "Q|" lines with ",,"
    1                                    # print all remaining lines
  ' | gzip > output/"$1"
}
export -f doit
Now process all txt.gz files in parallel, showing a progress bar too:
parallel --bar doit ::: *txt.gz
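Before running over hundreds of 1GB files, the filtering logic can be sanity-checked on a few invented lines, with no gzip involved (printf without a newline is what joins the prefix to the line printed by the final 1 rule):

```shell
printf 'S|a\nT|b\n#D=c\n##d\nH|e\nQ|f\nplain\n' | awk '
  /^[ST]\|/ || /^#D=/ || /^##/ {next}  # drop unwanted record types
  /^H\|/ {printf ","}                  # no newline, so the prefix joins the line
  /^Q\|/ {printf ",,"}
  1                                    # print the (possibly prefixed) line
'
# prints ",H|e", ",,Q|f", "plain"
```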
Answer 2:
Was something like this what you had in mind?
#!/bin/bash
export LC_ALL=C
zcat sample_1.txt.gz | gawk '
  $1 !~ /^([ST]\||#D=|##)/ {
    switch ($0) {
    case /^H\|/:
      print "," $0
      break
    case /^Q\|/:
      print ",," $0
      break
    default:
      print $0
    }
  }' | gzip > sample_2.txt.gz
The export LC_ALL=C tells your environment you aren't expecting extended characters, and can profoundly speed up execution. zcat expands and dumps a gz file to stdout. That is piped into gawk, which checks that the first part of each line does not match the first four character groupings from your question. Lines that pass that test are printed to stdout (massaged as requested). As gawk executes, its stdout is piped into gzip and written to a .txt.gz file.
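Note that switch is a gawk extension; if you end up on mawk for speed, a plain if/else chain does the same job. A sketch, with an invented demo input so it runs end to end (the real files would already exist):

```shell
# Invented demo input so the sketch runs end to end:
printf 'S|a\nH|b\nQ|c\nplain\n' | gzip > sample_1.txt.gz

# Same filter with an if/else chain instead of switch, so it also runs under mawk:
zcat sample_1.txt.gz | awk '
  !/^([ST]\||#D=|##)/ {
    if (/^H\|/)      print "," $0
    else if (/^Q\|/) print ",," $0
    else             print
  }' | gzip > sample_2.txt.gz
```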
It might be possible to use xargs with the -P and -n switches to parallelize your processing, but I think GNU parallel might be easier to work with.
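A hedged sketch of that xargs route, reusing the same per-file worker idea as Answer 1 (the demo filename and output/ directory are invented for illustration; zcat rather than gzcat on Ubuntu):

```shell
#!/bin/bash
mkdir -p output

# Invented demo file so the sketch is runnable end to end:
printf 'S|a\nH|b\nplain\n' | gzip > demo_1.txt.gz

# One-file worker, exported so the bash started by xargs can find it:
doit(){
  zcat "$1" | awk '
    /^[ST]\|/ || /^#D=/ || /^##/ {next}
    /^H\|/ {printf ","}
    /^Q\|/ {printf ",,"}
    1
  ' | gzip > output/"$1"
}
export -f doit

# -P "$(nproc)" runs one job per core; -n 1 hands each worker one file.
printf '%s\0' *.txt.gz | xargs -0 -n 1 -P "$(nproc)" bash -c 'doit "$1"' _
```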
Source: https://stackoverflow.com/questions/50915850/quickest-way-to-select-copy-lines-containing-string-from-huge-txt-gz-file