There are loads of threads about passing a shell variable to awk, and I\'ve figured that out easily enough, but the variable I want to pass is the column specifier variable (
I'm going to add this as an answer because it does resolve the question as I posed it, notwithstanding Charles' excellent advice about the (myriad) areas I was going wrong.
Altering the above code with Charles' point about the separate awk commands, I can now invoke the following (sorry yes, it's still using paste
)...
#!/bin/bash
keyfile="$1"
filetosort="$2"
indexfield="$3"
paste "$keyfile" <(awk -v field="$indexfield" 'NR==FNR{o[FNR]=$field; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' "$keyfile" "$filetosort")
I was missing the $
before the variable I was calling in the awk command, which is (part) of why my original code wasn't working, as well as not including the awk variable declaration in a single awk call.
Thus, bash sortandmatch.sh keyfile filetosort 3
produces the output I want:
PVCunit2_5 plu1704 PLT_01732 PLT_01732 4etv 4etv_A 39.0 12 0.00032 27.6 >4etv_A Ryanodine receptor 2; phosphorylation, cardiac, metal transport; 1.65A {Mus musculus}
PVCunit2_4 plu1705 PLT_01733 PLT_01733 3j9q 3j9q_A 99.9 7.2e-30 1.9e-34 219.0 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
XVC_pnf15 XBW1_RS06910 XBW1_RS06910 XBW1_RS06910 1fi0 1fi0_A 69.2 1.7 4.4e-05 22.8 >1fi0_A VPR protein, R ORF protein; helix, viral protein; NMR {Synthetic} SCOP: j.11.1.1
PVCcif7 PAU_01999 PAU_01967 PAU_01967 5a3a 5a3a_A 47.5 7.3 0.00019 30.9 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PVClumT15 PAU_02233 PAU_02192 PAU_02192 1tdp 1tdp_A 22.1 37.0 0.00096 27.2 >1tdp_A Carnobacteriocin B2 immunity protein; four-helix bundle, antimicrobial protein; NMR {Carnobacterium maltaromaticum} SCOP: a.29.8.1
XVC_pnf3 XBW1_RS06850 XBW1_RS06850 XBW1_RS06850 3eaa 3eaa_A 87.7 0.13 3.4e-06 35.7 >3eaa_A EVPC; T6SS, unknown function; 2.79A {Edwardsiella tarda}
PVCunit1_4 afp4 PAU_02778 PAU_02778 3j9q 3j9q_A 99.9 3.6e-29 9.5e-34 214.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVCunit2_3 plu1706 PLT_01734 PLT_01734 3j9q 3j9q_A 100.0 1.6e-34 4.3e-39 253.7 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVClumt17 PAK_2200 PAK_01998 PAK_01998 3k8p 3k8p_C 34.7 16.0 0.00041 34.1 >3k8p_C DSL1, KLLA0C02695P; intracellular trafficking, DSL1 complex, multisubunit tethering complex, snare proteins; 2.60A {Kluyveromyces lactis}
PVClopT12 PAU_02101 PAU_02063 PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
When you pass awk -v a="$field"
, the specification of the awk variable a
is only good for that single awk
command. You can't expect a
to be available in a completely different invocation of awk
.
Thus, you need to put it in-place directly:
$ bashvar="2"
$ echo 'foo bar baz' | awk -v awkvar="$bashvar" '{print $awkvar}'
bar
Or in your case:
field=1
awk -v a="$field" '
NR==FNR {
o[FNR]=$a;
next;
}
{ t[$1] = $0 }
END {
for(x=1; x<=FNR; x++) {
y=o[x]
printf("%s\t%s\n", y, t[y])
}
}' "$keyfile" "$filetosort"
Points of note:
printf
here is emitting both the key and the value, so there's no need to use paste
to put the keyfile
values back in.$a
is used to treat the awk variable a
(assigned from shell variable field
) as a variable name itself, and to perform an indirect reference -- thus, looking up the relevant column number.awk
will be generated by the expansion of $keyfile
-- it could be 0 (if there are no characters in the string not found in IFS); it could be 1, but it could also be a completely unbounded number (input file.txt
would become two arguments, input
and file.txt
; * input * .txt
would have each *
replaced with a list of files).