Question
How can I tell gawk to use two different delimiters so that I can separate some columns, but select others using the tab-delimited format of my file?
> cat broad_snps.tab
chrsnpID rsID freq_bin snp_maf gene_count dist_nearest_gene_snpsnap dist_nearest_gene_snpsnap_protein_coding dist_nearest_gene dist_nearest_gene_located_within loci_upstream loci_downstream ID_nearest_gene_snpsnap ID_nearest_gene_snpsnap_protein_coding ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding ID_genes_in_matched_locus friends_ld01 friends_ld02 friends_ld03 friends_ld04 friends_ld05 friends_ld06 friends_ld07 friends_ld08 friends_ld09 -1
10:10001753 10:10001753 7 0.07455 0 98932.0 1045506.0 98932.0 inf 9986766 10039928 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 253.0 103.0 55.0 40.0 35.0 33.031.0 20.0 0.0 -1
10:10001794 10:10001794 41 0.4105 0 98891.0 1045465.0 98891.0 inf 9964948 10071879 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 365.0 299.0 294.0 266.0 168.0 138.58.0 45.0 0.0 -1
10:100023489 10:100023489 10 0.1054 1 4518.0 4518.0 4518.0 4518.0 100023489 100023489 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1
10:100025128 10:100025128 45 0.4543 1 2879.0 2879.0 2879.0 2879.0 100025128 100025128 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 112.0 70.0 3.0 0.0 0.0
The output I want:
chr10 10001752 10001753 CELF2
chr10 10001793 10001794 CELF2
chr10 100023488 100023489 LOXL4
chr10 100025127 100025128 LOXL4
chr10 10002974 10002975 LOXL4
The command I am currently using:
cat broad_snps.tab | tail -n+2 | gawk -vOFS="\t" -vFS=":" '{ print "chr"$1, ($2 - 1), $2}' | gawk -vOFS="\t" '{print $1, $2, $3}' > broad_SNPs.bed
Returns this:
chr10 10001752 10001753 10
chr10 10001793 10001794 10
chr10 100023488 100023489 10
chr10 100025127 100025128 10
chr10 10002974 10002975 10
chr10 10003391 10003392 10
chr10 100038815 100038816 10
chr10 10008001 10008002 10
chr10 100093012 100093013 10
I'd like to be able to use the ":" delimiter to split up the first column, but I need to use "\t" to pick out the gene ID.
Thanks!
Answer 1:
awk -F'[\t:]' '{print $1, $2, $4, $17}'
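Note that with -F'[\t:]' every ":" also starts a new field, so once the two colon-containing columns have been split, each later column is numbered two higher than its tab-delimited position. A possible alternative, shown here only as a minimal sketch, is to keep tab as the sole field separator and call split() on the first column. The file names sample_snps.tab and sample_snps.bed, the three-column toy layout, and the gene_col variable are made up for illustration and are not taken from the question's data.

```shell
# Hypothetical toy input: a header line plus two tab-delimited rows.
printf 'snpID\trsID\tgene\n10:10001753\t10:10001753\tCELF2\n10:100023489\t10:100023489\tLOXL4\n' > sample_snps.tab

# Tab stays the only field separator; the first field is split on ":" inside awk.
# gene_col=3 matches the toy file only (assumption); NR > 1 skips the header.
awk -F'\t' -v OFS='\t' -v gene_col=3 'NR > 1 {
    split($1, pos, ":")    # pos[1] = chromosome, pos[2] = 1-based position
    print "chr" pos[1], pos[2] - 1, pos[2], $gene_col
}' sample_snps.tab > sample_snps.bed

cat sample_snps.bed
```

For the real file, gene_col would need to be set to whichever tab-delimited column holds the gene symbol. Because the separator is never changed, that number is the same one you would count in the header line.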
Source: https://stackoverflow.com/questions/50142163/use-awk-with-two-different-delimiters-to-split-and-select-columns