Use awk with two different delimiters to split and select columns

Question

How can I tell gawk to use two different delimiters so that I can separate some columns, but select others using the tab-delimited format of my file?

> cat broad_snps.tab

  chrsnpID  rsID    freq_bin    snp_maf gene_count  dist_nearest_gene_snpsnap   dist_nearest_gene_snpsnap_protein_coding    dist_nearest_gene   dist_nearest_gene_located_within    loci_upstream   loci_downstream ID_nearest_gene_snpsnap ID_nearest_gene_snpsnap_protein_coding  ID_nearest_gene ID_nearest_gene_located_within  HGNC_nearest_gene_snpsnap   HGNC_nearest_gene_snpsnap_protein_coding    flag_snp_within_gene    flag_snp_within_gene_protein_coding ID_genes_in_matched_locus   friends_ld01    friends_ld02    friends_ld03    friends_ld04    friends_ld05    friends_ld06    friends_ld07    friends_ld08    friends_ld09    -1    
  10:10001753   10:10001753 7   0.07455 0   98932.0 1045506.0   98932.0 inf 9986766 10039928    ENSG00000224788 ENSG00000048740 ENSG00000224788         CELF2   False   False       253.0   103.0   55.0    40.0    35.0    33.031.0    20.0    0.0 -1  
  10:10001794   10:10001794 41  0.4105  0   98891.0 1045465.0   98891.0 inf 9964948 10071879    ENSG00000224788 ENSG00000048740 ENSG00000224788         CELF2   False   False       365.0   299.0   294.0   266.0   168.0   138.58.0    45.0    0.0 -1  
  10:100023489  10:100023489    10  0.1054  1   4518.0  4518.0  4518.0  4518.0  100023489   100023489   ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4   LOXL4   True    True    ENSG00000138131 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1  
  10:100025128  10:100025128    45  0.4543  1   2879.0  2879.0  2879.0  2879.0  100025128   100025128   ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4   LOXL4   True    True    ENSG00000138131 112.0   70.0    3.0 0.0 0.0

The output I want:

chr10   10001752    10001753    CELF2
chr10   10001793    10001794    CELF2
chr10   100023488   100023489   LOXL4
chr10   100025127   100025128   LOXL4
chr10   10002974    10002975    LOXL4

The command I am currently using:

cat broad_snps.tab | tail -n+2 |  gawk -vOFS="\t" -vFS=":" '{ print "chr"$1, ($2 - 1), $2}' | gawk -vOFS="\t" '{print $1, $2, $3}' > broad_SNPs.bed

Returns this:

chr10   10001752    10001753    10
chr10   10001793    10001794    10
chr10   100023488   100023489   10
chr10   100025127   100025128   10
chr10   10002974    10002975    10
chr10   10003391    10003392    10
chr10   100038815   100038816   10
chr10   10008001    10008002    10
chr10   100093012   100093013   10

I'd like to be able to use the ":" delimiter to split up the first column, but I need to use "\t" to pick out the gene ID.

Thanks!

Answer 1:

awk -F'[\t:]' '{print $1, $2, $4, $17}'
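An alternative that keeps the tab-separated column numbers intact is to leave FS as a single tab and break up only the first column with split(). The sketch below is not from the original answer. It assumes the gene symbol wanted in the output is tab column 17 (HGNC_nearest_gene_snpsnap_protein_coding in the header shown in the question), and it uses a small hypothetical sample file built from two of the question's rows, cut off after column 19.

```shell
# Build a small sample file (hypothetical: two rows taken from the question,
# truncated after column 19; fields are separated by real tab characters).
sample=$(mktemp)
{
  printf 'snpID\trsID\tfreq_bin\n'   # shortened header, skipped by NR > 1 below
  printf '10:10001753\t10:10001753\t7\t0.07455\t0\t98932.0\t1045506.0\t98932.0\tinf\t9986766\t10039928\tENSG00000224788\tENSG00000048740\tENSG00000224788\t\t\tCELF2\tFalse\tFalse\n'
  printf '10:100023489\t10:100023489\t10\t0.1054\t1\t4518.0\t4518.0\t4518.0\t4518.0\t100023489\t100023489\tENSG00000138131\tENSG00000138131\tENSG00000138131\tENSG00000138131\tLOXL4\tLOXL4\tTrue\tTrue\n'
} > "$sample"

# FS is a single tab, so empty columns are preserved and $17 is the 17th
# tab-separated column (assumed here to hold the gene symbol).
# split() breaks only column 1 on ":" into a[1] (chromosome) and a[2] (position).
result=$(awk -F'\t' -v OFS='\t' 'NR > 1 {
    split($1, a, ":")
    print "chr" a[1], a[2] - 1, a[2], $17
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On the two sample rows this prints chr10, 10001752, 10001753, CELF2 and chr10, 100023488, 100023489, LOXL4, tab-separated. With a combined delimiter such as -F'[\t:]', every colon also starts a new field, so field numbers after the colon-containing columns no longer match the tab-separated column numbers and should be checked against the actual file.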

Source: https://stackoverflow.com/questions/50142163/use-awk-with-two-different-delimiters-to-split-and-select-columns

Tags

awk

bioinformatics

gawk