Multifasta header trimming

Question

I have a multifasta file and I need to delete part of the header of every sequence record. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA

I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB".

I tried

 cut -d '-' -f 1 your_file.fasta > new_file.fasta

and

 awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

But this is an alignment file, so the sequence lines contain "-" gap characters too. Both commands also cut the sequences at those gaps, which of course I don't want.

Answer 1:

cut processes every line in the same way, but you only need to process a subset of the lines: the ones that start with >. You can do that with awk:

awk -F- '{if ($0 ~ /^>/) {print $1} else {print $0}}' in.fasta > out.fasta

This uses - as the field separator and checks whether each line starts with >. If it does, only the first field is printed; otherwise the whole line is printed unchanged.
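
As a quick check, the command can be run on the first record from the question. This is only a sketch: the file names sample.fasta and trimmed.fasta are made up for the demonstration.

```shell
# Write the first record from the question to a scratch file (hypothetical name).
cat > sample.fasta <<'EOF'
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
EOF

# Header lines keep only the text before the first "-";
# every other line is printed as-is, so alignment gaps survive.
awk -F- '{if ($0 ~ /^>/) {print $1} else {print $0}}' sample.fasta > trimmed.fasta
cat trimmed.fasta
# >Viridibacillus_arenosi_FSL_R5_0213
# GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
# CTGCTC------AAA
```

Note that the trailing " <unknown description>" disappears as well, because it comes after the first "-". For the same reason, a header whose organism name itself contains a "-" would be cut at that earlier hyphen.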

The command can even be shortened a little bit.
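
The answer does not show the shorter form, so the following is an assumption about what it might look like: the if/else is replaced by awk's pattern-action rules. The function name trim_headers and the sample record are hypothetical.

```shell
# Sketch of a shorter form: on header lines, replace the whole record with its
# first "-"-separated field; the trailing 1 is an always-true pattern whose
# default action prints the (possibly modified) line.
trim_headers() {
  awk -F- '/^>/ { $0 = $1 } 1' "$@"
}

# Demo on a tiny made-up alignment record read from standard input.
printf '>genus_species_strain-LOCUS_1-22-some_protein\nACGT--ACGT\n' | trim_headers
# >genus_species_strain
# ACGT--ACGT
```

With a file argument it would be called as trim_headers in.fasta > out.fasta.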

Answer 2:

This is easily done in the following way:

awk -F"-" '/^>/{print $1; next}1' in.fasta > out.fasta
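
Here /^>/ selects the header lines, print $1 outputs only the text before the first "-", and next skips to the following line; the trailing 1 prints every remaining line unchanged. For comparison, the same header-only rule can be sketched with sed. This alternative is not part of the original answers, and the function name trim_headers_sed and the sample record are made up.

```shell
# Alternative sketch using sed instead of awk (not from the original answers):
# only on lines starting with ">", delete everything from the first "-" to the
# end of the line; sequence lines, including gap characters, are left untouched.
trim_headers_sed() {
  sed '/^>/ s/-.*//' "$@"
}

# Demo on a tiny made-up alignment record read from standard input.
printf '>genus_species_strain-LOCUS_1-22-some_protein\nACGT--ACGT\n' | trim_headers_sed
# >genus_species_strain
# ACGT--ACGT
```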

Source: https://stackoverflow.com/questions/54664261/multifasta-header-trimming

Tags: bash, awk, sequence, fasta