Question
I have a multi-FASTA file and I need to delete part of the header of every record. For example:
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA
I need to delete the part of each header that starts at the first "-", that is, "-BK137_RS04360-22-CBS_domain-containing_protein " and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB ".
I tried
cut -d '-' -f 1 your_file.fasta > new_file.fasta
and
awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta
But this is an alignment file, so the sequence lines contain "-" gap characters too, and both commands cut my sequences at them as well, which of course I don't want.
Answer 1:
cut processes all the lines in the same way, but you need to process only a subset of lines, the ones that start with >. You can try with awk:
awk -F- '{if ($0 ~ /^>/) {print $1} else {print $0}}' in.fasta > out.fasta
This uses - as the field separator and, for every line, checks whether it starts with >. If it does, only the first field is printed; otherwise the whole line is printed unchanged.
The command can even be shortened a little bit.
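One possible shortened form is sketched below. It is an illustration rather than the answerer's own wording: the temporary directory, the variable name demo_dir, and the sample record are placeholders made up for the demo. On header lines the whole record is overwritten with its first "-"-delimited field, and the trailing 1 is an always-true pattern whose default action prints the line.

```shell
# Build a tiny sample alignment in a scratch directory (placeholder names).
demo_dir=$(mktemp -d)
printf '%s\n' \
  '>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>' \
  'CTGCTC------AAA' > "$demo_dir/in.fasta"

# Header lines: replace the record with field 1. Every line is then printed by
# the bare 1, so sequence lines pass through with their gap characters intact.
awk -F- '/^>/{$0=$1}1' "$demo_dir/in.fasta" > "$demo_dir/out.fasta"

cat "$demo_dir/out.fasta"
```

Because everything after the first "-" is dropped, the trailing " <unknown description>" text disappears from the header as well.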
Answer 2:
This is easily done in the following way:
awk -F"-" '/^>/{print $1; next}1' in.fasta > out.fasta
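The same header-only rule can also be expressed with sed, which neither answer mentions. This is a sketch of an alternative, not part of the original thread; the scratch directory, the variable name sed_dir, and the sample record are assumptions for the demo. The address /^>/ restricts the substitution to header lines, and s/-.*// deletes everything from the first "-" to the end of the line.

```shell
# Build a tiny sample alignment in a scratch directory (placeholder names).
sed_dir=$(mktemp -d)
printf '%s\n' \
  '>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>' \
  'CTGCTC------AAA' > "$sed_dir/in.fasta"

# Only lines starting with ">" are edited; sequence lines are printed untouched.
sed '/^>/ s/-.*//' "$sed_dir/in.fasta" > "$sed_dir/out.fasta"

cat "$sed_dir/out.fasta"
```

Like the awk versions, this assumes the part of the header to keep never contains a "-" itself.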
Source: https://stackoverflow.com/questions/54664261/multifasta-header-trimming