Multifasta header trimming

Submitted by 烈酒焚心 on 2019-12-13 09:18:51

Question


I have a multifasta file and I need to delete part of the header of every record. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA

I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein " and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB ".

I tried

 cut -d '-' -f 1 your_file.fasta > new_file.fasta

and

 awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

But this is an alignment file, and both commands also cut the sequence lines at the "-" gap characters, which of course I don't want.


Answer 1:


cut processes every line in the same way, but you need to change only a subset of lines: the headers, which start with >. You can do that with awk:

awk -F- '{if ($0 ~ /^>/) {print $1} else {print $0}}' in.fasta > out.fasta

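This uses - as the field separator and checks whether each line starts with >. If it does, only the first field (the text before the first -) is printed; otherwise the whole line is printed unchanged, so the gap characters in the sequences are preserved.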
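
A quick way to check this behaviour is to run the command on a tiny test file first. The sketch below assumes made-up file names (demo.fasta, demo_out.fasta) and shortened versions of the headers from the question; none of these names come from the original post.

```shell
# Build a two-record test alignment (hypothetical file name, shortened headers).
printf '%s\n' \
  '>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS <unknown description>' \
  'CTGCTC------AAA' \
  '>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-AcuB <unknown description>' \
  'CTGCTC------AAA' > demo.fasta

# Header lines keep only the text before the first "-";
# all other lines are printed untouched, gaps included.
awk -F- '{if ($0 ~ /^>/) {print $1} else {print $0}}' demo.fasta > demo_out.fasta

# Show the result for inspection.
cat demo_out.fasta
```

Note that the header is cut at the first -, so the trailing " <unknown description>" text is dropped as well, and a name that itself contains a - would be truncated at that point.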

The command can be shortened a little by using awk's pattern-action syntax instead of an explicit if/else.




Answer 2:


This is easily done in the following way:

awk -F"-" '/^>/{print $1; next}1' in.fasta > out.fasta
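
The one-liner above may be easier to follow when spread over several lines with comments. The sketch below is the same program on a small test input; the file names (short_demo.fasta, short_demo_out.fasta) and the shortened header are assumptions made for illustration.

```shell
# Small test input (hypothetical file name and shortened header).
printf '%s\n' \
  '>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-AcuB <unknown description>' \
  'GCGAATGAAG' \
  'CTGCTC------AAA' > short_demo.fasta

awk -F"-" '
  /^>/ { print $1; next }   # header: print the text before the first "-", then skip to the next line
  1                         # every other line: always-true pattern, default action prints it as-is
' short_demo.fasta > short_demo_out.fasta

# List the trimmed headers to confirm the result.
grep '^>' short_demo_out.fasta
```

The next statement matters here: without it, the header line would also match the trailing 1 rule and be printed a second time in its original form.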


Source: https://stackoverflow.com/questions/54664261/multifasta-header-trimming
