How to pick multiple fasta sequences from a genes list

我与影子孤独终老i posted on 2020-05-09 07:55:08

Question


I have two files.

The gene list file looks like this:

LOC_Os06g12230.1
Pavir.Ab03005
Pavir.J14065
ChrUn.fgenesh
Sevir.1G325700
LOC_Os02g51280.1
Bradi3g59320
Brast04G017400

The FASTA sequence file looks like this:

>LOC_Os03g57190.1 pacid=33130570 polypeptide=LOC_Os03g57190.1 locus=LOC_Os03g57190 ID=LOC_Os03g57190.1.MSUv7.0 annot-version=v7.0
ATGGAGGCGGCGGTGGGGGACGGGGAAGGCGGTGGCGGCGGCGGCGGGCGGGGGAAGCGTGGGCGGGGAGGAGGAGGAGG
GGAGATGGTGGAGGCGGTGTGGGGGCAGACGGGGAGTACGGCGTCGCGGATCTACAGGGTGAGGGCGACGGGGGGGAAGG
ACAGGCACAGCAAGGTGTACACGGCGAAGGGAATCCGCGACCGCCGCGTCCGCCTCTCCGTCGCCACCGCCATCCAGTTC
TACGACCTCCAGGACCGCCTCGGCTTCGACCAGCCGAGCAAGGCCATCGAGTGG
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
>LOC_Os05g43760.1 pacid=33158388 polypeptide=LOC_Os05g43760.1 locus=LOC_Os05g43760 ID=LOC_Os05g43760.1.MSUv7.0 annot-version=v7.0
ATGACAAGCAATAACAGCACGAATGAGGAGCTCGGCGGCGGCGGCAGGAAGGCGGCCGACAAGCCGAGCGGCGGCGGCGG
CGCCGCCGCCGCCGTGGCGAGCTCGCGGCACTGGTCGGCGTCGACGGAGTCGCGGATCGTGCGCGTGTCGAGGGTGTTCG
GCGGCAAGGACCGTCACAGCAAGGTGAGGACGGTGAAGGGGCTCCGCGACCGGCGGGTGCGGCTGTCGGTGCCGACGGCG
ATCCAGCTCTACGACCTGCAGGACCGGCTGGGGCTCAGCCAGCCGAGCAAGGTGGTCGACT

If a gene name matches a header line, the sequence has to be pulled out into a new file.

The new file should contain:

>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA

I tried this:

grep -f genelist.txt -A3 fastafile.txt >> newfasta.txt

but different FASTA sequences have different lengths, so a fixed number of trailing lines (-A3) does not work.

After a pattern match, I want to pick every line until the next '>' symbol appears.


Answer 1:


Could you please try the following.

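awk '
FNR==NR{          # first file: store every gene ID as an array key
  a[$0]
  next
}
/^>/{             # header line: keep the record only if its ID is listed
  found=($2 in a)
}
found             # print header and sequence lines of a kept record
' Input_file_gene FS="[> ]" Input_file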

Output will be as follows.

>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
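
A minimal, self-contained sketch of the same two-file awk idea, run on made-up records so it can be tried without the real data. The file names genes.txt and seqs.fa and the IDs geneA.1, geneB.1, geneC.1 are hypothetical; only the awk logic mirrors the answer.

```shell
#!/bin/sh
# Hypothetical miniature inputs, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'geneB.1' 'geneX' > "$tmp/genes.txt"
cat > "$tmp/seqs.fa" <<'EOF'
>geneA.1 pacid=1
ACGT
ACG
>geneB.1 pacid=2
TTGA
CC
>geneC.1 pacid=3
GGGG
EOF

# First file: remember each ID. Second file: on every header decide
# whether the record is wanted, then print lines while the flag is set.
# FS='[> ]' makes the ID the second field of a header line.
result=$(awk '
FNR==NR { want[$0]; next }
/^>/    { keep = ($2 in want) }
keep
' "$tmp/genes.txt" FS='[> ]' "$tmp/seqs.fa")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the geneB.1 record is printed, with both of its sequence lines, regardless of how many lines a record spans.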



Answer 2:


The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time you have read a full sequence, you can process it. Note that, for the easiest processing, the sequence should be stored as one continuous string that does not contain any newlines or whitespace. A generic awk for processing FASTA looks like this:

awk '/^>/ && seq { process_sequence_here }
     /^>/{name=$0; seq=""; next}
     {seq = seq $0 }
     END { process_sequence_here }' file.fasta
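
As one possible way to fill in the placeholder, the sketch below prints each record's ID and total sequence length. The input records seq1 and seq2 are invented for illustration, and the "print name and length" action is only an example of a processing step, not something the question asks for.

```shell
#!/bin/sh
# Hypothetical two-record FASTA; seq1 is split over two lines.
fasta=$(printf '%s\n' '>seq1 demo' 'ACGT' 'AC' '>seq2' 'GGG')

lengths=$(printf '%s\n' "$fasta" | awk '
/^>/ && seq { print name, length(seq) }              # a full record has been read
/^>/        { name = substr($1, 2); seq = ""; next } # start a new record
            { seq = seq $0 }                         # join the sequence lines
END         { if (seq) print name, length(seq) }     # flush the last record
')

printf '%s\n' "$lengths"
```

Because the lines are concatenated before processing, the length of seq1 is reported for the whole sequence (6), not per line.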

You can make this a bit easier by introducing a couple of functions:

awk '/^>/ && seq { process_sequence(name,seq) }
     /^>/{name=substr($0,2); seq=""; next}
     {seq = seq $0 }
     END { process_sequence(name,seq) }

     BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
     function print_sequence(name,seq) {
         gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
     }
     function process_sequence(name,seq) { ... }
    ' file.fasta

In case of the OP, the above would read:

awk '(NR==FNR) { a[$0]; next }
     /^>/ && seq { process_sequence(name,seq) }
     /^>/{name=substr($0,2); seq=""; next}
     {seq = seq $0 }
     END { process_sequence(name,seq) }

     BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
     function print_sequence(name,seq) {
         gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
     }
     function process_sequence(name,seq,   f) {
         # split a copy of the header: assigning to $0 here would
         # stop the following /^>/ rule from seeing the new header
         split(name,f," "); if (f[1] in a) print_sequence(name,seq)
     }
    ' list.txt file.fasta
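
The re-wrapping step in print_sequence can be looked at in isolation. The sketch below uses a width of 4 instead of 80 purely to keep the example short; the variable name w and the input string are made up.

```shell
#!/bin/sh
# Build a regex of four dots ("....") the same way seq_ere is built,
# then append a newline (ORS) after every full chunk of four characters.
wrapped=$(printf 'ACGTACGTAC\n' | awk '
BEGIN { w = sprintf("%4s", ""); gsub(" ", ".", w) }
      { gsub(w, "&" ORS); print }
')

printf '%s\n' "$wrapped"
```

This prints ACGT, ACGT and AC on three lines. One thing to be aware of: when the sequence length is an exact multiple of the width, the last chunk also gets an ORS appended, so print emits an extra empty line after the record.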

When you process FASTA files with awk, you can always consider using bioawk. It has all the bells and whistles of POSIX awk, but is augmented to easily process FASTA files.

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.




Answer 3:


This might work for you (GNU sed):

sed -n '/^LOC.*/s##/>&/bb#p' file1 |
sed -n -e ':a' -f - -e 'b;:b;p;n;/^>/ba;bb' file2

The first sed turns each LOC key in file1 into a sed command, and the second sed runs those commands against file2. If a header matches a key, print that line and all following lines until the next header. Repeat.
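
To see what the first sed hands to the second one, the generated script can be printed on its own. This is a small sketch using three IDs from the question's gene list; the variable names keys and script are only for illustration.

```shell
#!/bin/sh
# Three IDs taken from the gene list in the question.
keys=$(printf '%s\n' 'LOC_Os06g12230.1' 'Pavir.Ab03005' 'LOC_Os02g51280.1')

# Same first-stage command as in the answer: every line starting with
# LOC becomes "/>KEY/bb", i.e. "if the line matches >KEY, branch to b".
script=$(printf '%s\n' "$keys" | sed -n '/^LOC.*/s##/>&/bb#p')

printf '%s\n' "$script"
```

The output is one address-plus-branch command per LOC key; the Pavir line is dropped because it does not start with LOC. Since each key is used as an unanchored regular expression, the "." matches any character and a key also matches longer IDs that begin with it.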

Alternative using GNU parallel and grep:

parallel -k --pipe -N1 --recstart '>' --cat <file2 \
  'grep -F LOC file1 | grep -qFf - -m1 {} && cat {}'

Another way using csplit, parallel and grep:

csplit -z file2 '/^>/' '{*}'
parallel -k 'grep -wqFf file1 -m1 {} && cat {}' ::: xx?? > outFile
rm xx??



Answer 4:


There are many ways to do this; a good 'bioinformatics' solution is to use seqtk: https://github.com/lh3/seqtk

seqtk subseq sequences.fa list.txt


Source: https://stackoverflow.com/questions/60769591/how-to-pick-multiple-fasta-sequences-from-a-genes-list
