问题
I have two files
The gene list file looks like this
LOC_Os06g12230.1
Pavir.Ab03005
Pavir.J14065
ChrUn.fgenesh
Sevir.1G325700
LOC_Os02g51280.1
Bradi3g59320
Brast04G017400
Fasta sequence file looks like this
>LOC_Os03g57190.1 pacid=33130570 polypeptide=LOC_Os03g57190.1 locus=LOC_Os03g57190 ID=LOC_Os03g57190.1.MSUv7.0 annot-version=v7.0
ATGGAGGCGGCGGTGGGGGACGGGGAAGGCGGTGGCGGCGGCGGCGGGCGGGGGAAGCGTGGGCGGGGAGGAGGAGGAGG
GGAGATGGTGGAGGCGGTGTGGGGGCAGACGGGGAGTACGGCGTCGCGGATCTACAGGGTGAGGGCGACGGGGGGGAAGG
ACAGGCACAGCAAGGTGTACACGGCGAAGGGAATCCGCGACCGCCGCGTCCGCCTCTCCGTCGCCACCGCCATCCAGTTC
TACGACCTCCAGGACCGCCTCGGCTTCGACCAGCCGAGCAAGGCCATCGAGTGG
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
>LOC_Os05g43760.1 pacid=33158388 polypeptide=LOC_Os05g43760.1 locus=LOC_Os05g43760 ID=LOC_Os05g43760.1.MSUv7.0 annot-version=v7.0
ATGACAAGCAATAACAGCACGAATGAGGAGCTCGGCGGCGGCGGCAGGAAGGCGGCCGACAAGCCGAGCGGCGGCGGCGG
CGCCGCCGCCGCCGTGGCGAGCTCGCGGCACTGGTCGGCGTCGACGGAGTCGCGGATCGTGCGCGTGTCGAGGGTGTTCG
GCGGCAAGGACCGTCACAGCAAGGTGAGGACGGTGAAGGGGCTCCGCGACCGGCGGGTGCGGCTGTCGGTGCCGACGGCG
ATCCAGCTCTACGACCTGCAGGACCGGCTGGGGCTCAGCCAGCCGAGCAAGGTGGTCGACT
if the gene name and header line matches then, sequence has to be pulled out into new file
new file should contain
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
I had tried like this
grep -f genelist.txt -A3 fastafile.txt >> newfasta.txt
but different fasta sequences have different lengths,
After pattern match, i want to pick till next '>' symbol appears
回答1:
Could you please try following.
awk '
FNR==NR{
a[$0]
next
}
/^>/{
found=""
}
($2 in a){
found=1
}
found
' Input_file_gene FS="[> ]" Input_file
Output will be as follows.
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
回答2:
The easiest way to process FASTA files with awk, is to build up a variable called name
and a variable called seq
. Every time you read a full sequence, you can process it. Remark that, for the best way of processing, the sequence, should be stored as a continues string, and not contain any newlines or whitespaces due. A generic awk for processing fasta, looks like this:
awk '/^>/ && seq { process_sequence_here }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { process_sequence_here }' file.fasta
You can make this a bit easier by introducing a couple of functions:
awk '/^>/ && seq { process_sequence(name_seq) }
/^>/{name=substr($0,2); seq=""; next}
{seq = seq $0 }
END { process_sequence(name,seq) }
BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
function print_sequence(name,seq) {
gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
}
function process_sequence(name,seq) { ... }
' file.fasta
In case of the OP, the above would read:
awk '(NR==FNR) { a[$0]; next }
/^>/ && seq { process_sequence(name_seq) }
/^>/{name=substr($0,2); seq=""; next}
{seq = seq $0 }
END { process_sequence(name,seq) }
BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
function print_sequence(name,seq) {
gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
}
function process_sequence(name,seq) {
$0=name; if ($1 in a) print_sequence (name,seq)
}
' list.txt file.fasta
When you process fasta-files with awk, you can always concider to use bioawk. It has all the bells-and-whistles from POSIX awk, but is augmented to easily process FASTA files:
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.
回答3:
This might work for you (GNU sed):
sed -n '/^LOC.*/s##/>&/bb#p' file1 |
sed -n -e ':a' -f - -e 'b;:b;p;n;/^>/ba;bb' file2
Use file1
to grep for the keys in file2
. If a key matches, print that line and all following lines until a new key. Repeat.
Alternative using GNU parallel and grep:
parallel -k --pipe -N1 --recstart '>' --cat <file2 \
'grep -F LOC file1 | grep -qFf - -m1 {} && cat {}'
Another way using csplit, parallel and grep:
csplit -z file2 '/^>/' '{*}'
parallel -k 'grep -wqFf file1 -m1 {} && cat {}' ::: xx?? > outFile
rm xx??
回答4:
There are many ways to do this; a good 'bioinformatics' solution is to use seqtk: https://github.com/lh3/seqtk
seqtk subseq sequences.fa list.txt
来源:https://stackoverflow.com/questions/60769591/how-to-pick-multiple-fasta-sequences-from-a-genes-list