Question
Hello, I have a FASTA file such as:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
In this file I would like to remove the duplicated sequences and get:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
Here, as you can see, the content after the > name is the same for sequence1_CP, sequence2 and sequence3, so I want to keep only one of the three. But if one of the three sequences has _CP in its name, then I want to keep that one specifically. If none of them has _CP in its name, it does not matter which one I keep.
- For the first group of duplicates (sequence1_CP, sequence2 and sequence3), I keep sequence1_CP.
- For the second group of duplicates (sequence4_CP and sequence5), I keep sequence4_CP.
- For the third group of duplicates (sequence6 and sequence7), I keep the first one, sequence6.
Does someone have an idea using Biopython or a bash method? Thanks a lot.
Answer 1:
In a FASTA file, identical sequences are not necessarily wrapped at the same position, so it is essential to merge the lines of each sequence before comparing. Furthermore, sequences can be written in upper or lower case, but should be compared case-insensitively.
The following awk does exactly that (with RS="" it assumes the records are separated by blank lines):
$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"}
{seq="";for(i=2;i<=NF;++i) seq=seq toupper($i)}
!(seq in a){print; a[seq]}' file.fasta
There is also a version of awk that has been extended to process FASTA files. In its fastx mode the parsed fields are available as $name, $seq and $comment, so the record has to be printed back in FASTA form explicitly:
$ bioawk -c fastx '!($seq in a){print ">"$name" "$comment; print $seq; a[$seq]}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
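To illustrate why the lines have to be merged first, here is a minimal Python sketch. The helper name sequence_key is made up for this example and is not part of any library. It joins the wrapped lines of one record and upper-cases them, so that sequence1_CP and sequence6 from the question, which are wrapped at different columns, compare equal:

```python
def sequence_key(lines):
    """Join the wrapped sequence lines of one record and normalise the case."""
    return "".join(line.strip() for line in lines).upper()


# sequence1_CP from the question
seq1 = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL",
    "DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]
# sequence6 from the question: same residues, wrapped one column later
seq6 = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD",
    "ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]

same_when_merged = sequence_key(seq1) == sequence_key(seq6)  # compares full sequences
same_line_by_line = seq1 == seq6                             # compares raw lines
print(same_when_merged, same_line_by_line)
```

A line-by-line comparison treats the two records as different, while the merged comparison treats them as duplicates.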
Answer 2:
You could use this awk one-liner:
$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file
Output:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
The above relies on the observation that the sequences are ordered so that the _CP entries come before the others, as in the sample. If that is not the case, use the following. It stores the first instance of each sequence, which is overwritten if a _CP sequence is found:
$ awk 'BEGIN{FS="\n";RS=""}{if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file
Or in pretty-print:
$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    # store the first record per sequence; a header whose ID ends in _CP replaces it
    if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)
        seen[$2,$3]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
The output order is whatever awk's for (i in seen) loop yields, i.e. it appears random.
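If the input order should be preserved, the same "store the first instance, let a _CP record overwrite it" idea can be sketched in Python, where dictionaries keep insertion order. This is only a sketch under some assumptions: the function names parse_fasta and dedupe_prefer_cp are hypothetical, the ID is taken to be the first whitespace-delimited word of the header, a _CP name is assumed to end in _CP, and the toy sequences below are shortened for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs; wrapped lines are merged and upper-cased."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:  # do not lose the last record
        yield header, "".join(chunks).upper()


def dedupe_prefer_cp(records):
    """Keep one header per sequence, preferring an ID that ends in _CP."""
    kept = {}  # sequence -> (header, is_cp); dicts keep insertion order (Python 3.7+)
    for header, seq in records:
        parts = header[1:].split()
        is_cp = bool(parts) and parts[0].endswith("_CP")
        # store the first instance; overwrite only when a _CP record beats a non-_CP one
        if seq not in kept or (is_cp and not kept[seq][1]):
            kept[seq] = (header, is_cp)
    return [(header, seq) for seq, (header, _) in kept.items()]


# Toy input (made-up, shortened sequences): the _CP record comes second
# and is wrapped differently from its duplicate.
example = """\
>sequence2 [virus]
MQCKSG
TNNVF
>sequence1_CP [seq virus]
MQCKSGT
NNVF
>sequence5 hypothetical protein
MLRHSC
"""

result = dedupe_prefer_cp(parse_fasta(example))
for header, seq in result:
    print(header)
    print(seq)
```

Overwriting the value of an existing key does not move it, so each group stays at the position where its first member appeared.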
Update: If both of @kvantour's comments apply in this case, use this awk:
$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    # build the key k from all sequence lines of the record
    for(i=2;i<=NF;i++)
        k=(i==2?"":k) $i
    if(!(k in seen)||$1~/^[^ ]+_CP /)
        seen[k]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
Output now:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
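Answer 3:
Or a pure-bash solution (following the same logic as the separate perl solution):
#!/bin/bash
declare -A p
# Read inbound data (stdin) into associative array 'p', keyed by the two sequence lines.
# Assumes each record is a header, exactly two sequence lines and a blank separator line.
while read -r id ; do
    read -r s1 ; read -r s2 ; read -r s3
    key="$s1:$s2"
    prev=${p[$key]}
    # Keep the first header seen, but let a header containing _CP replace it
    if [[ -z "$prev" || "$id" == *_CP* ]] ; then p[$key]=$id ; fi
done
# Print all data
for k in "${!p[@]}" ; do
    echo -e "${p[$k]}\n${k/:/\\n}\n"
done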
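Answer 4:
Here's a Python program that will give you the results you are looking for:
import fileinput
import re

seq = ""
header = None
seqnames = {}  # merged sequence -> header chosen to represent it

def store(header, seq):
    # Keep the first header for a sequence; a header containing _CP replaces it
    if seq not in seqnames or re.search("_CP", header):
        seqnames[seq] = header

for line in fileinput.input():
    line = line.rstrip()
    if re.search("^>", line):
        if seq:
            store(header, seq)
        seq = ""
        header = line
        continue
    seq += line
# The last record is not followed by another header, so store it here
if seq:
    store(header, seq)

# Each sequence is printed on a single line
for k, v in seqnames.items():
    print(v)
    print(k)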
Answer 5:
Or with perl. Assuming the code is in m.pl, it can be wrapped into a bash script.
Hopefully, the code will help find medicines, and not develop new viruses :-)
perl m.pl < input-file
#!/usr/bin/perl
use strict ;
my %to_id ;
local $/ = "\n\n";
while ( <> ) {
chomp ;
my ($id, $s1, $s2 ) = split("\n") ;
my $key = "$s1\n$s2" ;
my $prev_id = $to_id{$key} ;
$to_id{$key} = $id if !defined($prev_id) || $id =~ /_CP/ ;
} ;
print "$to_id{$_}\n$_\n\n" foreach(keys(%to_id)) ;
It's not clear what the expected order is. The Perl code prints directly from the hash, so the order is arbitrary. This can be customized if needed.
Answer 6:
Here's a Biopython answer. Be aware that you only have two unique sequences in your example (sequences 6 and 7 just have one more character on the first line, but are the same protein sequence as sequence 1).
from Bio import SeqIO

seen = []
records = []
# examples are in sequences.fasta
for record in SeqIO.parse("sequences.fasta", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# printing to console
for record in records:
    print(record.name)
    print(record.seq)

# writing to a fasta file
SeqIO.write(records, "unique_sequences.fasta", "fasta")
For more info, you can try the Biopython cookbook.
Source: https://stackoverflow.com/questions/58862586/remove-duplicated-fasta-sequence-bash-of-biopython-method