Remove duplicated FASTA sequences (bash or Biopython method)

Question

Hello, I have a FASTA file such as:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

From this file I would like to remove the duplicated sequences and get:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

As you can see, the content after the > name line is the same for sequence1_CP, sequence2 and sequence3, so I want to keep only one of the three. If one of them has _CP in its name, I want to keep that one. If none of them has _CP, it does not matter which one I keep.

So for the first group of duplicates (sequence1_CP, sequence2 and sequence3) I keep sequence1_CP.
For the second group (sequence4_CP and sequence5) I keep sequence4_CP.
And for the third group (sequence6 and sequence7) I keep the first one, sequence6.

Does someone have an idea using Biopython or a bash method? Thanks a lot.

Answer 1:

In a FASTA file, identical sequences are not necessarily wrapped at the same position, so it is essential to join the lines of each sequence before comparing. Furthermore, sequences may be written in upper or lower case but are case-insensitive, so they should be compared in a single case.

The following awk does exactly that:

$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"}
       {seq="";for(i=2;i<=NF;++i) seq=seq toupper($i)}
       !(seq in a){print; a[seq]}' file.fasta

There is also a version of awk that has been extended to process FASTA files:

$ bioawk -c fastx '!(seq in a){print; a[seq]}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is POSIX-compatible.
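
For readers who prefer Python, the join-then-compare idea from this answer can be sketched roughly as follows. This is not part of the original answer: the function names read_fasta and drop_duplicates and the toy records are hypothetical, and the sketch keeps the first record of each sequence without any special handling of _CP names.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The lines of each sequence are joined and upper-cased, so records that are
    wrapped differently or differ only in case compare equal.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:  # flush the last record
        yield header, "".join(chunks).upper()


def drop_duplicates(lines):
    """Return the first (header, sequence) pair seen for each distinct sequence."""
    seen = set()
    unique = []
    for header, seq in read_fasta(lines):
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


# Toy input (hypothetical): >a and >b hold the same sequence, wrapped and cased differently
example = """\
>a
MQCKSG
TNNV
>b
mqcksgtn
nv
>c
MLRH
""".splitlines()

unique = drop_duplicates(example)
for header, seq in unique:
    print(header, seq)
```

With the toy input, only the records >a and >c are kept, because >b becomes identical to >a once its lines are joined and upper-cased.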

Answer 2:

You could use this awk one-liner:

$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file

Output:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

The above relies on the observation that, as in the sample, the _CP records come before the others within each group of duplicates. If that is not the case, use the following. It stores the first instance of each sequence and overwrites it if a _CP record with the same sequence is found:

$ awk 'BEGIN{FS="\n";RS=""}{if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file

Or in pretty-print:

$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    # store the record if its sequence is new, or if its name ends in _CP
    if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)
        seen[$2,$3]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file

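The output order is whatever awk's for(i in seen) traversal yields, i.e. it is unspecified and may appear random.

Update: if both of @kvantour's comments apply in this case, use this awk. It joins all sequence lines of a record into one key before comparing (wrap toupper() around $i to ignore case as well):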

$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    for(i=2;i<=NF;i++)
        k=(i==2?"":k) $i
    if(!(k in seen)||$1~/^[^ ]+_CP /)
        seen[k]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file

Output now:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
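
Because the awk versions above print from an array in unspecified order, a Python sketch may help when the original record order matters. It is not part of the answer: the names has_cp and pick_representatives are hypothetical, the check for _CP looks only at the first whitespace-delimited token of the header (an assumption about what "name" means), and the sequences are assumed to be already joined into single strings.

```python
def has_cp(header):
    """True if the record name (first token of the header) contains _CP."""
    parts = header.split()
    return bool(parts) and "_CP" in parts[0]


def pick_representatives(records):
    """Keep one header per distinct sequence, preferring _CP names.

    `records` is an iterable of (header, sequence) pairs. The result follows
    the order in which each distinct sequence was first seen.
    """
    chosen = {}  # sequence -> header; dicts keep insertion order (Python 3.7+)
    for header, seq in records:
        if seq not in chosen:
            chosen[seq] = header
        elif has_cp(header) and not has_cp(chosen[seq]):
            # Replacing the value keeps the key's original position in the dict
            chosen[seq] = header
    return [(header, seq) for seq, header in chosen.items()]


# Hypothetical short sequences standing in for the real ones
records = [
    (">sequence2 [virus]", "AAA"),
    (">sequence1_CP [seq  virus]", "AAA"),
    (">sequence6", "CCC"),
    (">sequence7", "CCC"),
]
result = pick_representatives(records)
for header, seq in result:
    print(header)
    print(seq)
```

Here the _CP record wins even though it comes second in its group, and a group without any _CP name keeps its first record.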

Answer 3:

Or a pure-bash solution (following the same logic as the separate Perl solution):

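#! /bin/bash
# Reads FASTA records from standard input; assumes each record is a header line,
# exactly two sequence lines and a blank separator line, as in the sample.

declare -A p
    # Read inbound data into associative array 'p' (key: sequence, value: header)
while read -r id ; do
        read -r s1 ; read -r s2 ; read -r s3    # s3 consumes the blank separator line
        key="$s1:$s2"
        prev=${p[$key]}
        # Keep the first header seen, or replace it when this header contains _CP
        if [[ -z "$prev" || "$id" == *_CP* ]] ; then p[$key]=$id  ; fi
done
    # Print all data (associative array order is unspecified)
for k in "${!p[@]}" ; do
        echo -e "${p[$k]}\n${k/:/\\n}\n"
done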

Answer 4:

Here's a Python program that will give you the results you are looking for:

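import fileinput

seq = ""
name = None
seqnames = {}  # joined sequence -> header line to keep


def store(name, seq):
    # Keep the first header seen for a sequence; a header containing _CP overrides it
    if seq and (seq not in seqnames or "_CP" in name):
        seqnames[seq] = name


# Reads the files named on the command line, or standard input if none are given
for line in fileinput.input():
    line = line.rstrip()
    if line.startswith(">"):
        store(name, seq)
        seq = ""
        name = line
        continue
    seq += line
store(name, seq)  # the last record has no following header, so store it here

for seq, name in seqnames.items():
    print(name)
    print(seq)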

Answer 5:

Or with Perl. The code below is assumed to be saved as m.pl; it can also be wrapped in a bash script.

Hopefully, code will help find medicines, and not develop new viruses :-)

perl m.pl < input-file

#! /usr/bin/perl
use strict ;

my %to_id ;
local $/ = "\n\n";
while ( <> ) {
  chomp ;
  my ($id, $s1, $s2 ) = split("\n") ;
  my $key = "$s1\n$s2" ;
  my $prev_id = $to_id{$key} ;
  $to_id{$key} = $id if !defined($prev_id) || $id =~ /_CP/ ;
} ;
print "$to_id{$_}\n$_\n\n" foreach(keys(%to_id)) ;

The expected output order is not clear. The Perl code prints directly from the hash, so the order is arbitrary; this can be customized if needed.

Answer 6:

Here's a Biopython answer. Be aware that your example contains only two unique sequences: sequences 6 and 7 merely have one more character on their first line, but are the same protein sequence as sequence 1.

from Bio import SeqIO

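seen = set()  # sequences already encountered
records = []
# examples are in sequences.fasta
# keeps the first record of each sequence (no special handling of _CP names)
for record in SeqIO.parse("sequences.fasta", "fasta"):
    if str(record.seq) not in seen:
        seen.add(str(record.seq))
        records.append(record)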

# printing to console
for record in records:
    print(record.name)
    print(record.seq)

# writing to a fasta file
SeqIO.write(records, "unique_sequences.fasta", "fasta")

For more info you can consult the Biopython cookbook.
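
The remark at the start of this answer, that sequences 6 and 7 are the same protein as sequence 1, can be checked with plain Python by joining the wrapped lines from the question. This small check is an illustration added here, not part of the original answer; the variable names are arbitrary.

```python
# Sequence lines copied from sequence1_CP and sequence6 in the question
seq1_lines = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL",
    "DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]
seq6_lines = [
    "MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD",
    "ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE",
]

# Comparing line by line treats the two records as different...
same_line_by_line = seq1_lines == seq6_lines
# ...but joining the lines shows they encode the same sequence
same_when_joined = "".join(seq1_lines) == "".join(seq6_lines)

print(same_line_by_line, same_when_joined)
```

This is why approaches that compare the raw lines report three unique sequences for the sample, while approaches that join the lines first report two.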

Source: https://stackoverflow.com/questions/58862586/remove-duplicated-fasta-sequence-bash-of-biopython-method

Tags

bash

biopython

fasta