Question
I have a multi-FASTA file that looks like this:
>NP_001002156.1
MKTAVDRRKLDLLYSRYKDPQDENKIGVDGIQQFCDDLMLDPASVSVLIVAWKFRAATQCEFSRQEFLDG
MTDLGCDSPEKLKSLLPRLEQELKDSGKFRDFYRFTFSFAKSPGQKCLDLEMAVAYWNLILSGRFKFLGL
WNTFLLEHHKKSIPKDTWNLLLDFGNMIADDMSNYAEEGAWPVLIDDFVEFARPIVTAENLQTL
>NP_957070.2
MAKDAGLKETNGEIKLFINQSPGKAAGVLQLLTVHPASITTVKQILPKTLTVTGAHVLPHMVVSTPQRPT
IPVLLTSPHTPTAQTQQESSPWSSGHCRRADKSGKGLRHFSMKVCEKVQKKVVTSYNEVADELVQEFSSA
DHSSISPNDAVSSCHVYDQKNIRRRVYDALNVLMAMNIISKDKKEIKWIGFPTNSAQECEDLKAERQRRQ
ERIKQKQSQLQELIVQQIAFKNLVQRNREVEQQSKRSPSANTIIQLPFIIINTSKKTIIDCSISNDKFEY
LFNFDSMFEIHDDVEVLKRLGLALGLESGRCSAEQMKIATSLVSKALQPYVTEMAQGSVNQPMDFSHVAA
ERRASSSTSSRVETPTSLMEEDEEDEEEDYEEEDD
>NP_123456.1
MALLLLLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
...
There is a useful Python script for motif searches in a multi-FASTA file (https://www.biostars.org/p/14305/), but with the pattern "[KHR]{3}" it returns only a list of motifs, along with many empty results:
>NP_001002156.1
:['RRK']
>NP_001002156.1
:[]
>NP_001002156.1
:['HHK']
>NP_957070.2
:[]
>NP_957070.2
:['RRR']
...
and an overlapping motif (HKK) in the same sequence was missed.
I also found another Python script:
#coding:utf-8
import re
pattern = "[KHR]{3}"
with open('seq.fasta') as fh:
    fh.readline()
    seq = ""
    for line in fh:
        seq += line.strip()
rgx = re.compile(pattern)
result = rgx.search(seq)
patternfound = result.group()
span = result.span()
leftpos = span[0]-10
if leftpos < 0:
    leftpos = 0
print(seq[leftpos:span[0]].lower() + patternfound + seq[span[1]:span[1]+10].lower())
It returns only the first matched motif, shown in context (the 10 amino acids before the match and the 10 after it), and only for the first sequence in the file. For the first sequence, NP_001002156.1, the script returns:
mktavdRRKldllysrykd
but the output has no header (">NP_001002156.1"), and the other two motifs with their context are omitted:
glwntfllehHHKksipkdtwnl
lwntfllehhHKKsipkdtwnll
I would like a script that returns every matched motif, with its position and its context, for each sequence in the multi-FASTA file, and presents the results as follows:
>NP_001002156.1_matchnumber_1_(7~9)
mktavdrRRKldllysrykd
>NP_001002156.1_matchnumber_2_(148~150)
glwntfllehHHKksipkdtwnl
>NP_001002156.1_matchnumber_3_(149~151)
lwntfllehhHKKsipkdtwnll
>NP_957070.2_matchnumber_1_(163~165)
chvydqknirRRRvydalnvlma
>NP_123456.1
no match found
Note: The position given is that of the matched pattern, not that of the context.
Could anyone help me? Thanks in advance.
Answer 1:
The "motif" here is any three-long combination of [HKR] characters; motifs may overlap.
The overlap is handled by using a "lookahead" in the regex; see details below. Neither of the scripts quoted or linked in the question seems to handle that, and I don't see how they would catch overlapping motifs.
use warnings;
use strict;
use feature 'say';

my $file = shift || die "Usage: $0 fasta-file\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my ($seq, $seq_name);
while (<$fh>) {
    chomp;
    if (/^>(.*)/) {
        # Process the previous assembled sequence
        if ($seq) {
            proc_seq($seq_name, $seq);
            $seq = '';
        }
        $seq_name = $1;
        next;
    }
    $seq .= $_;
}
# Process the last one
proc_seq($seq_name, $seq);

sub proc_seq {
    my ($seq_name, $seq, $multiline) = @_;
    # Build output in the loop, as motifs are found. By default, print all
    # output for one seq_name in one line. To print each motif on its own
    # line instead, invoke this sub with a true third argument (1 will do).
    my $output = ">$seq_name";
    my $cnt = 0;
    # Consume one letter and look ahead for two more, so that
    # overlapping motifs are found as well
    while ($seq =~ /([HKR])(?=([HKR]{2}))/g) {
        ++$cnt;
        my $motif = $1 . $2;
        # Only one letter was consumed, so pos() is the 1-based
        # position of the first letter of the motif
        my $pos = pos($seq);
        my $pre_context = ($pos >= 11)
            ? substr($seq, $pos-11, 10)
            : substr($seq, 0, $pos-1);
        my $post_context = substr $seq, $pos+2, 10;
        $output .= " n$cnt($pos~" . ($pos+2) . ") ";
        $output .= "\n" if $multiline;
        $output .= lc($pre_context) . $motif . lc($post_context);
    }
    say ($cnt > 0 ? $output : $output . ' no match found');
}
Note on the regex: we need a lookahead for the second and third character in order to be able to catch the overlapping motifs as well.
An example. There is HHKK in the first sequence, with overlapping motifs HHK and HKK. If the regex matches HHK using /[HKR]{3}/ then after that the position of the regex engine in the string is after the first K, as it "consumed" HHK. So all it can see next is just one K and so there is no [HKR]{3} to match next, and it thus misses the next motif.
So, instead, I match only one letter and do a "lookahead" for the next two. Then after matching H (and "seeing" that there is indeed HK following) only one letter is consumed and the engine got past only that first H, and it is positioned before the second H for the next match. Now it will be able to next match the HKK, in the same manner (and so it can keep matching even multiply overlapping motifs).
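The difference between the two patterns can be checked directly. Below is a minimal sketch in Python (the language used in the question), relying on the standard re module, whose finditer also reports only non-overlapping matches. The function names consuming_matches and overlapping_matches are made up for this illustration and are not part of the script above.

```python
import re

def consuming_matches(seq):
    """Motifs found by a pattern that consumes all three letters."""
    return [(m.start() + 1, m.group(0))
            for m in re.finditer(r"[HKR]{3}", seq)]

def overlapping_matches(seq):
    """Motifs found by consuming one letter and looking ahead for two.

    Only the first letter is consumed, so the next search starts at the
    second letter of the motif and overlapping motifs are reported too.
    """
    return [(m.start() + 1, m.group(1) + m.group(2))
            for m in re.finditer(r"([HKR])(?=([HKR]{2}))", seq)]

# Fragment around the HHKK stretch of the first sequence
fragment = "LLEHHKKSIP"
print(consuming_matches(fragment))    # positions are 1-based within the fragment
print(overlapping_matches(fragment))
```

On this fragment the first function should report only HHK, while the second should report both HHK and HKK.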
This identifies everything indicated in the desired output (which has a typo); note the change in the requirements in the comment, to print all motifs for one sequence on one line. So it prints
>NP_001002156.1 n1(7~9) mktavdRRKldllysrykd n2(148~150) lglwntflleHHKksipkdtwnl n3(149~151) glwntfllehHKKsipkdtwnll
>NP_957070.2 n1(163~165) schvydqkniRRRvydalnvlma
>NP_bogus_with_no_motifs no match found
with all motifs for the same sequence name on one line, as wanted. I've added a bogus sequence with no motifs to the input, to test the "no match found" addition; this produced the last line in the output above.
There is still an option to print each motif on a separate line, as was originally wanted: invoke the proc_seq function with an additional, third, argument which is true, like
proc_seq($seq_name, $seq, 1)
and then it'll print
>NP_001002156.1 n1(7~9)
mktavdRRKldllysrykd n2(148~150)
lglwntflleHHKksipkdtwnl n3(149~151)
glwntfllehHKKsipkdtwnll
>NP_957070.2 n1(163~165)
schvydqkniRRRvydalnvlma
>NP_bogus_with_no_motifs no match found
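Since the question used Python, here is a rough Python sketch of the same idea. It is not a line-by-line port of the Perl above: it follows the layout requested in the question (one header line plus one context line per motif) rather than the one-line layout. The names read_fasta, motif_records and the flank parameter are assumptions introduced for this sketch.

```python
import re

# One consumed letter plus a two-letter lookahead, as in the Perl regex
MOTIF = re.compile(r"([HKR])(?=([HKR]{2}))")

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def motif_records(name, seq, flank=10):
    """Return the output lines for one sequence.

    Each motif gives a header with its 1-based start~end position and a
    line with up to `flank` lowercase residues on either side of it.
    """
    out = []
    for n, m in enumerate(MOTIF.finditer(seq), start=1):
        start = m.start()        # 0-based start of the motif
        end = start + 3          # 0-based, exclusive
        pre = seq[max(0, start - flank):start].lower()
        post = seq[end:end + flank].lower()
        out.append(f">{name}_matchnumber_{n}_({start + 1}~{end})")
        out.append(pre + m.group(1) + m.group(2) + post)
    return out or [f">{name}", "no match found"]

def report(path):
    """Print the records for every sequence in the FASTA file at `path`."""
    with open(path) as fh:
        for name, seq in read_fasta(fh):
            print("\n".join(motif_records(name, seq)))
```

Calling report('seq.fasta') would print the records for the file named in the question's script; the flank argument of motif_records changes the width of the context.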
Source: https://stackoverflow.com/questions/54140487/find-all-patterns-in-a-multifasta-file-including-overlapping-motifs