问题
Given these inputs:
my $init_seq = "AAAAAAAAAA" #length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );
I want to generate:
One thousand length-10 tags
Substitution rate for each position in a tag is 0.003
Yielding output like:
AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags
Is there a compact way to do it in Perl?
I am stuck with the logic of this script as core:
#!/usr/bin/perl
my $init_seq = "AAAAAAAAAA" #length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );
$i = 0;
while ($i < length($init_seq)) {
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
print $base;
}
continue {
$i++;
}
回答1:
As a small optimisation, replace:
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
with
$base = $dna[int(rand 4)];
回答2:
EDIT: Assuming substitution rate is in the range 0.001 to 1.000:
As well as $roll
, generate another (pseudo)random number in the range [1..1000], if it is less than or equal to (1000 * $sub_rate) then perform the substitution, otherwise do nothing (i.e. output 'A').
Be aware that you may introduce subtle bias unless the properties of your random number generator are known.
回答3:
Not exactly what you are looking for, but I suggest you take a look at BioPerl's Bio::SeqEvolution::DNAPoint module. It does not take mutation rate as a parameter though. Rather, it asks what the lower bound of sequence identity with the original you want.
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;
my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');
my $evolve = Bio::SeqEvolution::Factory->new (
-rate => 2, # transition/transversion rate
-seq => $seq
-identity => 50 # At least 50% identity with the original
);
my @mutated;
for (1..1000) { push @mutated, $evolve->next_seq }
All 1000 mutated sequences will be stored in the @mutated array, their sequences can be accessed via the seq
method.
回答4:
In the event of a substitution, you want to exclude the current base from the possibilities:
my @other_bases = grep { $_ ne substr($init_seq, $i, 1) } @dna;
$base = @other_bases[int(rand 3)];
Also please see Mitch Wheat's answer for how to implement the substitution rate.
回答5:
I don't know if I understand correctly but I'd do something like this (pseudocode):
digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
# check if we have to mutate
mutate = 1+rand(MAX) <= rate*MAX
if mutate then
# find current A:0 T:1 C:2 G:3
current = digits.find(base[i])
# get a new position
# but ensure that it is not current
new = (j+1+rand(3)) mod 4
base[i] = digits[new]
end if
end for
来源:https://stackoverflow.com/questions/601727/generating-synthetic-dna-sequence-with-substitution-rate