Given a string of length n, how would I (pseudo)randomly sample m substrings of size k such that none of the sampled substrings overlap? Most of my sc
This is a recursive approach in Python. At each step, randomly select one of the remaining partitions of the string, then randomly select a substring of length k from the chosen partition. Replace that partition with the two pieces left on either side of the chosen substring. Filter out partitions shorter than k, and repeat. The list of substrings is returned once it holds m of them, or once no partition of length k or more remains.
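import random

def f(l, k, m, result=None):
    # A mutable default (result=[]) would be shared between separate calls.
    if result is None:
        result = []
    if isinstance(l, str):
        l = [l]
    # Keep only partitions long enough to hold a substring of length k.
    # Building a new list also leaves the caller's list untouched.
    l = [part for part in l if len(part) >= k]
    if len(result) == m or len(l) == 0:
        return result
    part_num = random.randint(0, len(l)-1)
    partition = l.pop(part_num)
    start = random.randint(0, len(partition)-k)
    result.append(partition[start:start+k])
    # Replace the chosen partition with the pieces on either side of the substring.
    l.extend([partition[:start], partition[start+k:]])
    return f(l, k, m, result)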
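To check that sampled substrings really do not overlap, it helps to track positions rather than text. Below is a minimal iterative sketch of the same partition-splitting idea, working on half-open (start, end) intervals. The name `sample_intervals` and the optional `rng` parameter are illustrative assumptions, not part of the answer above.

```python
import random

def sample_intervals(n, k, m, rng=random):
    """Pick up to m non-overlapping half-open intervals [start, start + k) inside range(n)."""
    # Free regions that can still hold a length-k interval, as (lo, hi) half-open pairs.
    free = [(0, n)] if k > 0 and n >= k else []
    chosen = []
    while len(chosen) < m and free:
        # Choose a free region uniformly, then a start position inside it.
        lo, hi = free.pop(rng.randrange(len(free)))
        start = rng.randint(lo, hi - k)  # randint includes both endpoints
        chosen.append((start, start + k))
        # Keep only the leftover pieces that can still hold an interval.
        for piece in ((lo, start), (start + k, hi)):
            if piece[1] - piece[0] >= k:
                free.append(piece)
    return chosen

s = "CCACGCATTTTTGTTCATTGTTCTGGCTTCTTACAAGG"
picks = sample_intervals(len(s), k=5, m=3)
substrings = [s[a:b] for a, b in picks]
```

Because a region is chosen uniformly regardless of its length, this scheme is not guaranteed to give every valid set of positions the same probability. It only guarantees that the chosen intervals are disjoint.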
If there is a character that cannot occur in the input, e.g. X, just overwrite each selected substring with that character and retry whenever a candidate contains it:
my $size = 20;
my $count = 20;
my $mark = 'X';
my $input = 'CCACGCATTTTTGTTCATTGTTCTGGCTTCTTACAAGGTTCAGTAGACTTTGTAACACAGTTGTGTCTCTCACAGATTGGCAGATGTTTGGTAAAGGATTGACTTTTCAGCCAACTCATGGGAAAGTGAAATAATGTAAAAAACAGGAAGAATACAGTTTTAGGCCTTTCAAGTGAGGCATGGCTTTCAGCTCTTGGCAAGAACAGGCAAGGAGATGCAAGTTTTAGGACTCTAAGAGGCTAGGCTTTTCAAAGTGCTTCTCTCCCCTTCACCCTCCTTCAGTTACAGCACCAAGCACCACCGAGGTGTTACCTGCAGCCTCACTCTCTACCTGGTTGTGGGATCCTGCCACTTCCTTAACCCACACTGAGTTCCTTGTGGTTCACAGGGTCACACAGAGGGCTGTAGAGATACAAAAGATATATGTGATTTTATATCACCTATCATATGAAGATATATTTATAAAATAGGAAACATATTAACCACTTATCATTTTATATATTTATGGTTTTATGTGTCAAAAATATATTGTTTCATGTATGTATTAAAGGATAAGTATGTATAAGAGGTTTTATAGATGTGTAAAATTATATATTTATACGTATCTTTACAAATTTAAGAATAAAGGAAGGAAAATTCTCAAAGAGGAATTCAGATATCAAGCAGTGCCCTTTGACCAAGAGCCTTGGTTACAACATACCTACAAAAGTGAACTATCATTGAAAGACCTATGGACACTGGATTTCTCTTTCCTTATTTAGAAGGGCAGTCTGTGTCTTGGAAAAGCATACAGTTTGTTGTATCTTGCTGGACAACAGGAGTCA';
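# In the worst case each pick rules out 2*$size-1 start positions,
# so refuse inputs where $count picks are not guaranteed to fit.
if (2*$size*$count-$size-$count >= length($input)) {
    die "selection may not complete; choose a shorter length or fewer substrings, or provide a longer input string\n";
}
my @substrings;
while (@substrings < $count) {
    my $pos = int rand(length($input)-$size+1);
    # 4-argument substr returns the original text and replaces it with marks.
    push @substrings, substr($input, $pos, $size, $mark x $size)
        if substr($input, $pos, $size) !~ /\Q$mark/;
}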
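For comparison with the Python answer, the same masking idea can be sketched in Python. This is a rough port under stated assumptions, not a definitive implementation: the function name `sample_by_masking` and the `rng` parameter are hypothetical, and a boolean list stands in for the mark character so that no character has to be reserved.

```python
import random

def sample_by_masking(text, size, count, rng=random):
    """Rejection-sample `count` non-overlapping substrings of length `size`."""
    n = len(text)
    # Same worst-case feasibility check as the Perl code above.
    if 2 * size * count - size - count >= n:
        raise ValueError("selection may not complete; choose a shorter length "
                         "or fewer substrings, or provide a longer input string")
    used = [False] * n  # plays the role of the mark character
    picks = []
    while len(picks) < count:
        pos = rng.randrange(n - size + 1)
        # Accept the candidate only if none of its positions is already taken.
        if not any(used[pos:pos + size]):
            used[pos:pos + size] = [True] * size
            picks.append((pos, text[pos:pos + size]))
    return picks
```

Each entry of the result is a `(position, substring)` pair, so the positions can be checked for overlap directly.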