Random sampling of non-overlapping substrings of length k

野的像风 2021-02-10 06:28

Given a string of length n, how would I (pseudo)randomly sample m substrings of size k such that none of the sampled substrings overlap? Most of my sc

2 answers
  • 2021-02-10 06:40

    This is a recursive approach in Python. At each step, randomly select one of the remaining partitions of the string, then randomly select a substring of length k from it. Replace that partition with the two pieces left on either side of the chosen substring, discard any pieces shorter than k, and repeat. The function returns the list of substrings once it holds m of them or once no partition of length at least k remains, so it can return fewer than m.

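    import random
    
    def f(l, k, m, result=None):
        # Default to None: a mutable default list would be shared across calls.
        if result is None:
            result = []
        if isinstance(l, str):
            l = [l]
        # Work on a filtered copy so the caller's list is not mutated and
        # partitions shorter than k (including a too-short input) are dropped.
        l = [part for part in l if len(part) >= k]
        if len(result) == m or len(l) == 0:
            return result
        # Pick a random remaining partition, then a random length-k substring in it.
        part_num = random.randint(0, len(l) - 1)
        partition = l.pop(part_num)
        start = random.randint(0, len(partition) - k)
        result.append(partition[start:start + k])
        # Replace the partition with the pieces on either side of the substring.
        l.extend([partition[:start], partition[start + k:]])
        return f(l, k, m, result)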
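
The same partition-splitting idea can also be written as a loop that tracks index ranges instead of string pieces, which makes it easy to check that the picks do not overlap. This is only a sketch: the name `sample_nonoverlapping`, the `rng` parameter, and the `(start, substring)` return shape are assumptions made for illustration and are not part of the answer above.

```python
import random


def sample_nonoverlapping(s, k, m, rng=random):
    """Return up to m (start, substring) pairs of length k that do not overlap."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Each partition is a half-open interval [lo, hi) of still-free positions in s.
    partitions = [(0, len(s))] if len(s) >= k else []
    picks = []
    while len(picks) < m and partitions:
        # Choose a remaining partition uniformly, then a start position inside it.
        lo, hi = partitions.pop(rng.randrange(len(partitions)))
        start = rng.randint(lo, hi - k)
        picks.append((start, s[start:start + k]))
        # Keep only the leftover pieces that can still hold a length-k substring.
        for piece in ((lo, start), (start + k, hi)):
            if piece[1] - piece[0] >= k:
                partitions.append(piece)
    return picks
```

Like the recursive version, this can return fewer than m picks when the leftover pieces become too short. Note also that partitions are chosen uniformly regardless of their length, so the result is probably not a uniform sample over all valid placements.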
    
  • 2021-02-10 07:01

    If there is a character that cannot occur in the input, e.g. X, overwrite each sampled substring with it and reject any candidate that contains it (Perl):

    my $size = 20;
    my $count = 20;
    my $mark = 'X';
    my $input = 'CCACGCATTTTTGTTCATTGTTCTGGCTTCTTACAAGGTTCAGTAGACTTTGTAACACAGTTGTGTCTCTCACAGATTGGCAGATGTTTGGTAAAGGATTGACTTTTCAGCCAACTCATGGGAAAGTGAAATAATGTAAAAAACAGGAAGAATACAGTTTTAGGCCTTTCAAGTGAGGCATGGCTTTCAGCTCTTGGCAAGAACAGGCAAGGAGATGCAAGTTTTAGGACTCTAAGAGGCTAGGCTTTTCAAAGTGCTTCTCTCCCCTTCACCCTCCTTCAGTTACAGCACCAAGCACCACCGAGGTGTTACCTGCAGCCTCACTCTCTACCTGGTTGTGGGATCCTGCCACTTCCTTAACCCACACTGAGTTCCTTGTGGTTCACAGGGTCACACAGAGGGCTGTAGAGATACAAAAGATATATGTGATTTTATATCACCTATCATATGAAGATATATTTATAAAATAGGAAACATATTAACCACTTATCATTTTATATATTTATGGTTTTATGTGTCAAAAATATATTGTTTCATGTATGTATTAAAGGATAAGTATGTATAAGAGGTTTTATAGATGTGTAAAATTATATATTTATACGTATCTTTACAAATTTAAGAATAAAGGAAGGAAAATTCTCAAAGAGGAATTCAGATATCAAGCAGTGCCCTTTGACCAAGAGCCTTGGTTACAACATACCTACAAAAGTGAACTATCATTGAAAGACCTATGGACACTGGATTTCTCTTTCCTTATTTAGAAGGGCAGTCTGTGTCTTGGAAAAGCATACAGTTTGTTGTATCTTGCTGGACAACAGGAGTCA';
    
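    # After $count-1 picks, each pick can block at most 2*$size-1 of the
    # length($input)-$size+1 start positions; require at least one free start.
    if (2*$size*$count-$size-$count >= length($input)) {
        die "selection may not complete; choose a shorter length or fewer substrings, or provide a longer input string\n";
    }
    
    my @substrings;
    while (@substrings < $count) {
        my $pos = int rand(length($input)-$size+1);
        # 4-argument substr returns the original text and overwrites it with marks,
        # so any later candidate that overlaps it fails the match below.
        push @substrings, substr($input, $pos, $size, $mark x $size)
            if substr($input, $pos, $size) !~ /\Q$mark/;
    }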
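
For comparison with the Python answer, the rejection idea above might look like this in Python. This is a hedged sketch rather than a port: it swaps the marker character for a boolean mask of used positions, so no character has to be reserved. The name `sample_by_rejection` and the `rng` parameter are assumptions made for illustration.

```python
import random


def sample_by_rejection(s, k, m, rng=random):
    """Draw m non-overlapping length-k substrings of s by rejection sampling."""
    n = len(s)
    # Same sufficient condition as the Perl check: a free start position always exists.
    if 2 * k * m - k - m >= n:
        raise ValueError(
            "selection may not complete; use a shorter k, fewer substrings, "
            "or a longer input string"
        )
    used = [False] * n  # used[i] is True once position i belongs to a pick
    picks = []
    while len(picks) < m:
        pos = rng.randrange(n - k + 1)
        if any(used[pos:pos + k]):
            continue  # candidate overlaps an earlier pick; draw again
        used[pos:pos + k] = [True] * k
        picks.append(s[pos:pos + k])
    return picks
```

The up-front check is conservative: it rejects some inputs for which a valid selection exists, in exchange for guaranteeing that the loop terminates.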
    