How to write a method that counts the most common substring in a string in ruby?

前端 未结 2 1671
广开言路
广开言路 2021-01-23 13:34

I have this program with a class DNA. The program counts the most frequent k-mer in a string. So, it is looking for the most common substring in a string with a length of k.

相关标签:
2条回答
  • 2021-01-23 13:47

    Code

    def most_frequent_substrings(str, k)
      (0..str.size-k).each_with_object({}) do |i,h|
        b = [] 
        str[i..-1].scan(Regexp.new str[i,k]) { b << Regexp.last_match.begin(0) + i }
        (h[b.size] ||= []) << b
      end.max_by(&:first).last.each_with_object({}) { |a,h| h[str[a.first,k]] = a } 
    end
    

    Example

    str = "ABBABABBABCATSABBABB"
    most_frequent_substrings(str, 4)
      #=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
    

    This shows that the most frequently-occurring 4-character substring of strappears 3 times. There are two such substrings: "ABBA" and "BBAB". "ABBA" begins at offsets (into str) 0, 5 and 14, "BBAB" substrings begin at offsets 1, 6 and 15.

    Explanation

    For the example above the steps are as follows.

    k = 4
    n = str.size - k
      #=> 20 - 4 => 16
    e = (0..n).each_with_object([])
      #<Enumerator: 0..16:each_with_object([])> 
    

    We can see the values that will be generated by this enumerator by converting it to an array.

    e.to_a
      #=> [[0, []], [1, []], [2, []], [3, []], [4, []], [5, []], [6, []], [7, []], [8, []],
     #     [9, []], [10, []], [11, []], [12, []], [13, []], [14, []], [15, []], [16, []]]
    

    Note the empty array contained in each element will be modified as the array is built. Continuing, the first element of e is passed to the block and the block variables are assigned using parallel assignment:

    i,a = e.next
      #=> [0, []] 
    i #=> 0 
    a #=> [] 
    

    We are now considering the substring of size 4 that begins at str offset i #=> 0, which is seen to be "ABBA". Now the block calculation is performed.

    b = []
    r = Regexp.new str[i,k]
      #=> Regexp.new str[0,4]
      #=> Regexp.new "ABBA"
      #=> /ABAB/
    str[i..-1].scan(r) { b << Regexp.last_match.begin(0) + i }
      #=> "ABBABABBABCATSABBABB".scan(r) { b << Regexp.last_match.begin(0) + i } 
    b #=> [0, 5, 14]
    

    We next have

    (h[b.size] ||= []) << b
    

    which becomes

    (h[b.size] = h[b.size] || []) << b
      #=> (h[3] = h[3] || []) <<  [0, 5, 14]
    

    Since h has no key 3, h[3] on the right side equals nil. Continuing,

      #=> (h[3] = nil || []) <<  [0, 5, 14]
      #=> (h[3] = []) <<  [0, 5, 14]
    h #=> { 3=>[[0, 5, 14]] }
    

    Notice that we throw away scan's return value. All we need is b

    This tells us the "ABBA" appears thrice in str, beginning at offsets 0, 5 and 14.

    Now observe

    e.to_a
      #=> [[0, [[0, 5, 14]]],  [1, [[0, 5, 14]]],  [2, [[0, 5, 14]]],
      #    ...
      #    [16, [[0, 5, 14]]]]
    

    After all elements of e have been passed to the block, the block returns

    h #=> {3=>[[0, 5, 14], [1, 6, 15]],
      #    1=>[[2], [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]],
      #    2=>[[4, 16], [5, 14], [6, 15]]} 
    

    Consider substrings that appear just once: h[1]. One of those is [2]. This pertains to the 4-character substring beginning at str offset 2:

    str[2,4]
      #=> "BABA"
    

    That is found to be the only instance of that substring. Similarly, among the substrings that appear twice is str[4,4] = str[16,4] #=> "BABB", given by h[2][0] #=> [4, 16].

    Next we determine the greatest frequency of a substring of length 4:

    c = h.max_by(&:first)
      #=> [3, [[0, 5, 14], [1, 6, 15]]] 
    

    (which could also be written c = h.max_by { |k,_| k }).

    d = c.last
      #=> [[0, 5, 14], [1, 6, 15]]
    

    For convenience, convert d to a hash:

    d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
      #=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
    

    and return that hash from the method.

    There is one detail that deserves mention. It is possible that d will contain two or more arrays that reference the same substring, in which case the value of the associated key (the substring) will equal the last of those arrays. Here's a simple example.

    str = "AAA"
    k = 2
    

    In this case the array d above will equal

    d = [[0], [1]]
    

    Both of these reference str[0,2] #=> str[1,2] #=> "AA". In building the hash the first is overwritten by the second:

    d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
      #=> {"AA"=>[1]}
    
    0 讨论(0)
  • 2021-01-23 13:58

    Something like this?

      require 'set'
    
      def count_kmer(k)
        max_kmers = kmers(k)
                        .each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
                        .group_by { |_,v| v }
                        .max
        [Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
      end
    
      def kmers(k)
        nucleotide.chars.each_cons(k).map(&:join)
      end
    

    EDIT: Here's the full text of the class:

    require 'set'
    
    class DNA
      def initialize (nucleotide)
        @nucleotide = nucleotide
      end
    
      def length
        @nucleotide.length
      end
    
      def count_kmer(k)
        max_kmers = kmers(k)
                        .each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
                        .group_by { |_,v| v }
                        .max
        [Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
      end
    
      def kmers(k)
        nucleotide.chars.each_cons(k).map(&:join)
      end
    
      protected
      attr_reader :nucleotide
    end
    

    This produces the following output, using Ruby 2.2.1, using the class and method you specified:

    >> dna1 = DNA.new('AACCAATCCG')
    => #<DNA:0x007fe15205bc30 @nucleotide="AACCAATCCG">
    >> dna1.count_kmer(1)
    => [#<Set: {"A", "C"}>, 4]
    >> dna1.count_kmer(2)
    => [#<Set: {"AA", "CC"}>, 2]
    

    As a bonus, you can also do:

    >> dna1.kmers(2)
    => ["AA", "AC", "CC", "CA", "AA", "AT", "TC", "CC", "CG"]
    
    0 讨论(0)
提交回复
热议问题