How to write a method that counts the most common substring in a string in Ruby?


I have a program with a class DNA. The program counts the most frequent k-mer in a string; that is, it looks for the most common substring of length k.

2 Answers

眼角桃花

2021-01-23 13:47
Code
```
def most_frequent_substrings(str, k)
  (0..str.size-k).each_with_object({}) do |i,h|
    b = [] 
    str[i..-1].scan(Regexp.new str[i,k]) { b << Regexp.last_match.begin(0) + i }
    (h[b.size] ||= []) << b
  end.max_by(&:first).last.each_with_object({}) { |a,h| h[str[a.first,k]] = a } 
end
```
Example
```
str = "ABBABABBABCATSABBABB"
most_frequent_substrings(str, 4)
  #=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
```
This shows that the most frequently occurring 4-character substrings of str appear 3 times each. There are two such substrings: "ABBA" and "BBAB". "ABBA" begins at offsets (into str) 0, 5 and 14; "BBAB" begins at offsets 1, 6 and 15.

Explanation

For the example above the steps are as follows.
```
k = 4
n = str.size - k
  #=> 20 - 4 => 16
e = (0..n).each_with_object({})
  #=> #<Enumerator: 0..16:each_with_object({})>
```
We can see the values that will be generated by this enumerator by converting it to an array.
```
e.to_a
  #=> [[0, {}], [1, {}], [2, {}], [3, {}], [4, {}], [5, {}], [6, {}], [7, {}], [8, {}],
  #    [9, {}], [10, {}], [11, {}], [12, {}], [13, {}], [14, {}], [15, {}], [16, {}]]
```
Note that the empty hash contained in each element is one and the same object, and it will be modified as the block is executed. Continuing, the first element of e is passed to the block and the block variables are assigned using parallel assignment:
```
i,h = e.next
  #=> [0, {}]
i #=> 0
h #=> {}
```
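The enumerator pairs every index with the single object handed to each_with_object, which is why a change made during one iteration is visible in all elements. A minimal sketch on a shorter range (the range `0..2` and the key `:seen` are illustrative and not part of the method above):

```ruby
e = (0..2).each_with_object({})

i, h = e.next   # first pair: index 0 and the shared object
h[:seen] = i    # mutate the object through this pair

pairs = e.to_a  # every pair now carries the same, modified object
same_object = pairs.all? { |_, memo| memo.equal?(h) }
```

Here `same_object` is true: no copy is made per element, so the hash built up in the block is the one returned at the end.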
We are now considering the substring of size 4 that begins at str offset i #=> 0, which is seen to be "ABBA". Now the block calculation is performed.
```
b = []
r = Regexp.new str[i,k]
  #=> Regexp.new str[0,4]
  #=> Regexp.new "ABBA"
  #=> /ABBA/
str[i..-1].scan(r) { b << Regexp.last_match.begin(0) + i }
  #=> "ABBABABBABCATSABBABB".scan(r) { b << Regexp.last_match.begin(0) + i } 
b #=> [0, 5, 14]
```
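The offset-collecting step can be isolated as a small helper. This is only a sketch: the name `match_offsets` is hypothetical, and the call to Regexp.escape is an addition that the method above does not make. It matters only if the substring may contain regex metacharacters such as "." or "*"; for DNA letters the result is the same.

```ruby
# Hypothetical helper: offsets (into str) of the non-overlapping
# occurrences of `text`, searching from offset `start` onward.
def match_offsets(str, text, start = 0)
  pattern = Regexp.new(Regexp.escape(text)) # escape is an added safeguard
  offsets = []
  # Regexp.last_match.begin(0) is relative to the slice, so add `start` back.
  str[start..-1].scan(pattern) { offsets << Regexp.last_match.begin(0) + start }
  offsets
end

str = "ABBABABBABCATSABBABB"
from_start = match_offsets(str, "ABBA")     # the search performed for i = 0
from_five  = match_offsets(str, "ABBA", 5)  # the search performed for i = 5
dots       = match_offsets("a.b.a.b", ".")  # "." is matched literally
```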
We next have
```
(h[b.size] ||= []) << b
```
which becomes
```
(h[b.size] = h[b.size] || []) << b
  #=> (h[3] = h[3] || []) <<  [0, 5, 14]
```
Since h has no key 3, h[3] on the right side equals nil. Continuing,
```
  #=> (h[3] = nil || []) <<  [0, 5, 14]
  #=> (h[3] = []) <<  [0, 5, 14]
h #=> { 3=>[[0, 5, 14]] }
```
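The same "create the array on first use" behaviour can also be obtained from a hash with a default block. A brief sketch comparing the two styles (the variable names `explicit` and `with_default` are illustrative):

```ruby
# Style used in the method: create the array on demand with ||=.
explicit = {}
(explicit[3] ||= []) << [0, 5, 14]
(explicit[3] ||= []) << [1, 6, 15]

# Alternative: let the hash create and store the array when a key is missing.
with_default = Hash.new { |hash, key| hash[key] = [] }
with_default[3] << [0, 5, 14]
with_default[3] << [1, 6, 15]

same_contents = (explicit == with_default)
```

Both hashes end up with the same contents; which form reads better is a matter of taste.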
Notice that we discard scan's return value; all we need is b, which tells us that "ABBA" appears three times in str, beginning at offsets 0, 5 and 14.

Now observe
```
e.to_a
  #=> [[0, {3=>[[0, 5, 14]]}],  [1, {3=>[[0, 5, 14]]}],  [2, {3=>[[0, 5, 14]]}],
  #    ...
  #    [16, {3=>[[0, 5, 14]]}]]
```
After all elements of e have been passed to the block, each_with_object returns the hash
```
h #=> {3=>[[0, 5, 14], [1, 6, 15]],
  #    1=>[[2], [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]],
  #    2=>[[4, 16], [5, 14], [6, 15]]} 
```
Consider substrings that appear just once: h[1]. One of those is [2]. This pertains to the 4-character substring beginning at str offset 2:
```
str[2,4]
  #=> "BABA"
```
That is found to be the only instance of that substring. Similarly, among the substrings that appear twice is str[4,4] = str[16,4] #=> "BABB", given by h[2][0] #=> [4, 16].

Next we determine the greatest frequency of a substring of length 4:
```
c = h.max_by(&:first)
  #=> [3, [[0, 5, 14], [1, 6, 15]]] 
```
(which could also be written c = h.max_by { |k,_| k }).
```
d = c.last
  #=> [[0, 5, 14], [1, 6, 15]]
```
For convenience, convert d to a hash:
```
d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
  #=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
```
and return that hash from the method.

There is one detail that deserves mention. It is possible that d will contain two or more arrays that reference the same substring, in which case the value of the associated key (the substring) will equal the last of those arrays. Here's a simple example.
```
str = "AAA"
k = 2
```
In this case the array d above will equal
```
d = [[0], [1]]
```
Both of these reference str[0,2] #=> str[1,2] #=> "AA". In building the hash the first is overwritten by the second:
```
d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
  #=> {"AA"=>[1]}
```
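The "AAA" example also shows that scan reports only non-overlapping matches: starting at offset 0, "AA" is found once, not twice. If overlapping occurrences should be counted, as is usual for k-mers, one option is to record the offset of every k-character window directly. The following is a sketch of that alternative, not the method above; the name `most_frequent_substrings_overlapping` is hypothetical.

```ruby
# Hypothetical alternative: map each k-character substring to all of its
# offsets (overlaps included), then keep the substrings with the most offsets.
def most_frequent_substrings_overlapping(str, k)
  return {} if k <= 0 || k > str.size
  offsets = Hash.new { |hash, key| hash[key] = [] }
  (0..str.size - k).each { |i| offsets[str[i, k]] << i }
  best = offsets.values.map(&:size).max
  offsets.select { |_, positions| positions.size == best }
end

overlapping = most_frequent_substrings_overlapping("AAA", 2)
original    = most_frequent_substrings_overlapping("ABBABABBABCATSABBABB", 4)
```

For "AAA" this yields both offsets of "AA", and for the original example string it gives the same two substrings and offsets as before, because their occurrences there do not overlap.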

暖寄归人

2021-01-23 13:58

Something like this?

  require 'set'

  def count_kmer(k)
    max_kmers = kmers(k)
                    .each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
                    .group_by { |_,v| v }
                    .max
    [Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
  end

  def kmers(k)
    nucleotide.chars.each_cons(k).map(&:join)
  end

EDIT: Here's the full text of the class:

require 'set'

class DNA
  def initialize(nucleotide)
    @nucleotide = nucleotide
  end

  def length
    @nucleotide.length
  end

  def count_kmer(k)
    max_kmers = kmers(k)
                    .each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
                    .group_by { |_,v| v }
                    .max
    [Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
  end

  def kmers(k)
    nucleotide.chars.each_cons(k).map(&:join)
  end

  protected
  attr_reader :nucleotide
end

With Ruby 2.2.1, the class and method you specified produce the following output:

>> dna1 = DNA.new('AACCAATCCG')
=> #<DNA:0x007fe15205bc30 @nucleotide="AACCAATCCG">
>> dna1.count_kmer(1)
=> [#<Set: {"A", "C"}>, 4]
>> dna1.count_kmer(2)
=> [#<Set: {"AA", "CC"}>, 2]

As a bonus, you can also do:

>> dna1.kmers(2)
=> ["AA", "AC", "CC", "CA", "AA", "AT", "TC", "CC", "CG"]
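
On Ruby 2.7 or later, the counting step could probably be shortened with Enumerable#tally. A sketch, written as a free-standing method rather than as part of the DNA class (the name `most_frequent_kmers` and its `sequence` parameter are assumptions for illustration):

```ruby
require 'set'

# Hypothetical stand-alone variant of count_kmer; assumes Ruby >= 2.7 for tally.
def most_frequent_kmers(sequence, k)
  # Count every window of k consecutive characters (overlaps included).
  counts = sequence.chars.each_cons(k).map(&:join).tally
  best = counts.values.max
  # Return the set of k-mers that reach the highest count, and that count.
  [Set.new(counts.select { |_, n| n == best }.keys), best]
end

ones = most_frequent_kmers('AACCAATCCG', 1)
twos = most_frequent_kmers('AACCAATCCG', 2)
```

For the sequence used above, this returns the same sets and counts as count_kmer.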
