Algorithm to find the most common substrings in a string


Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most


5 answers

情话喂你

2020-12-01 04:20
This is a task similar to the Nussinov algorithm, and actually even simpler, as we do not allow any gaps, insertions or mismatches in the alignment.

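For a string A of length N (indexed from 0), define a table F[-1 .. N-1, -1 .. N-1] and fill it in using the following rules:
```
  // F is zero-initialized; row -1 and column -1 stay 0
  for i = 0 to N - 1
    for j = 0 to N - 1
      if i != j            // skip the diagonal (trivial self-match)
        {
          if A[i] == A[j]
            F[i,j] = F[i-1,j-1] + 1;
          else
            F[i,j] = 0;
        }
```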
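For instance, for B A O B A B (positions 0 to 5), the largest value in the table is F[1,4] = F[4,1] = 2, which marks the repeated substring BA ending at positions 1 and 4.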

This runs in O(n^2) time. The array is assumed to be zero-initialized at the start. The largest values in the table point to the end positions of the longest self-matching substrings (i is the end of one occurrence, j the end of another). I have added a condition to exclude the diagonal, which is the longest but probably not an interesting self-match.
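As a minimal sketch of the fill rule in Python: the function name `self_match_table` is hypothetical, and the extra -1 row and column are emulated with a bounds check instead of being stored.

```python
def self_match_table(a):
    """Build F, where F[i][j] is the length of the matching run that
    ends at positions i and j (i != j). The diagonal is left at 0."""
    n = len(a)
    f = [[0] * n for _ in range(n)]  # zero-initialized table
    for i in range(n):
        for j in range(n):
            if i != j and a[i] == a[j]:
                # Row -1 and column -1 are treated as all zeros.
                prev = f[i - 1][j - 1] if i > 0 and j > 0 else 0
                f[i][j] = prev + 1
    return f


table = self_match_table("BAOBAB")
for row in table:
    print(" ".join(str(v) for v in row))
```

For BAOBAB the printed table is symmetric, with zeros on the diagonal.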

Thinking about it more, this table is symmetric about the diagonal, so it is enough to compute only half of it. Also, the array is zero-initialized, so assigning zero is redundant. That leaves:
```
  for i = 0 to N - 1
    for j = i + 1 to N - 1
      if A[i] == A[j]
         F[i,j] = F[i-1,j-1] + 1;
```
Shorter but potentially more difficult to understand. The computed table contains all matches, short and long. You can add further filtering as you need.

In the next step, you need to recover the strings by following the diagonal up and to the left from each non-zero cell. During this step it is also trivial to use a hashmap to count the number of self-matches for the same string. With a typical string and a reasonable minimum length, only a small number of table cells will be processed through this map.

I think that using a hashmap directly actually requires O(n^3) time, as the key strings must be compared for equality on each access, and this comparison is probably O(n).
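The recovery step could be sketched in Python as below. This is one possible reading of the description, not the only one: the function name `repeated_substrings` and the `min_len` parameter are assumptions, only the upper half of the table is filled, and the counter records pairs of matching end positions, so a substring occurring m times is counted m(m-1)/2 times.

```python
from collections import Counter


def repeated_substrings(a, min_len=2):
    """Map each repeated substring of length >= min_len to the number
    of position pairs (i < j) at which two of its occurrences end."""
    n = len(a)
    f = [[0] * n for _ in range(n)]
    pairs = Counter()
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] == a[j]:
                f[i][j] = (f[i - 1][j - 1] if i > 0 else 0) + 1
                # A cell holding k means every suffix of length 1..k
                # ending at i also ends at j; record the long enough ones.
                for length in range(min_len, f[i][j] + 1):
                    pairs[a[i - length + 1:i + 1]] += 1
    return pairs


print(repeated_substrings("BAOBAB"))  # Counter({'BA': 1})
```

Slicing out each key and hashing it costs time proportional to its length, which is where the extra factor of n in the estimate above comes from.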

深忆病人

2020-12-01 04:25

Perl, O(n²) solution

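```
use strict;
use warnings;

my $str = "hello world this is hello world. hello world repeats three times in this string!";

# Split on runs of non-letters to get the words.
my @words = split(/[^a-z]+/i, $str);
my $display = 10;   # number of results to print
my %ocur;           # word sequence (joined with ':') => occurrence count

# calculate: count every contiguous run of words

for my $ix (0 .. $#words) {
  for my $i ($ix .. $#words) {
    $ocur{ join(':', @words[$ix .. $i]) }++;
  }
}

# display: highest count first; on a tie, the longer chain of words first
# (tr/:// counts the separators, i.e. number of words minus one)

foreach (sort { $ocur{$b} <=> $ocur{$a} || ($b =~ tr/://) <=> ($a =~ tr/://) } keys %ocur) {
  print "$_: $ocur{$_}\n";
  last if !--$display;
}
```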

This displays the 10 most common word sequences (in case of a tie, the longest chain of words is shown first). Change $display to 1 to get only the top result.
For n words there are n(n+1)/2 iterations.


无人共我

2020-12-01 04:28
Python. This is somewhat quick and dirty, with the data structures doing most of the lifting.
```
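from collections import Counter

accumulator = Counter()
text = 'hello world this is hello world.'
# Try every substring length, from single characters up to the whole text.
for length in range(1, len(text) + 1):
    # The last valid start index is len(text) - length, hence the + 1.
    for start in range(len(text) - length + 1):
        accumulator[text[start:start + length]] += 1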
```
The Counter structure is a hash-backed dictionary designed for counting how many times you've seen something. Adding to a nonexistent key will create it, while retrieving a nonexistent key will give you zero instead of an error. So all you have to do is iterate over all the substrings.
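To read the result off the counter, `Counter.most_common` returns the entries sorted by count. Single characters will usually dominate the raw counts, so the sketch below adds a minimum-length filter; the function name `most_common_substrings` and the `min_length` and `top` parameters are assumptions, not part of the answer above.

```python
from collections import Counter


def most_common_substrings(text, min_length=2, top=3):
    """Count every substring of at least min_length characters and
    return the `top` most frequent ones as (substring, count) pairs."""
    counts = Counter(
        text[start:start + length]
        for length in range(min_length, len(text) + 1)
        for start in range(len(text) - length + 1)
    )
    return counts.most_common(top)


print(most_common_substrings("hello world this is hello world."))
```

This still enumerates all O(n^2) substrings, so it is only practical for short inputs.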

离开以前

2020-12-01 04:28

Every substring of length >= 2 contains a substring of length 2, and that shorter substring occurs in the text at least as many times. So we only need to investigate substrings of length 2.

```
val s = "hello world this is hello world. hello world repeats three times in this string!"

val li = s.sliding (2, 1).toList
// li: List[String] = List(he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " t", th, hi, is, "s ", " i", is, "s ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, d., ". ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " r", re, ep, pe, ea, at, ts, "s ", " t", th, hr, re, ee, "e ", " t", ti, im, me, es, "s ", " i", in, "n ", " t", th, hi, is, "s ", " s", st, tr, ri, in, ng, g!)

val uniques = li.toSet
uniques.toList.map (u => li.count (_ == u))
// res18: List[Int] = List(1, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 1, 3, 2, 1, 3, 1, 3, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 2, 4, 1, 2, 2, 1)

uniques.toList(6)
// res19: String = "s "
```
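The same length-2 count can be sketched in Python without picking the index of the maximum by hand. The function name `most_common_bigram` is hypothetical, and the input is assumed to have at least two characters.

```python
from collections import Counter


def most_common_bigram(s):
    """Return the most frequent 2-character substring and its count."""
    counts = Counter(s[i:i + 2] for i in range(len(s) - 1))
    return counts.most_common(1)[0]


s = "hello world this is hello world. hello world repeats three times in this string!"
print(most_common_bigram(s))  # ('s ', 5)
```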


半阙折子戏

2020-12-01 04:35

This is just pseudocode, and maybe not the most beautiful solution, but I would solve it like this:

```
function separateWords(String incomingString) returns StringArray{
  //Code
}

function findMax(Map map) returns String{
  //Code
}

function mainAlgorithm(String incomingString) returns String{
    StringArray sArr = separateWords(incomingString);
    Map<String, Integer> map; //init with no content
    for(word: sArr){
        Integer count = map.get(word);
        if(count == null){
            map.put(word, 1);
        } else {
            //remove the old entry first if necessary
            map.put(word, count + 1); //count++ would store the old value
        }
    }
    return findMax(map);
}
```

Here map holds key-value pairs, like a Java HashMap.
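A runnable Python sketch of the same idea follows. The two helper bodies are left open in the pseudocode, so the rules used here are assumptions: words are split on runs of non-letters, and a tie goes to the word seen first.

```python
import re
from collections import Counter


def separate_words(incoming):
    """Split on runs of non-letters (an assumed tokenization rule)."""
    return [w for w in re.split(r"[^A-Za-z]+", incoming) if w]


def most_common_word(incoming):
    """Return the most frequent word; raises ValueError if there is none."""
    counts = Counter(separate_words(incoming))
    # max() keeps the first key with the highest count, so ties go
    # to the word that appeared first in the input.
    return max(counts, key=counts.get)


print(most_common_word("hello world this is hello world. hello world"))  # hello
```

Note that this counts single words only, not multi-word phrases.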
