Most common substring of length X

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.

For example, if s="aoaoa" and X=3, the answer is "aoa", which occurs twice (the two occurrences overlap).

8 answers

粉色の甜心

2021-02-09 14:15

You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the substring; you can compute it incrementally from the previous substring's hash using just 2 xors (xor out the character that leaves the window, xor in the one that enters). (See the Wikipedia entry for better rolling hashes than xor.) Compute the hashes of your n-x+1 substrings using the rolling hash in O(n) time. If there are no collisions, the answer is clear: it is the substring whose hash occurs most often. If collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
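As a rough sketch of the rolling-hash idea, the snippet below uses a polynomial hash rather than xor (xor ignores character order, so "aoa" and "aao" would collide). The function name `window_hashes` and the `base` and `mod` values are illustrative assumptions, not something the answer specifies.

```python
from collections import Counter

def window_hashes(s, x, base=257, mod=(1 << 61) - 1):
    """Yield (start_index, hash) for every length-x window of s."""
    if x <= 0 or x > len(s):
        return
    # Hash of the first window: s[0]*base^(x-1) + ... + s[x-1], taken mod `mod`.
    h = 0
    for ch in s[:x]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    top = pow(base, x - 1, mod)  # weight of the character that leaves the window
    for i in range(1, len(s) - x + 1):
        # Drop s[i-1], shift by one position, append s[i+x-1]: O(1) per window.
        h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + x - 1])) % mod
        yield i, h

# Count how often each hash value occurs among the n-x+1 windows.
counts = Counter(h for _, h in window_hashes("aoaoa", 3))
print(counts.most_common(1)[0][1])  # -> 2
```

The counts are per hash value, not per substring, so they are only trustworthy if no two distinct windows share a hash.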

Update:

Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.

If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) < 1/2^{log(n)+1} = 1/(2n) if s != t], then the probability that the most frequent x-length substring in s has a collision with any of the <=n other length-x substrings of s is at most 1/2 (union bound: n * 1/(2n)). So if there is a collision, pick a new random hash function and retry; you will need only a constant expected number of tries before you succeed.
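The hash-verify-retry loop described above might look like the following sketch. It makes several assumptions: a polynomial hash with a randomly chosen base stands in for the pairwise independent family, the 61-bit modulus is far wider than log(n)+1 bits, and repeated `str.find` calls stand in for a linear-time string search. The name `most_common_randomized` is made up for illustration.

```python
import random
from collections import defaultdict

def most_common_randomized(s, x, mod=(1 << 61) - 1):
    """Return (substring, count) for a most frequent length-x substring of s."""
    if x <= 0 or x > len(s):
        raise ValueError("x must be between 1 and len(s)")
    while True:
        base = random.randrange(256, mod)   # pick a fresh random hash function
        top = pow(base, x - 1, mod)
        h = 0
        for ch in s[:x]:
            h = (h * base + ord(ch)) % mod
        count = defaultdict(int)            # hash -> number of windows with it
        first = {}                          # hash -> start index of one such window
        for i in range(len(s) - x + 1):
            if i > 0:
                h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + x - 1])) % mod
            count[h] += 1
            first.setdefault(h, i)
        best = max(count, key=count.get)    # the top hash
        candidate = s[first[best]:first[best] + x]
        # Verify: count the real (overlapping) occurrences of the candidate.
        actual, pos = 0, s.find(candidate)
        while pos != -1:
            actual += 1
            pos = s.find(candidate, pos + 1)
        if actual == count[best]:
            return candidate, actual
        # Otherwise a collision inflated the top count; loop and retry.

print(most_common_randomized("aoaoa", 3))  # -> ('aoa', 2)
```

Collisions can only merge buckets and so only raise a bucket's count, which is why a top bucket that passes verification cannot be beaten by any other substring.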

Now we only need a randomized pairwise independent rolling hash algorithm.

Update2:

Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.
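The 2log(n) figure can be checked with a union bound over all pairs. This assumes an idealized hash with b output bits in which any two distinct substrings collide with probability at most 2^{-b}:

```latex
\Pr[\text{some pair of substrings collides}]
  \le \binom{n}{2} \cdot 2^{-b}
  < \frac{n^2}{2} \cdot 2^{-b}
  = \frac{1}{2}
  \quad \text{when } b = 2\log_2 n .
```

So with 2log(n) bits each attempt succeeds with probability at least 1/2, and the expected number of retries stays constant.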

既然无缘

2021-02-09 14:15

You can build a tree of sub-strings. The idea is to organise your sub-strings like a telephone book. You then look up the sub-string and increase its count by one.

In your example above, the tree will have sections (nodes) starting with the letters: 'a' and 'o'. 'a' appears three times and 'o' appears twice. So those nodes will have a count of 3 and 2 respectively.

Next, under the 'a' node a sub-node of 'o' will appear corresponding to the sub-string 'ao'. This appears twice. Under the 'o' node 'a' also appears twice.

We carry on in this fashion until we reach the end of the string.

A representation of the tree for 'abac' might be (nodes on the same level are separated by a comma, sub-nodes are in brackets, counts appear after the colon).

a:2(b:1(a:1(c:1())),c:1()),b:1(a:1(c:1())),c:1()

If the tree is drawn out it will be a lot more obvious! What this all says, for example, is that the string 'aba' appears once and the string 'a' appears twice. Storage is greatly reduced and, more importantly, retrieval is greatly sped up (compare this to keeping a list of sub-strings).

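To find out which sub-string of length X is most repeated, do a depth-first search of the tree; every time a node at depth X is reached, note the count and keep track of the highest one.

I'm not sure of the exact running time, but it cannot be O(log(n)), since every character has to be read at least once. If the tree is only built down to depth X, inserting the n-X+1 sub-strings takes O(n*X) steps, which is better than O(n^2) when X is small compared to n.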
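A minimal sketch of the tree approach, assuming the tree is only built down to depth X (so a node's count is the number of length-X windows that pass through it, not the number of all occurrences of its prefix). The nested-dict layout and the names `build_trie` and `most_common` are illustrative choices.

```python
def build_trie(s, x):
    """Insert every length-x window of s into a trie made of nested dicts."""
    root = {"count": 0, "children": {}}
    for i in range(len(s) - x + 1):
        node = root
        for ch in s[i:i + x]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1  # one more window passes through this prefix
    return root

def most_common(root, x):
    """Depth-first search for the depth-x node with the highest count."""
    best = ("", 0)
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) == x:
            if node["count"] > best[1]:
                best = (prefix, node["count"])
            continue
        for ch, child in node["children"].items():
            stack.append((child, prefix + ch))
    return best

print(most_common(build_trie("aoaoa", 3), 3))  # -> ('aoa', 2)
```

Each window costs X dictionary steps to insert, so building the tree takes O(n*X) time under this depth limit.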

没有蜡笔的小新

2021-02-09 14:16
It should be O(n*m), where m is the average length of a string in the list. For very small values of m, the algorithm approaches O(n).
- Build a hashtable of counts, keyed by string, for the required string length.
- Iterate over your collection of strings, updating the hashtable accordingly and keeping the current highest count in an integer variable separate from the hashtable.
- Done.
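The steps above can be sketched as follows, assuming the "collection of strings" is the set of length-X windows of s (so m equals X). The function name `most_common_counting` is hypothetical.

```python
def most_common_counting(s, x):
    """Count every length-x window of s, tracking the running maximum."""
    counts = {}                   # substring -> occurrences seen so far
    best, best_count = None, 0    # running maximum, kept outside the table
    for i in range(len(s) - x + 1):
        sub = s[i:i + x]          # slicing and hashing cost O(x) per window
        counts[sub] = counts.get(sub, 0) + 1
        if counts[sub] > best_count:
            best, best_count = sub, counts[sub]
    return best, best_count

print(most_common_counting("aoaoa", 3))  # -> ('aoa', 2)
```

Keeping the running maximum avoids a second pass over the table at the end.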

走了就别回头了

2021-02-09 14:25

Naive solution in Python

```
from collections import defaultdict
from operator import itemgetter

def naive(s, X):
    # Map each substring of length X to the number of times it occurs in s.
    freq = defaultdict(int)
    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
    # Return the (substring, count) pair with the largest count.
    return max(freq.items(), key=itemgetter(1))

print(naive("aoaoa", 3))
# -> ('aoa', 2)
```

In plain English

Create a mapping: substring of length X -> how many times it occurs in the string s
```
for i in range(len(s) - X + 1):
    freq[s[i:i+X]] += 1
```
Find the pair in the mapping with the largest second item (frequency)
```
max(freq.items(), key=itemgetter(1))
```

死守一世寂寞

2021-02-09 14:25

LZW algorithm does this

This is essentially what the Lempel-Ziv-Welch compression algorithm (LZW, used in the GIF image format) does: it finds repeated byte sequences and replaces them with shorter codes.
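For reference, a minimal sketch of the LZW compression step, assuming a 256-entry initial table and unbounded code width (real implementations such as GIF's limit the code size). The function name `lzw_compress` is illustrative. Note that the table records which sequences have been seen, not how often, so a separate counting step would still be needed to answer the original question.

```python
def lzw_compress(data):
    """Return (codes, table) for the LZW encoding of the string `data`."""
    table = {chr(i): i for i in range(256)}  # start with all single characters
    next_code = 256
    w = ""
    codes = []
    for c in data:
        wc = w + c
        if wc in table:
            w = wc                      # keep extending the current match
        else:
            codes.append(table[w])      # emit the code for the longest match
            table[wc] = next_code       # learn the new, longer sequence
            next_code += 1
            w = c
    if w:
        codes.append(table[w])
    return codes, table

codes, table = lzw_compress("aoaoa")
print(codes)                                # -> [97, 111, 256, 97]
print([k for k in table if len(k) > 1])     # -> ['ao', 'oa', 'aoa']
```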

LZW on Wikipedia

无人及你

2021-02-09 14:35

Here is a version I did in C. Hope that it helps.

```
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *string = "aoaoa";
    const char *maxstring = NULL;
    size_t n = 3, len = strlen(string), i, j;
    unsigned int matchcount, maxcount = 0;

    /* Guard against n > len, which would make len - n wrap around. */
    if (n == 0 || n > len) {
        fprintf(stderr, "substring length must be between 1 and %zu\n", len);
        return 1;
    }

    /* Brute force: compare every window of length n with every window. */
    for (i = 0; i <= len - n; i++) {
        matchcount = 0;
        for (j = 0; j <= len - n; j++) {
            /* Compare the windows in place; no temporary copies are needed. */
            if (strncmp(string + i, string + j, n) == 0)
                matchcount++;
        }
        if (matchcount > maxcount) {
            maxstring = string + i;
            maxcount = matchcount;
        }
    }

    /* maxstring points into string, so print only its first n characters. */
    printf("max string: \"%.*s\", count: %u\n", (int)n, maxstring, maxcount);

    return 0;
}
```
