What algorithm can you use to find duplicate phrases in a string?

前端未结

关注

 5  1142

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.

相关标签:

5条回答

死守一世寂寞

2020-12-15 12:04

Like the earlier folks mention that suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test js applicaton embedded to test strings and work through examples.

0 讨论(0)
发布评论:

提交评论
- 加载中...
灰色年华

2020-12-15 12:05

Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.

0 讨论(0)
发布评论:

提交评论
- 加载中...
猫巷女王i

2020-12-15 12:07

Like jmah said, you can use suffix trees/suffix arrays for this.

There is a description of an algorithm you could use here (see Section 3.1).

You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on google books.

0 讨论(0)
发布评论:

提交评论
- 加载中...

情话喂你

2020-12-15 12:23

suppose you are given sorted array A with n entries (i=1,2,3,...,n)

Algo(A(i))
{
  while i<>n
  {
    temp=A[i];
    if A[i]<>A[i+1] then
    {     
      temp=A[i+1];
      i=i+1;
      Algo(A[i])
    }
    else if A[i]==A[i+1] then
      mark A[i] and A[i+1] as duplicates
  }
}

This algo runs at O(n) time.

0 讨论(0)

囚心锁ツ

2020-12-15 12:29
In theory
- A suffix array is the 'best' answer since it can be implemented to use linear space and time to detect any duplicate substrings. However - the naive implementation actually takes time O(n^2 log n) to sort the suffixes, and it's not completely obvious how to reduce this down to O(n log n), let alone O(n), although you can read the related papers if you want to.
- A suffix tree can take slightly more memory (still linear, though) than a suffix array, but is easier to implement to build quickly since you can use something like a radix sort idea as you add things to the tree (see the wikipedia link from the name for details).
- The KMP algorithm is also good to be aware of, which is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP and no need to bother building an index of suffices first.
In practice

I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.

In this case, you might just want to do a quick n-gram analysis for some small n, such as just n=2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation, capitalization, and stemming words (running, runs both -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in python, etc) of each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code, and not crazy slow to run.
0 讨论(0)
发布评论:

提交评论
- 加载中...