suffix-array | 易学教程

Complete Suffix Array

Question: A suffix array indexes all the suffixes of a given list of strings, but what if you're trying to index all the possible unique substrings? I'm a bit new at this, so here's an example of what I mean. Given the string abcd, a suffix array indexes (at least to my understanding) (abcd, bcd, cd, d). I would like to index all the substrings: (abcd, bcd, cd, d, abc, bc, c, ab, b, a). Is a suffix array what I'm looking for? If so, what do I do to get all the substrings indexed? If not, where should I be
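One way to see the connection between the two lists in the question: every substring is a prefix of some suffix, so a suffix array already covers all substrings implicitly. The sketch below is only an illustration under that observation, not an answer taken from the original thread. The function names are hypothetical, the construction is the naive "sort the suffixes" approach (fine for short strings, too slow for large inputs), and it uses the adjacent-suffix LCP values to skip prefixes that were already listed.

```python
def suffix_array(s):
    # Naive construction: sort the start positions by the suffix that begins there.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_adjacent(s, sa):
    # lcp[k] = length of the common prefix of the suffixes at sa[k-1] and sa[k];
    # lcp[0] is 0 because the first suffix has no predecessor.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b, h = sa[k - 1], sa[k], 0
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1
        lcp[k] = h
    return lcp


def distinct_substrings(s):
    # Each suffix contributes its prefixes; the first lcp[k] of them were
    # already produced by the previous suffix in sorted order, so skip those.
    sa = suffix_array(s)
    lcp = lcp_adjacent(s, sa)
    out = []
    for k, start in enumerate(sa):
        for length in range(lcp[k] + 1, len(s) - start + 1):
            out.append(s[start:start + length])
    return out


print(distinct_substrings("abcd"))
# ['a', 'ab', 'abc', 'abcd', 'b', 'bc', 'bcd', 'c', 'cd', 'd']
```

For a string of length n, the number of distinct substrings is n(n+1)/2 minus the sum of the LCP values, so in practice the substrings rarely need to be materialized at all.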

How does LCP help in finding the number of occurrences of a pattern?

I have read that the Longest Common Prefix (LCP) array can be used to find the number of occurrences of a pattern in a string. Specifically, you create the suffix array of the text, sort it, and then, instead of doing a binary search to find the range that gives the number of occurrences, you simply compute the LCP for each successive entry in the suffix array. Although using binary search to find the number of occurrences of a pattern is obvious, I can't figure out how the LCP helps find the number of occurrences here. For example, for this suffix array for banana: LCP
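A minimal sketch of one common reading of this idea, offered as an assumption rather than as the method the asker had read about: locate the first suffix that starts with the pattern, then walk down the suffix array while the LCP with the previous entry is at least the pattern length. Every such entry must also start with the pattern. The helper names are hypothetical and the suffix array is built naively.

```python
def suffix_array(s):
    # Naive construction, adequate for a small illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_adjacent(s, sa):
    # lcp[k] = common prefix length of the suffixes at sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b, h = sa[k - 1], sa[k], 0
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1
        lcp[k] = h
    return lcp


def count_occurrences(s, pattern):
    sa = suffix_array(s)
    lcp = lcp_adjacent(s, sa)
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(sa) or s[sa[lo]:sa[lo] + m] != pattern:
        return 0
    # Each following suffix that shares at least m characters with its
    # predecessor also begins with the pattern.
    count = 1
    k = lo + 1
    while k < len(sa) and lcp[k] >= m:
        count += 1
        k += 1
    return count


print(count_occurrences("banana", "ana"))  # 2
```

For banana the suffix array is [5, 3, 1, 0, 4, 2] and the adjacent LCP values are [0, 1, 3, 0, 0, 2], so the walk for "ana" starts at the second entry, accepts the third (LCP 3), and stops at the fourth (LCP 0).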

Suffix Array Algorithm

After quite a bit of reading, I have figured out what a suffix array and an LCP array represent.

Suffix array: represents the lexicographic rank of each suffix of an array.

LCP array: contains the maximum-length prefix match between two consecutive suffixes, after they are sorted lexicographically.

I have been trying hard for a couple of days to understand how exactly the suffix array and LCP algorithm work. Here is the code, which is taken from Codeforces:

    /*
    Suffix array O(n lg^2 n)
    LCP table O(n)
    */
    #include <cstdio>
    #include <algorithm>
    #include <cstring>
    using namespace std;
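The Codeforces listing is cut off here, so the following is not a transcription of it. It is a separate Python sketch of the two techniques its header comment names, under the assumption that "O(n lg^2 n)" refers to prefix doubling with a comparison sort and "LCP table O(n)" to Kasai's algorithm. Note that the suffix array itself stores start positions in sorted order; the rank array is its inverse. All function names are hypothetical.

```python
def build_suffix_array(s):
    # Prefix doubling: sort suffixes by their first k characters, then use
    # those ranks to sort by the first 2k characters, and so on.
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    k = 1
    while True:
        def key(i):
            # Rank of the first k characters, then of the next k (-1 if past the end).
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:
            break  # all ranks are distinct, so the order is final
        k *= 2
    return sa


def kasai_lcp(s, sa):
    # lcp[r] = common prefix length of the suffixes at sa[r-1] and sa[r].
    n = len(s)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


sa = build_suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(kasai_lcp("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

The linear bound for Kasai's algorithm comes from the counter h: it drops by at most one per step and can never exceed n, so the inner while loop does O(n) work in total.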

What's the current state-of-the-art suffix array construction algorithm?

I'm looking for a fast suffix-array construction algorithm. I'm more interested in ease of implementation and raw speed than in asymptotic complexity. (I know that a suffix array can be constructed by means of a suffix tree in O(n) time, but that takes a lot of space; apparently other algorithms have bad worst-case big-O complexity but run quite fast in practice.) I don't mind algorithms that generate an LCP array as a by-product, since I need one anyway for my own purposes. I found a taxonomy of various suffix array construction algorithms, but it's out of date. I've heard of SA-IS, qsufsort,

Finding the longest repeated substring

Question: What would be the best approach (performance-wise) in solving this problem? I was recommended to use suffix trees. Is this the best approach?

Answer 1: Have a look at http://en.wikipedia.org/wiki/Suffix_array as well. Suffix arrays are quite space-efficient, and there are some reasonably programmable algorithms to produce them, such as "Simple Linear Work Suffix Array Construction" by Karkkainen and Sanders.

user1071840

Answer 2: Check out this link: http://introcs.cs.princeton.edu/java/42sort/LRS.java.html

    /*************************************************************************
     * Compilation: javac LRS.java
     * Execution: java
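The suffix-array route mentioned in the answers can be sketched briefly. This is an illustrative Python version written for this page, not a port of the linked LRS.java; the function name is hypothetical. It sorts the suffixes naively, which is simple but slow on large inputs, and relies on the fact that the longest repeated substring is the longest common prefix of some pair of neighbors in sorted order.

```python
def longest_repeated_substring(text):
    n = len(text)
    # Naive suffix array: start positions sorted by the suffix they begin.
    sa = sorted(range(n), key=lambda i: text[i:])
    best = ""
    for k in range(1, n):
        a, b, h = sa[k - 1], sa[k], 0
        # Common prefix length of two neighboring suffixes.
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        if h > len(best):
            best = text[a:a + h]
    return best


print(longest_repeated_substring("banana"))  # ana
```

The two occurrences may overlap, as they do for "ana" in "banana". For long texts, the naive sort would need to be replaced with a proper construction algorithm such as the linear-time one named in the first answer.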

Efficient way to find longest duplicate string for Python (From Programming Pearls)

From Section 15.2 of Programming Pearls. The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

When I implement it in Python using a suffix array:

    example = open("iliad10.txt").read()

    def comlen(p, q):
        i = 0
        for x in zip(p, q):
            if x[0] == x[1]:
                i += 1
            else:
                break
        return i

    suffix_list = []
    example_len = len(example)
    idx = list(range(example_len))
    idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

    max_len = -1
    for i in range(example_len - 1):
        this_len = comlen(example[idx[i]:], example[idx[i+1]:])
        print this_len
        if this_len > max_len:
            max_len =

Longest Common Substring

Question: We have two strings, a and b. The length of a is greater than or equal to the length of b. We have to find the longest common substring. If there are multiple answers, we have to output the substring that comes earliest in b (that is, the one whose starting index in b is smallest). Note: the lengths of a and b can be up to 10^6. I tried to find the longest common substring using a suffix array (sorting the suffixes using quicksort). For the case when there is more than one answer, I tried
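The question is cut off before the asker's tie-breaking attempt, so the following is only one possible sketch of the suffix-array approach it describes, not the asker's code. It joins the two strings with a separator, sorts the suffixes, and for each suffix of b records the longest match with the nearest suffix of a on either side in sorted order. Assumptions: the separator character "\x00" occurs in neither string, the function name is hypothetical, and the naive sort used here would not cope with lengths near 10^6, where a faster construction algorithm would be needed.

```python
def longest_common_substring(a, b):
    """Longest substring shared by a and b; ties go to the earliest start in b."""
    sep = "\x00"  # assumption: this character occurs in neither a nor b
    s = a + sep + b
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array

    # lcp[k] = common prefix length of the suffixes at sa[k-1] and sa[k].
    lcp = [0] * n
    for k in range(1, n):
        x, y, h = sa[k - 1], sa[k], 0
        while x + h < n and y + h < n and s[x + h] == s[y + h]:
            h += 1
        lcp[k] = h

    offset = len(a) + 1      # where b starts inside s
    best = [0] * len(b)      # best[j] = longest match of b[j:] with any part of a

    # Forward pass: match each b-suffix against the nearest a-suffix above it.
    run = 0                  # min LCP since the last a-suffix (0 if none yet)
    for k in range(n):
        if k > 0:
            run = min(run, lcp[k])
        if sa[k] < len(a):
            run = n          # reset; the next lcp value caps it
        elif sa[k] >= offset:
            j = sa[k] - offset
            best[j] = max(best[j], run)

    # Backward pass: same idea with the nearest a-suffix below.
    run = 0
    for k in range(n - 1, -1, -1):
        if k + 1 < n:
            run = min(run, lcp[k + 1])
        if sa[k] < len(a):
            run = n
        elif sa[k] >= offset:
            j = sa[k] - offset
            best[j] = max(best[j], run)

    length = max(best, default=0)
    if length == 0:
        return ""
    j = best.index(length)   # index() returns the smallest start in b
    return b[j:j + length]


print(longest_common_substring("xabcdy", "cdab"))  # cd
```

In the usage line, both "ab" and "cd" are common substrings of length 2, and "cd" is returned because it starts earlier in b. Tracking the best length per start position in b, rather than only the global maximum over adjacent pairs, is what makes the tie-break straightforward.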
