Rank the suffixes of a list

后端 未结 2 1388
南方客
南方客 2020-12-06 03:27

ranking an element x in an array/list is just to find out how many elements in the array/list that strictly smaller than x.

So ranking

相关标签:
2条回答
  • 2020-12-06 04:05

    As MBo noted correctly, your problem is that of constructing the suffix array of your input list. The fast and complicated algorithms to do this are actually linear time, but since you only aim for O(n log n), I will try to propose a simpler version that is much easier to implement.

    Basic idea and an initial O(n log² n) implementation

    Let's take the sequence [4, 2, 2, 1] as an example. Its suffixes are

    0: 4 2 2 1
    1: 2 2 1
    2: 2 1
    3: 1
    

    I numbered the suffixes with their starting index in the original sequence. Ultimately we want to sort this set of suffixes lexicographically, and fast. We know we can represent each suffix using its starting index in constant space and we can sort in O(n log n) comparisons using merge sort, heap sort or a similar algorithm. So the question remains, how can we compare two suffixes fast?

    Let's say we want to compare the suffixes [2, 2, 1] and [2, 1]. We can pad those with negative infinity values changing the result of the comparison: [2, 2, 1, -∞] and [2, 1, -∞, -∞].

    Now the key idea here is the following divide-and-conquer observation: Instead of comparing the sequences character by character until we find a position where the two differ, we can instead split both lists in half and compare the halves lexicographically:

         [a, b, c, d]     < [e, f, g, h] 
     <=> ([a, b], [c, d]) < ([e, f], [g, h])
     <=> [a, b] < [e, f] or ([a, b,] = [e, f] and [c, d] < [g, h])
    

    Essentially we have decomposed the problem of comparing the sequences into two problems of comparing smaller sequences. This leads to the following algorithm:

    Step 1: Sort the substrings (contiguous subsequences) of length 1. In our example, the substrings of length 1 are [4], [2], [2], [1]. Every substring can be represented by the starting position in the original list. We sort them by a simple comparison sort and get [1], [2], [2], [4]. We store the result by assigning to every position it's rank in the sorted lists of lists:

    position   substring   rank
    0          [4]         2
    1          [2]         1
    2          [2]         1
    3          [1]         0
    

    It is important that we assign the same rank to equal substrings!

    Step 2: Now we want to sort the substrings of length 2. The are only really 3 such substrings, but we assign one to every position by padding with negative infinity if necessary. The trick here is that we can use our divide-and-conquer idea from above and the ranks assigned in step 1 to do a fast comparison (this isn't really necessary yet but will become important later).

    position  substring    halves        ranks from step 1   final rank
    0         [4,  2]      ([4], [2])    (2,  1)             3               
    1         [2,  2]      ([2], [2])    (1,  1)             2
    2         [2,  1]      ([2], [2])    (1,  0)             1
    3         [1, -∞]      ([1], [-∞])   (0, -∞)             0
    

    Step 3: You guessed it, now we sort substrings of length 4 (!). These are exactly the suffixes of the list! We can use the divide-and-conquer trick and the results from step 2 this time:

    position  substring         halves              ranks from step 2   final rank
    0         [4,  2,  2,  1]   ([4, 2], [2,  1])   (3,  1)             3
    1         [2,  2,  1, -∞]   ([2, 2], [1, -∞])   (2,  0)             2
    2         [2,  1, -∞, -∞]   ([2, 1], [-∞,-∞])   (1, -∞)             1
    3         [1, -∞, -∞, -∞]   ([1,-∞], [-∞,-∞])   (0, -∞)             0
    

    We're done! If our initial sequence would have had size 2^k, we would have needed k steps. Or put the other way round, we need log_2 n steps to process a sequence of size n. If its length is not a power of two, we just pad with negative infinity.

    For an actual implementation we just need to remember the sequence "final rank" for every step of the algorithm.

    An implementation in C++ could look like this (compile with -std=c++11):

    #include <algorithm>
    #include <iostream>
    using namespace std;
    
    int seq[] = {8, 3, 2, 4, 2, 2, 1};
    const int n = 7;
    const int log2n = 3;       // log2n = ceil(log_2(n))
    int Rank[log2n + 1][n];    // Rank[i] will save the final Ranks of step i
    tuple<int, int, int> L[n]; // L is a list of tuples. in step i,
                               // this will hold pairs of Ranks from step i - 1
                               // along with the substring index
    const int neginf = -1;     // should be smaller than all the numbers in seq
    int main() {
      for (int i = 0; i < n; ++i)
        Rank[1][i] = seq[i]; // step 1 is actually simple if you think about it
      for (int step = 2; step <= log2n; ++step) {
        int length = 1 << (step - 1); // length is 2^(step - 1)
        for (int i = 0; i < n; ++i)
          L[i] = make_tuple(
                    Rank[step - 1][i],
                    (i + length / 2 < n) ? Rank[step - 1][i + length / 2] : neginf,
                    i); // we need to know where the tuple came from later
        sort(L, L + n); // lexicographical sort
        for (int i = 0; i < n; ++i) {
          // we save the rank of the index, but we need to be careful to
          // assign equal ranks to equal pairs
          Rank[step][get<2>(L[i])] = (i > 0 && get<0>(L[i]) == get<0>(L[i - 1])
                                            && get<1>(L[i]) == get<1>(L[i - 1]))
                                        ? Rank[step][get<2>(L[i - 1])] 
                                        : i;
        }
      }
      // the suffix array is in L after the last step
      for (int i = 0; i < n; ++i) {
        int start = get<2>(L[i]);
        cout << start << ":";
        for (int j = start; j < n; ++j)
          cout << " " << seq[j];
        cout << endl;
      }
    }
    

    Output:

    6: 1
    5: 2 1
    4: 2 2 1
    2: 2 4 2 2 1
    1: 3 2 4 2 2 1
    3: 4 2 2 1
    0: 8 3 2 4 2 2 1
    

    The complexity is O(log n * (n + sort)), which is O(n log² n) in this implementation because we use a comparison sort of complexity O(n log n)

    A simple O(n log n) algorithm

    If we manage to do the sorting parts in O(n) per step, we get a O(n log n) bound. So basically we have to sort a sequence of pairs (x, y), where 0 <= x, y < n. We know that we can sort a sequence of integers in the given range in O(n) time using counting sort. We can intepret our pairs (x, y) as numbers z = n * x + y in base n. We can now see how to use LSD radix sort to sort the pairs. In practice, this means we sort the pairs by increasing y using counting sort, and then use counting sort again to sort by increasing x. Since counting sort is stable, this gives us the lexicographical order of our pairs in 2 * O(n) = O(n). The final complexity is thus O(n log n).

    In case you are interested, you can find an O(n log² n) implementation of the approach at my Github repo. The implementation has 27 lines of code. Neat, ain't it?

    0 讨论(0)
  • 2020-12-06 04:26

    This is exactly suffix array construction problem, and wiki page contains links to the linear-complexity algorithms (probably, depending on alphabet)

    0 讨论(0)
提交回复
热议问题