Is this algorithm linear?

耶瑟儿~ 2020-12-28 13:13

Inspired by these two questions: String manipulation: calculate the "similarity of a string with its suffixes" and Program execution varies as the I/P size increas

2 answers
  • 2020-12-28 13:48

    You might want to have a look at the Z-algorithm, which is provably linear:

    s is a C-string of length N

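    // Z[i] = length of the longest common prefix of s and its suffix starting at i.
    Z[0] = N;
    int a = 0, b = 0;  // [a, b) is the rightmost segment known to match a prefix of s
    for (int i = 1; i < N; ++i)
    {
      // Reuse what is already known inside [a, b); otherwise start from zero.
      int k = i < b ? min(b - i, Z[i - a]) : 0;
      // Extend the match by direct comparison.
      while (i + k < N && s[i + k] == s[k]) ++k;
      Z[i] = k;
      if (i + k > b) { a = i; b = i + k; }
    }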
    

    Now similarity is just the sum of entries of Z.
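
For readers who want to try this quickly, here is a minimal Python port of the loop above. It is only a sketch: the function names `z_function` and `similarity` are my own, and "similarity" is assumed to mean the sum of the common-prefix lengths of the string with each of its suffixes, as in the linked question.

```python
def z_function(s):
    """Return Z, where Z[i] is the length of the longest common prefix
    of s and s[i:]. By convention Z[0] == len(s)."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    a = b = 0  # s[a:b] is the rightmost segment known to match a prefix of s
    for i in range(1, n):
        # Reuse information from inside [a, b) when i falls in that window.
        k = min(b - i, z[i - a]) if i < b else 0
        # Extend the match by direct character comparison.
        while i + k < n and s[i + k] == s[k]:
            k += 1
        z[i] = k
        if i + k > b:
            a, b = i, i + k
    return z


def similarity(s):
    """Sum of the similarities of s with all of its suffixes (assumed definition)."""
    return sum(z_function(s))


print(similarity("ababaa"))
```

The window `[a, b)` only ever moves to the right, and every successful character comparison pushes `b` forward, which is where the linear bound comes from.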

  • 2020-12-28 14:07

    This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).

    Here is my attempt at a counterexample. (I'm not a mathematician so please forgive my use of Python instead of equations to express my ideas!)

    Consider the string with 4K+1 symbols

    s = 'a'*K+'X'+'a'*3*K
    

    This will have

    borders[1:] = list(range(K))*2+[K]*(2*K+1)
    
    ne_borders[1:] = [-1]*(K-1)+[K-1]+[-1]*K+[K]*(2*K+1)
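
The `borders` values can be checked with a short script. This is a sketch under an assumption: it takes `borders` to be the standard KMP border (failure) table with `borders[0] = -1`, which matches the numbers quoted in this answer. The names `border_table` and `counterexample` are mine, and `ne_borders` is not reproduced because its definition is in the question's code, not here.

```python
def border_table(s):
    """borders[i] = length of the longest proper border of s[:i],
    with borders[0] = -1 (assumed to be the table the question builds)."""
    n = len(s)
    borders = [-1] * (n + 1)
    j = -1
    for i in range(n):
        # Fall back through shorter borders until s[i] extends one.
        while j >= 0 and s[i] != s[j]:
            j = borders[j]
        j += 1
        borders[i + 1] = j
    return borders


def counterexample(K):
    """The string of 4K+1 symbols described above."""
    return 'a' * K + 'X' + 'a' * (3 * K)


K = 4
expected = list(range(K)) * 2 + [K] * (2 * K + 1)
print(border_table(counterexample(K))[1:] == expected)
```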
    

    Note that:

    1) ne_borders[i] will equal K for (2K+1) values of i.

    2) for 0<=j<=K, borders[j]=j-1

    3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i

    4) the inner loop will iterate K times to reduce j to 0

    5) Together, (2K+1) entries into the inner loop at about K iterations each give roughly 2*K*K iterations, so with N = 4K+1 the algorithm needs more than N*N/8 operations on a worst-case string of length N.

    For example, for K=4 it goes round the inner loop 39 times:

    s = 'aaaaXaaaaaaaaaaaa'
    borders[1:] = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
    ne_borders[1:] = [-1, -1, -1, 3, -1, -1, -1, -1, 4, 4, 4, 4, 4, 4, 4, 4, 4]
    

    For K=2,248 it goes round the inner loop 10,111,503 times!

    Perhaps there is a way to fix the algorithm for this case?
