Longest equally-spaced subsequence

遥遥无期 2020-12-22 19:12

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example



        
10 answers
  • 2020-12-22 19:39

    This is my 2 cents.

    If you have a list called input:

    input = [1, 4, 5, 7, 8, 12]
    

    You can build a data structure that, for each of these points (excluding the first one), tells you how far that point is from each of its predecessors:

    [1, 4, 5, 7, 8, 12]
     x  3  4  6  7  11   # distance from point i to point 0
     x  x  1  3  4   8   # distance from point i to point 1
     x  x  x  2  3   7   # distance from point i to point 2
     x  x  x  x  1   5   # distance from point i to point 3
     x  x  x  x  x   4   # distance from point i to point 4
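
    As a minimal sketch of how such a table could be built (Python 3; the names `distance_rows` and `values` are illustrative and not from the answer), row k holds the distance from every later point i to point k:

```python
def distance_rows(values):
    """Row k lists values[i] - values[k] for every i > k (None elsewhere)."""
    rows = []
    for k in range(len(values) - 1):
        row = [None] * len(values)
        for i in range(k + 1, len(values)):
            row[i] = values[i] - values[k]
        rows.append(row)
    return rows


values = [1, 4, 5, 7, 8, 12]
rows = distance_rows(values)
for k, row in enumerate(rows):
    # Print "x" where no distance is defined, as in the table above.
    cells = ["x" if d is None else str(d) for d in row]
    print("  ".join(cells), "  # distance from point i to point", k)
```

    Note that the table has n * (n - 1) / 2 entries, so this sketch only suits small inputs.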
    

    Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.

    The numbers that belong to an equally spaced series containing input[i] are those whose column holds n * j at position i, where j is the number of matches found so far while scanning the columns from left to right. The series also contains the k-th predecessor of input[i], where k is the index of n in the column of input[i].

    Example: if we consider i = 1, input[i] = 4, n = 3, then we can identify a sequence comprising 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1 (because k is 0, so we take the first predecessor, input[0]).

    Possible implementation (sorry if the code is not using the same notation as the explanation):

    def build_columns(l):
        # columns[x] lists the distance from x to each of its predecessors
        columns = {}
        for i, x in enumerate(l[1:], start=1):
            columns[x] = [x - y for y in l[:i]]
        return columns
    
    def algo(input, columns):
        seqs = []
        for index1, number in enumerate(input[1:]):
            index1 += 1 #first item was sliced
            for index2, distance in enumerate(columns[number]):
                seq = []
                seq.append(input[index2]) # k-th pred
                seq.append(number)
                matches = 1
                for successor in input[index1 + 1 :]:
                    column = columns[successor]
                    if column[index1] == distance * matches:
                        matches += 1
                        seq.append(successor)
                if (len(seq) > 2):
                    seqs.append(seq)
        return seqs
    

    The longest one:

    sequences = algo(input, build_columns(input))
    print(max(sequences, key=len))
    
  • 2020-12-22 19:44

    We can get a solution that runs in O(n*m) time with very little memory by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.

    Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.

    A = [1, 4, 5, 7, 8, 12]    # in sorted order
    Aset = set(A)
    
    for d in range(1, A[-1] - A[0] + 1):    # every possible step, 1 .. m
        already_seen = set()
        for a in A:
            if a not in already_seen:
                b = a
                count = 1
                while b + d in Aset:
                    b += d
                    count += 1
                    already_seen.add(b)
                print("found %d items in %d .. %d" % (count, a, b))
                # collect here the largest 'count'
    

    Updates:

    • This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximate, but actually runnable for n = 1,000,000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10,000,000.)

    • If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
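
    The early stop from the last point could look roughly like this. It is a sketch in Python 3, not the author's code; the function name `longest_run_bounded` and the exact stopping test are assumptions. A run with more than `best` items and step d spans at least d * best, so once d * best exceeds the range m, no longer run can exist.

```python
def longest_run_bounded(A):
    """Length of the longest equally spaced run in the sorted, distinct list A."""
    if len(A) < 2:
        return len(A)
    Aset = set(A)
    m = A[-1] - A[0]          # range of the input
    best = 1
    d = 1
    # A run of best + 1 items with step d needs d * best <= m.
    while d * best <= m:
        already_seen = set()
        for a in A:
            if a in already_seen:
                continue
            b, count = a, 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            best = max(best, count)
        d += 1
    return best


print(longest_run_bounded([1, 4, 5, 7, 8, 12]))
```

    The saving depends on the data: if no long run exists, `best` stays small and almost the whole range of d is still scanned.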

  • 2020-12-22 19:48

    Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.

    The idea is quite naive: like the original poster's approach, it is greedy and just checks how far you can extend a subsequence from each pair of points, but it first checks that the pair is at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.

    The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercise to the reader :-) The time is competitive with other O(n^2) solutions here.

    A = [1, 4, 5, 7, 8, 12]    # in sorted order
    Aset = set(A)
    
    lmax = 2
    for j, b in enumerate(A):
        for i in range(j):
            a = A[i]
            step = b - a
            if b + step in Aset and a - step not in Aset:
                c = b + step
                count = 3
                while c + step in Aset:
                    c += step
                    count += 1
                #print("found %d items in %d .. %d" % (count, a, c))
                if count > lmax:
                    lmax = count
    
    print(lmax)
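
    To get the subsequence itself rather than only its length, the same loop can remember the first element and the step of the best run. This is a sketch in Python 3 under the same assumptions (sorted, distinct integers); the function name `longest_equally_spaced` is made up for illustration.

```python
def longest_equally_spaced(A):
    """Return one longest equally spaced subsequence of the sorted, distinct list A."""
    if len(A) <= 2:
        return list(A)
    Aset = set(A)
    # Any two points form a run of length 2, so start from the first pair.
    best_start, best_step, best_count = A[0], A[1] - A[0], 2
    for j, b in enumerate(A):
        for i in range(j):
            a = A[i]
            step = b - a
            # Extend only from the true start of a run with at least 3 items.
            if b + step in Aset and a - step not in Aset:
                c = b + step
                count = 3
                while c + step in Aset:
                    c += step
                    count += 1
                if count > best_count:
                    best_start, best_step, best_count = a, step, count
    return [best_start + k * best_step for k in range(best_count)]


print(longest_equally_spaced([1, 4, 5, 7, 8, 12]))
```

    For the sample list this returns 1, 4, 7; the run 4, 8, 12 is equally long but is found later, and ties keep the first run found.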
    
  • 2020-12-22 19:52

    UPDATE: I've found a paper on this problem, you can download it here.

    Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.

    We assume all numbers are stored in array a in ascending order, and n is its length. The 2D array entry l[i][j] is the length of the longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).

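    a = [1, 4, 5, 7, 8, 12]    # example input, in ascending order
    n = len(a)
    
    lmax = 2
    l = [[2 for i in range(n)] for j in range(n)]
    for mid in range(n - 1):
        # two pointers: prev moves left and succ moves right around mid
        prev = mid - 1
        succ = mid + 1
        while prev >= 0 and succ < n:
            if a[prev] + a[succ] < a[mid] * 2:
                succ += 1
            elif a[prev] + a[succ] > a[mid] * 2:
                prev -= 1
            else:
                # a[prev], a[mid], a[succ] are equally spaced
                l[mid][succ] = l[prev][mid] + 1
                lmax = max(lmax, l[mid][succ])
                prev -= 1
                succ += 1
    
    print(lmax)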
    
  • 2020-12-22 19:53

    Algorithm

    • The main loop traverses the list.
    • If the current number is found in the precalculated list, it belongs to every sequence stored there for it; update each of those sequences with count + 1.
    • Remove all precalculated entries for the current element.
    • Compute new sequences whose first element comes from the range 0 to current and whose second element is the current element of the traversal. (Actually not the whole range from 0 to current: we can use the fact that the next element must not be more than max(a), and that a new sequence must still be able to become longer than the best one already found.)

    So for the list [1, 2, 4, 5, 7] the output would be as follows (it's a little messy; try the code yourself and see):

    • index 0, element 1:
      • if 1 in precalc? No - do nothing
      • Do nothing
    • index 1, element 2:
      • if 2 in precalc? No - do nothing
      • check if 3 = 1 + (2 - 1) * 2 in our set? No - do nothing
    • index 2, element 4:
      • if 4 in precalc? No - do nothing
        • check if 6 = 2 + (4 - 2) * 2 in our set? No
        • check if 7 = 1 + (4 - 1) * 2 in our set? Yes - add new element {7: {3: {'count': 2, 'start': 1}}} 7 - element of the list, 3 is step.
    • index 3, element 5:
      • if 5 in precalc? No - do nothing
        • do not check 4 because 6 = 4 + (5 - 4) * 2 is less than the already calculated element 7
        • check if 8 = 2 + (5 - 2) * 2 in our set? No
        • check 9 = 1 + (5 - 1) * 2 - more than max(a) == 7
    • index 4, element 7:
      • if 7 in precalc? Yes - put it into result
        • do not check 5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7

    result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence
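
    Turning that result into the actual sequence takes one line. A minimal sketch in Python 3, assuming the (step, {'count', 'start'}) layout shown above; the variable names are illustrative:

```python
result = (3, {"count": 3, "start": 1})   # (step, {"count": ..., "start": ...})

step, info = result
# Walk `count` items from `start` in increments of `step`.
sequence = [info["start"] + step * k for k in range(info["count"])]
print(sequence)
```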

    Complexity

    It shouldn't be more than O(N^2), and I think it's less because the search for new sequences terminates early. I'll try to provide a detailed analysis later.

    Code

    def add_precalc(precalc, start, step, count, res, N):
        if step == 0: return True
        if start + step * res[1]["count"] > N: return False
    
        x = start + step * count
        if x > N or x < 0: return False
    
        if precalc[x] is None: return True
    
        if step not in precalc[x]:
            precalc[x][step] = {"start":start, "count":count}
    
        return True
    
    def work(a):
        precalc = [None] * (max(a) + 1)
        for x in a: precalc[x] = {}
        N, m = max(a), 0
        ind = {x:i for i, x in enumerate(a)}
    
        res = (0, {"start":0, "count":0})
        for i, x in enumerate(a):
            for el in precalc[x].items():
                el[1]["count"] += 1
                if el[1]["count"] > res[1]["count"]: res = el
                add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
                t = el[1]["start"] + el[0] * el[1]["count"]
                if t in ind and ind[t] > m:
                    m = ind[t]
            precalc[x] = None
    
            for y in a[i - m - 1::-1]:
                if not add_precalc(precalc, y, x - y, 2, res, N): break
    
        return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]
    
  • 2020-12-22 19:53

    This is a particular case of the more generic problem described here: Discover long patterns, where K=1 and is fixed. It is demonstrated there that it can be solved in O(N^2). Running my implementation of the C algorithm proposed there takes 3 seconds to find the solution for N=20000 and M=28000 on my 32-bit machine.
