Longest equally-spaced subsequence

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example

10 answers

死守一世寂寞

2020-12-22 19:39

This is my 2 cents.

If you have a list called input:

input = [1, 4, 5, 7, 8, 12]

You can build a data structure that, for each of these points (excluding the first one), tells you how far that point is from each of its predecessors:

[1, 4, 5, 7, 8, 12]
 x  3  4  6  7  11   # distance from point i to point 0
 x  x  1  3  4   8   # distance from point i to point 1
 x  x  x  2  3   7   # distance from point i to point 2
 x  x  x  x  1   5   # distance from point i to point 3
 x  x  x  x  x   4   # distance from point i to point 4

Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.

The numbers that belong to a series of equidistant numbers that includes input[i] are those that have n * j in the i-th position of their column, where j is the number of matches already found when moving through the columns from left to right, plus the k-th predecessor of input[i], where k is the index of n in the column of input[i].

Example: if we consider i = 1, input[i] = 4, n = 3, then we can identify a sequence comprising 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1, because k is 0, so we take input[0], the first predecessor of input[i].

Possible implementation (sorry if the code is not using the same notation as the explanation):

def build_columns(l):
    # columns[x][k] is the distance from x back to l[k], for every l[k] before x
    columns = {}
    for i, x in enumerate(l[1:], 1):
        columns[x] = [x - y for y in l[:i]]
    return columns

def algo(input, columns):
    seqs = []
    for index1, number in enumerate(input[1:], 1):
        for index2, distance in enumerate(columns[number]):
            # start from the pair (k-th predecessor, number)
            seq = [input[index2], number]
            matches = 1
            for successor in input[index1 + 1:]:
                # successor extends the run if it lies matches * distance after number
                if columns[successor][index1] == distance * matches:
                    matches += 1
                    seq.append(successor)
            if len(seq) > 2:
                seqs.append(seq)
    return seqs

The longest one:

sequences = algo(input, build_columns(input))
print(max(sequences, key=len))

孤街浪徒

2020-12-22 19:44
We can get a solution that runs in O(n*m) time and needs very little memory by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.

Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.
```
A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)              # constant-time membership tests

best = 0
for d in range(1, A[-1] - A[0] + 1):    # every possible step, up to the range m
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            print("found %d items in %d .. %d" % (count, a, b))
            best = max(best, count)    # collect here the largest 'count'
```
Updates:
- This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximate, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10,000,000.)
- If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
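
A minimal sketch of the two updates above, assuming the input is a sorted list of distinct integers. The function name `longest_run_by_step` and the `max_step` parameter are illustrative and not part of the original answer. The early exit relies on the fact that a run with step d cannot hold more than (max - min) // d + 1 items, so once that bound no longer beats the best run found, larger steps cannot help.
```python
def longest_run_by_step(A, max_step=None):
    """Return (count, start, step) for the longest equally spaced run in sorted A.

    max_step is a hypothetical knob: when given, only steps up to it are tried.
    """
    if len(A) < 2:
        return (len(A), A[0] if A else None, 0)
    members = set(A)
    span = A[-1] - A[0]
    limit = span if max_step is None else min(span, max_step)
    best = (1, A[0], 0)
    for d in range(1, limit + 1):
        # A run with step d fits at most span // d + 1 items in the range.
        if span // d + 1 <= best[0]:
            break  # larger steps can only give shorter runs
        seen = set()
        for a in A:
            if a in seen:
                continue
            b, count = a, 1
            while b + d in members:
                b += d
                count += 1
                seen.add(b)
            if count > best[0]:
                best = (count, a, d)
    return best


print(longest_run_by_step([1, 4, 5, 7, 8, 12]))
```
Passing something like `max_step=1000` would give the approximate variant described in the first update.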

甜味超标

2020-12-22 19:48
Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.

The idea is quite naive: like the original poster's approach, it is greedy and just checks how far you can extend a subsequence from each pair of points, but only after checking that we're at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.

The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercise to the reader :-) The time is competitive with other O(n^2) solutions here.
```
A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

lmax = 2
for j, b in enumerate(A):
    for i in range(j):
        a = A[i]
        step = b - a
        # extend only if a third item exists and a is the start of the subsequence
        if b + step in Aset and a - step not in Aset:
            c = b + step
            count = 3
            while c + step in Aset:
                c += step
                count += 1
            #print("found %d items in %d .. %d" % (count, a, c))
            if count > lmax:
                lmax = count

print(lmax)
```
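The loop above only reports the length. As a hedged sketch, the same scan can also remember where the best run starts and rebuild it afterwards. The function name `longest_equally_spaced` and the fallback for inputs with fewer than three items are assumptions added for illustration, and the input is assumed to be sorted and free of duplicates.
```python
def longest_equally_spaced(A):
    """Return the longest equally spaced subsequence of the sorted, distinct list A."""
    if len(A) < 3:
        return list(A)
    members = set(A)
    # Any two items form a run of length 2, so start from the first pair.
    best_count, best_start, best_step = 2, A[0], A[1] - A[0]
    for j, b in enumerate(A):
        for i in range(j):
            a = A[i]
            step = b - a
            # Only extend from (a, b) when a really starts the subsequence.
            if b + step in members and a - step not in members:
                c = b + step
                count = 3
                while c + step in members:
                    c += step
                    count += 1
                if count > best_count:
                    best_count, best_start, best_step = count, a, step
    return [best_start + k * best_step for k in range(best_count)]


print(longest_equally_spaced([1, 4, 5, 7, 8, 12]))
```
When several runs tie for the longest length, this sketch keeps the first one it meets.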

醉酒成梦

2020-12-22 19:52
UPDATE: I've found a paper on this problem, you can download it here.

Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.

We assume all numbers are stored in array a in ascending order, and n is its length. The 2D array l[i][j] holds the length of the longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).
```
a = [1, 4, 5, 7, 8, 12]    # example input, in ascending order
n = len(a)

lmax = 2
l = [[2 for i in range(n)] for j in range(n)]
for mid in range(n - 1):
    # look for pairs (prev, succ) that are symmetric around a[mid]
    prev = mid - 1
    succ = mid + 1
    while (prev >= 0 and succ < n):
        if a[prev] + a[succ] < a[mid] * 2:
            succ += 1
        elif a[prev] + a[succ] > a[mid] * 2:
            prev -= 1
        else:
            l[mid][succ] = l[prev][mid] + 1
            lmax = max(lmax, l[mid][succ])
            prev -= 1
            succ += 1

print(lmax)
```
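For comparison, here is a sketch of the same recurrence written with one dictionary per index instead of a 2D array. This is a different formulation from the code above: it does use hashing, and it keeps the same O(n^2) time and space bounds. The name `longest_equally_spaced_dp` is illustrative, and the input is assumed to be sorted with no duplicates.
```python
def longest_equally_spaced_dp(a):
    """Length of the longest equally spaced subsequence of the sorted, distinct list a."""
    n = len(a)
    if n < 3:
        return n
    # ending[j][step] = length of the longest run with that step ending at a[j]
    ending = [{} for _ in range(n)]
    lmax = 2
    for j in range(n):
        for i in range(j):
            step = a[j] - a[i]
            # Extend the run ending at a[i] with the same step, or start a new pair.
            length = ending[i].get(step, 1) + 1
            ending[j][step] = length
            lmax = max(lmax, length)
    return lmax


print(longest_equally_spaced_dp([1, 4, 5, 7, 8, 12]))
```
Here `ending[i].get(step, 1)` plays the role of l[i][j] in the recurrence: if no run with that step ends at a[i], the run consists of a[i] alone.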

故里飘歌

2020-12-22 19:53
Algorithm
- The main loop traverses the list
- If the number is found in the precalculated list, then it belongs to all the sequences stored there for it; update each of those sequences with count + 1
- Remove all precalculated entries for the current element
- Calculate new sequences whose first element comes from the range 0 to current and whose second element is the current element of the traversal (actually, not the whole range from 0 to current: we can use the fact that the new element must not exceed max(a), and that the new sequence must still be able to become longer than the best one already found)
So for the list [1, 2, 4, 5, 7] the output would be (it's a little messy, try the code yourself and see)
- index 0, element 1:
  - if 1 in precalc? No - do nothing
  - Do nothing
- index 1, element 2:
  - if 2 in precalc? No - do nothing
  - check if 3 = 1 + (2 - 1) * 2 in our set? No - do nothing
- index 2, element 4:
  - if 4 in precalc? No - do nothing
    - check if 6 = 2 + (4 - 2) * 2 in our set? No
    - check if 7 = 1 + (4 - 1) * 2 in our set? Yes - add new element {7: {3: {'count': 2, 'start': 1}}} 7 - element of the list, 3 is step.
- index 3, element 5:
  - if 5 in precalc? No - do nothing
    - do not check 4 because 6 = 4 + (5 - 4) * 2 is less than the already calculated element 7
    - check if 8 = 2 + (5 - 2) * 2 in our set? No
    - check 9 = 1 + (5 - 1) * 2 - more than max(a) == 7
- index 4, element 7:
  - if 7 in precalc? Yes - put it into result
    - do not check 5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7
result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence

Complexity

It shouldn't be more than O(N^2), and I think it's less because the search for new sequences terminates early. I'll try to provide a detailed analysis later.

Code
```
def add_precalc(precalc, start, step, count, res, N):
    if step == 0: return True
    if start + step * res[1]["count"] > N: return False

    x = start + step * count
    if x > N or x < 0: return False

    if precalc[x] is None: return True

    if step not in precalc[x]:
        precalc[x][step] = {"start":start, "count":count}

    return True

def work(a):
    precalc = [None] * (max(a) + 1)
    for x in a: precalc[x] = {}
    N, m = max(a), 0
    ind = {x:i for i, x in enumerate(a)}

    res = (0, {"start":0, "count":0})
    for i, x in enumerate(a):
        for el in list(precalc[x].items()):
            el[1]["count"] += 1
            if el[1]["count"] > res[1]["count"]: res = el
            add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
            t = el[1]["start"] + el[0] * el[1]["count"]
            if t in ind and ind[t] > m:
                m = ind[t]
        precalc[x] = None

        for y in a[i - m - 1::-1]:
            if not add_precalc(precalc, y, x - y, 2, res, N): break

    return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]
```

终归单人心

2020-12-22 19:53

This is a particular case of the more generic problem described here: Discover long patterns, where K=1 and is fixed. It is demonstrated there that it can be solved in O(N^2). Running my implementation of the C algorithm proposed there, it takes 3 seconds to find the solution for N=20000 and M=28000 on my 32-bit machine.
