I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example
This is my 2 cents.
If you have a list called input:
input = [1, 4, 5, 7, 8, 12]
You can build a data structure that, for each of these points (excluding the first one), tells you how far that point is from each of its predecessors:
[1, 4, 5, 7, 8, 12]
 x  3  4  6  7  11   # distance from point i to point 0
 x  x  1  3  4  8    # distance from point i to point 1
 x  x  x  2  3  7    # distance from point i to point 2
 x  x  x  x  1  5    # distance from point i to point 3
 x  x  x  x  x  4    # distance from point i to point 4
Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.
The numbers that belong to a series of equidistant numbers that includes input[i] are those which have n * j in the i-th position of their column, where j is the number of matches already found when moving columns from left to right, plus the k-th predecessor of input[i], where k is the index of n in the column of input[i].
Example: if we consider i = 1, input[i] = 4, n = 3, then we can identify a sequence comprising 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1, because k is 0, so we take the first predecessor of i.
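The table and the worked example above can be reproduced with a short sketch. It is only an illustration: `distance_columns` is a hypothetical helper name, and it indexes columns by position rather than by value.

```python
def distance_columns(points):
    """columns[i][k] is points[i] - points[k] for every predecessor k < i."""
    return [[p - q for q in points[:i]] for i, p in enumerate(points)]

points = [1, 4, 5, 7, 8, 12]
columns = distance_columns(points)

# Worked example from the text: i = 1 (the value 4), k = 0, so n = 4 - 1.
i, k = 1, 0
n = columns[i][k]

# Successors whose column holds n at position i continue the run 1, 4, ...
extends = [points[s] for s in range(i + 1, len(points)) if columns[s][i] == n]
```

Here `extends` holds the successors that sit exactly one step `n` after `input[i]`; later members of the run would be found by comparing against `n * 2`, `n * 3`, and so on.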
Possible implementation (sorry if the code is not using the same notation as the explanation):
def build_columns(l):
    columns = {}
    for x in l[1:]:
        col = []
        for y in l[:l.index(x)]:
            col.append(x - y)
        columns[x] = col
    return columns

def algo(input, columns):
    seqs = []
    for index1, number in enumerate(input[1:]):
        index1 += 1 #first item was sliced
        for index2, distance in enumerate(columns[number]):
            seq = []
            seq.append(input[index2]) # k-th pred
            seq.append(number)
            matches = 1
            for successor in input[index1 + 1 :]:
                column = columns[successor]
                if column[index1] == distance * matches:
                    matches += 1
                    seq.append(successor)
            if (len(seq) > 2):
                seqs.append(seq)
    return seqs
The longest one:
sequences = algo(input, build_columns(input))
print(max(sequences, key=len))
We can get a solution that runs in O(n*m) time with very little memory, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.
Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two consecutive numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them for the same d. Because of this, the complexity is just O(n) for every value of d.
A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

for d in range(1, 12):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            print("found %d items in %d .. %d" % (count, a, b))
            # collect here the largest 'count'
Updates:
This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximate, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
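As a rough sketch of both ideas in the updates (capping d, and stopping once d is too large to beat the best run found so far), the scan can be wrapped in a function. The name `longest_by_step` and the `max_d` parameter are made up for this illustration, and the input is assumed to be sorted with distinct values. The stopping rule relies on the fact that a run longer than the current best of `c` items needs `c` gaps of size d inside the range, so nothing can improve once `d * c` exceeds the range.

```python
def longest_by_step(A, max_d=None):
    """Return (count, start, step) of the longest equally spaced run in A.

    Assumes A is sorted, distinct, and has at least two items.
    max_d is a hypothetical cap on the step (the "d <= 1000" variant).
    """
    Aset = set(A)
    span = A[-1] - A[0]                      # m in the text
    limit = span if max_d is None else min(span, max_d)
    best = (1, A[0], 0)
    d = 1
    # Stop when even best[0] gaps of size d no longer fit inside the range.
    while d <= limit and d * best[0] <= span:
        already_seen = set()
        for a in A:
            if a in already_seen:
                continue
            b, count = a, 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            if count > best[0]:
                best = (count, a, d)
        d += 1
    return best
```

For the sample list this stops after d = 3, because a run of four items with step 4 would need a range of at least 12.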
Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.
The idea is quite naive: like the original poster, it is greedy and just checks how far you can extend a subsequence from each pair of points, but only after checking that we're at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.
The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercise to the reader :-) The time is competitive with other O(n^2) solutions here.
A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

lmax = 2
for j, b in enumerate(A):
    for i in range(j):
        a = A[i]
        step = b - a
        # extend only from the true start of a run (a - step is absent)
        if b + step in Aset and a - step not in Aset:
            c = b + step
            count = 3
            while c + step in Aset:
                c += step
                count += 1
            #print("found %d items in %d .. %d" % (count, a, c))
            if count > lmax:
                lmax = count

print(lmax)
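The loop above only reports the length. If the subsequence itself is wanted, one possible variant keeps the start and step of the best run as well. This is a sketch rather than part of the original answer; `longest_progression` is a hypothetical name and the input is assumed to be sorted with distinct values.

```python
def longest_progression(A):
    """Return the longest equally spaced subsequence of sorted, distinct A."""
    if len(A) < 3:
        return list(A)
    Aset = set(A)
    best_len, best_start, best_step = 2, A[0], A[1] - A[0]
    for j, b in enumerate(A):
        for i in range(j):
            a = A[i]
            step = b - a
            # Skip pairs that sit in the middle of a run already examined.
            if b + step in Aset and a - step not in Aset:
                c = b + step
                count = 3
                while c + step in Aset:
                    c += step
                    count += 1
                if count > best_len:
                    best_len, best_start, best_step = count, a, step
    return [best_start + k * best_step for k in range(best_len)]
```

When several runs share the maximum length, this returns the first one encountered.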
UPDATE: I've found a paper on this problem, you can download it here.
Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.
We assume all numbers are saved in array a in ascending order, and n saves its length. The 2D array l[i][j] holds the length of the longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).
lmax = 2
l = [[2 for i in range(n)] for j in range(n)]
for mid in range(n - 1):
    prev = mid - 1
    succ = mid + 1
    # two-pointer scan for pairs with a[prev] + a[succ] == 2 * a[mid]
    while (prev >= 0 and succ < n):
        if a[prev] + a[succ] < a[mid] * 2:
            succ += 1
        elif a[prev] + a[succ] > a[mid] * 2:
            prev -= 1
        else:
            l[mid][succ] = l[prev][mid] + 1
            lmax = max(lmax, l[mid][succ])
            prev -= 1
            succ += 1

print(lmax)
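For experimenting with the recurrence, the same loop can be wrapped in a function that defines `a` and `n` itself. This is a minimal sketch; `longest_progression_length` is a hypothetical name, and the input is assumed to be sorted with distinct values.

```python
def longest_progression_length(a):
    """Length of the longest equally spaced subsequence of sorted, distinct a."""
    n = len(a)
    if n < 3:
        return n
    lmax = 2
    # l[i][j]: longest run whose last two elements are a[i] and a[j].
    l = [[2] * n for _ in range(n)]
    for mid in range(1, n - 1):
        prev, succ = mid - 1, mid + 1
        while prev >= 0 and succ < n:
            total = a[prev] + a[succ]
            if total < 2 * a[mid]:
                succ += 1
            elif total > 2 * a[mid]:
                prev -= 1
            else:
                # a[mid] is the midpoint, so (prev, mid, succ) is equally spaced.
                l[mid][succ] = l[prev][mid] + 1
                lmax = max(lmax, l[mid][succ])
                prev -= 1
                succ += 1
    return lmax
```

Rows are filled in increasing order of `mid`, so `l[prev][mid]` is already final when it is read.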
Algorithm
So for the list [1, 2, 4, 5, 7] the output would be (it's a little messy, try the code yourself and see):
1 in precalc? No - do nothing
2 in precalc? No - do nothing
    1 + (2 - 1) * 2 in our set? No - do nothing
4 in precalc? No - do nothing
    2 + (4 - 2) * 2 in our set? No
    1 + (4 - 1) * 2 in our set? Yes - add new element {7: {3: {'count': 2, 'start': 1}}}
    7 - element of the list, 3 is step.
5:
    5 in precalc? No - do nothing
    4 because 6 = 4 + (5 - 4) * 2 is less than calculated element 7
    2 + (5 - 2) * 2 in our set? No
    1 + (5 - 1) * 2 - more than max(a) == 7
7:
    5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7
result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence
Complexity
It shouldn't be more than O(N^2), and I think it's less because the search for new sequences terminates early. I'll try to provide a detailed analysis later.
Code
def add_precalc(precalc, start, step, count, res, N):
    if step == 0: return True
    if start + step * res[1]["count"] > N: return False

    # x is the value this run expects next
    x = start + step * count
    if x > N or x < 0: return False

    if precalc[x] is None: return True

    if step not in precalc[x]:
        precalc[x][step] = {"start":start, "count":count}

    return True

def work(a):
    # precalc[x] maps step -> {"start", "count"} for runs that expect x next
    precalc = [None] * (max(a) + 1)
    for x in a: precalc[x] = {}
    N, m = max(a), 0
    ind = {x:i for i, x in enumerate(a)}

    res = (0, {"start":0, "count":0})
    for i, x in enumerate(a):
        for el in precalc[x].items():
            el[1]["count"] += 1
            if el[1]["count"] > res[1]["count"]: res = el
            add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
            t = el[1]["start"] + el[0] * el[1]["count"]
            if t in ind and ind[t] > m:
                m = ind[t]
        precalc[x] = None

        for y in a[i - m - 1::-1]:
            if not add_precalc(precalc, y, x - y, 2, res, N): break

    return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]
This is a particular case of the more generic problem described here: Discover long patterns, where K=1 and is fixed. It is demonstrated there that it can be solved in O(N^2). Running my implementation of the C algorithm proposed there, it takes 3 seconds to find the solution for N=20000 and M=28000 on my 32-bit machine.