How to detect repeating digits in an infinite sequence? I tried Floyd & Brent detection algorithm but come to nothing... I have a generator that yields
Here's a fun take on the problem. The more general form of your question is this:
Given a repeating sequence of unknown length, determine the period of the signal.
The process to determine the repeating frequencies is known as the Fourier Transform. In your example case the signal is clean and discrete, but the following solution will work even with continuous noisy data! The FFT will try to duplicate the frequencies of the input signal by approximating them in the so-called "wave-space" or "Fourier-space". Basically a peak in this space corresponds to a repeating signal. The period of your signal is related to the longest wavelength that is peaked.
import itertools
# of course this is a fake one just to offer an example
def source():
return itertools.cycle((1, 0, 1, 4, 8, 2, 1, 3, 3, 2))
import pylab as plt
import numpy as np
import scipy as sp
# Generate some test data, i.e. our "observations" of the signal
N = 300
vals = source()
X = np.array([vals.next() for _ in xrange(N)])
# Compute the FFT
W = np.fft.fft(X)
freq = np.fft.fftfreq(N,1)
# Look for the longest signal that is "loud"
threshold = 10**2
idx = np.where(abs(W)>threshold)[0][-1]
max_f = abs(freq[idx])
print "Period estimate: ", 1/max_f
This gives the correct answer for this case, 10
though if N
didn't divide the cycles cleanly, you would get a close estimate. We can visualize this via:
plt.subplot(211)
plt.scatter([max_f,], [np.abs(W[idx]),], s=100,color='r')
plt.plot(freq[:N/2], abs(W[:N/2]))
plt.xlabel(r"$f$")
plt.subplot(212)
plt.plot(1.0/freq[:N/2], abs(W[:N/2]))
plt.scatter([1/max_f,], [np.abs(W[idx]),], s=100,color='r')
plt.xlabel(r"$1/f$")
plt.xlim(0,20)
plt.show()
Evgeny Kluev's answer provides a way to get an answer that might be right.
Let's assume you have some sequence D
that is a repeating sequence. That is there is some sequence d
of length L
such that: D_i = d_{i mod L}
, where t_i
is the i
th element of sequence t
that is numbered from 0. We will say sequence d
generates D
.
Given a sequence D
which you know is generated by some finite sequence t
. Given some d
it is impossible to decide in finite time whether it generates D
.
Since we are only allowed a finite time we can only access a finite number of elements of D
. Let us suppose we access the first F
elements of D
. We chose the first F
because if we are only allowed to access a finite number, the set containing the indices of the elements we've accessed is finite and hence has a maximum. Let that maximum be M
. We can then let F = M+1
, which is still a finite number.
Let L
be the length of d
and that D_i = d_{i mod L}
for i < F
. There are two possibilities for D_F
it is either the same as d_{F mod L}
or it is not. In the former case d
seems to work, but in the latter case it does not. We cannot know which case is true until we access D_F
. This will however require accessing F+1
elements, but we are limited to F
element accesses.
Hence, for any F
we won't have enough information to decide whether d
generates D
and therefore it is impossible to know in finite time whether d
generates D
.
It is possible to know in finite time that a sequence d
does not generate D
, but this doesn't help you. Since you want to find a sequence d
that does generate D
, but this involves amongst other things being able to prove that some sequence generates D
.
Unless you have more information about D
this problem is unsolvable. One bit of information that will make this problem decidable is some upper bound on the length of the shortest d
that generates D
. If you know the function generating D
only has a known amount of finite state you can calculate this upper bound.
Hence, my conclusion is that you cannot solve this problem unless you change the specification a bit.
I have no idea about proper algorithms to apply here, but my understanding also is that you can never know for sure that you've detected a period if you have consumed only a finite number of terms. Anyway, here's what I've come up with, this is a very naive implementation, more to educate from the comments than to provide a good solution (I guess).
def guess_period(source, minlen=1, maxlen=100, trials=100):
for n in range(minlen, maxlen+1):
p = [j for i, j in zip(range(n), source)]
if all([j for i, j in zip(range(n), source)] == p
for k in range(trials)):
return tuple(p)
return None
This one, however, "forgets" the initial order and returns a tuple that is a cyclic permutation of the actual period:
In [101]: guess_period(gen)
Out[101]: (0, 1, 4, 8, 2, 1, 3, 3, 1, 1)
To compensate for this, you'll need to keep track of the offset.
Since your sequence is not of the form Xn+1 = f(Xn), Floyd's or Brent's algorithms are not directly applicable to your case. But they may be extended to do the task:
k
).k
elements of the sequencek
-element subsequence.k
, update k
and continue with the step 3.If the sequence cycling starts from the first element, ignore step 1 and start from step 2 (find next sequence element equal to the first element).
If the sequence cycling does not start from the first element, it is possible that Floyd's or Brent's algorithm finds some repeating element of the sequence that does not belong to the cycle. So it is reasonable to limit the number of iterations in steps 2 and 4, and if this limit is exceeded, continue with the step 1.