Longest equally-spaced subsequence

遥遥无期 2020-12-22 19:12

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example, in 1, 4, 5, 7, 8, 12 the longest such subsequences are 1, 4, 7 and 4, 8, 12 (length 3).



        
10 Answers
  • 2020-12-22 19:54

    Greedy method:
    1. Only one sequence of decisions is generated.
    2. It does not guarantee an optimal solution.

    Dynamic programming:
    1. Many sequences of decisions are generated.
    2. It definitely gives an optimal solution.
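    A classic illustration of this difference is coin change, where the greedy method's single decision sequence can miss the optimum (my own example, not part of the answer above):

```python
def greedy_coins(coins, target):
    # Greedy: one fixed sequence of decisions (always take the largest coin).
    count = 0
    for c in sorted(coins, reverse=True):
        count += target // c
        target %= c
    return count

def dp_coins(coins, target):
    # Dynamic programming: evaluate every decision at every sub-amount.
    INF = float("inf")
    best = [0] + [INF] * target
    for amount in range(1, target + 1):
        for c in coins:
            if c <= amount:
                best[amount] = min(best[amount], best[amount - c] + 1)
    return best[target]
```

    With coins 1, 3, 4 and target 6, greedy picks 4 + 1 + 1 (three coins) while DP finds 3 + 3 (two coins).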

  • 2020-12-22 19:55

    Traverse the array, keeping a record of the best result(s) so far and a table with

    (1) index - the common difference of the sequence,
    (2) count - the number of elements in the sequence so far, and
    (3) the last recorded element of the sequence.

    For each array element, look at its difference from each previous array element; if that previous element is the last element of a sequence indexed in the table, extend that sequence in the table and update the best sequence if applicable; otherwise start a new sequence, unless the current maximum is already greater than the length of the longest sequence possible from here.

    Scanning backwards, we can stop the scan when d is greater than half of the array's range, or, for d greater than the largest indexed difference, when the current maximum is greater than the length of the longest possible sequence. Sequences whose last element is less than s[j] are deleted.

    I converted my code from JavaScript to Python (my first python code):

    import random
    import timeit
    import sys
    
    #s = [1,4,5,7,8,12]
    #s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
    #s = [0, 6, 7, 10, 11, 12, 16, 18, 19]
    
    m = [random.randint(1,40000) for r in range(20000)]
    s = list(set(m))
    s.sort()
    
    lenS = len(s)
    halfRange = (s[lenS-1] - s[0]) // 2
    
    while s[lenS-1] - s[lenS-2] > halfRange:
        s.pop()
        lenS -= 1
        halfRange = (s[lenS-1] - s[0]) // 2
    
    while s[1] - s[0] > halfRange:
        s.pop(0)
        lenS -=1
        halfRange = (s[lenS-1] - s[0]) // 2
    
    n = lenS
    
    largest = (s[n-1] - s[0]) // 2
    #largest = 1000 #set the maximum size of d searched
    
    maxS = s[n-1]
    maxD = 0
    maxSeq = 0
    hCount = [None]*(largest + 1)
    hLast = [None]*(largest + 1)
    best = {}
    
    start = timeit.default_timer()
    
    for i in range(1,n):
    
        sys.stdout.write(repr(i)+"\r")
    
        for j in range(i-1,-1,-1):
            d = s[i] - s[j]
            numLeft = n - i
            if d != 0:
                maxPossible = (maxS - s[i]) // d + 2
            else:
                maxPossible = numLeft + 2
            ok = numLeft + 2 > maxSeq and maxPossible > maxSeq
    
            if d > largest or (d > maxD and not ok):
                break
    
            if hLast[d] is not None:
                found = False
                for k in range (len(hLast[d])-1,-1,-1):
                    tmpLast = hLast[d][k]
                    if tmpLast == j:
                        found = True
                        hLast[d][k] = i
                        hCount[d][k] += 1
                        tmpCount = hCount[d][k]
                        if tmpCount > maxSeq:
                            maxSeq = tmpCount
                            best = {'len': tmpCount, 'd': d, 'last': i}
                    elif s[tmpLast] < s[j]:
                        del hLast[d][k]
                        del hCount[d][k]
                if not found and ok:
                    hLast[d].append(i)
                    hCount[d].append(2)
            elif ok:
                if d > maxD: 
                    maxD = d
                hLast[d] = [i]
                hCount[d] = [2]
    
    
    end = timeit.default_timer()
    seconds = (end - start)
    
    #print (hCount)
    #print (hLast)
    print(best)
    print(seconds)
    
  • 2020-12-22 19:57

    Your solution is O(N^3) now (you said O(N^2) per index). Here is a solution with O(N^2) time and O(N^2) memory.

    Idea

    If we know a subsequence that goes through indices i[0], i[1], i[2], i[3], we shouldn't try subsequences that start with i[1] and i[2], or with i[2] and i[3].

    Note: I edited the code to make it a bit simpler by using the fact that the array is sorted, but it will not work with equal elements. You can easily count the maximum number of equal elements in O(N) and handle that case separately.

    Pseudocode

    I'm searching only for the maximum length, but that doesn't change anything.

    whereInA = {}
    for i in range(n):
       whereInA[a[i]] = i; // It doesn't matter which of same elements it points to
    
    boolean usedPairs[n][n];
    
    for i in range(n):
        for j in range(i + 1, n):
            if usedPairs[i][j]:
                continue; // do not do anything. It was in one of prev sequences.
    
            usedPairs[i][j] = true;
    
            // here is a quite naive inner search:
            diff = a[j] - a[i];
            if diff == 0:
                continue; // we can't work with that
            lastIndex = j
            currentLen = 2
            while whereInA contains a[lastIndex] + diff :
                nextIndex = whereInA[a[lastIndex] + diff]
                usedPairs[lastIndex][nextIndex] = true
                ++currentLen
                lastIndex = nextIndex
    
            // you may store all indices here
            maxLen = max(maxLen, currentLen)
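    In Python, the pseudocode above translates almost line for line. A minimal runnable sketch, assuming a sorted list of distinct integers (function and variable names are mine):

```python
def longest_equal_spacing(a):
    # a: sorted list of distinct integers
    n = len(a)
    if n < 3:
        return n
    where = {v: i for i, v in enumerate(a)}   # value -> index (whereInA)
    used = [[False] * n for _ in range(n)]    # usedPairs
    max_len = 2
    for i in range(n):
        for j in range(i + 1, n):
            if used[i][j]:
                continue  # this pair already sits inside an earlier sequence
            used[i][j] = True
            diff = a[j] - a[i]
            last, cur_len = j, 2
            # follow the arithmetic progression as far as it goes
            while a[last] + diff in where:
                nxt = where[a[last] + diff]
                used[last][nxt] = True
                cur_len += 1
                last = nxt
            max_len = max(max_len, cur_len)
    return max_len
```

    For the question's list 1, 4, 5, 7, 8, 12 this returns 3 (e.g. 4, 8, 12).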
    

    Thoughts about memory usage

    O(n^2) time is very slow for 1,000,000 elements. But if you are going to run this code on that many elements, the biggest problem will be memory usage.
    What can be done to reduce it?

    • Change the boolean arrays to bitfields to store more booleans per byte.
    • Make each successive boolean array shorter, because we only use usedPairs[i][j] if i < j.

    A few heuristics:

    • Store only the pairs of used indices. (This conflicts with the first idea.)
    • Remove usedPairs entries that will never be used again (those for i, j that the loop has already passed).
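    Combining the first two ideas, the booleans can be packed eight per byte and stored only for the upper triangle j > i. A sketch (class and helper names are mine):

```python
class TriBits:
    # Bit-packed upper-triangular boolean matrix: one bit per pair (i, j), j > i.
    def __init__(self, n):
        self.n = n
        self.bits = bytearray((n * (n - 1) // 2 + 7) // 8)

    def _pos(self, i, j):
        # flatten (i, j) with j > i into an index over the upper triangle
        return i * (2 * self.n - i - 1) // 2 + (j - i - 1)

    def get(self, i, j):
        p = self._pos(i, j)
        return (self.bits[p >> 3] >> (p & 7)) & 1

    def set(self, i, j):
        p = self._pos(i, j)
        self.bits[p >> 3] |= 1 << (p & 7)
```

    Even bit-packed, n = 10^6 still needs roughly 62 GB for the triangle, which is why the heuristics above matter.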
  • 2020-12-22 19:58

    Update: The first algorithm described here is made obsolete by Armin Rigo's second answer, which is much simpler and more efficient. But both of these methods have one disadvantage: they need many hours to find the result for one million integers. So I tried two more variants (see the second half of this answer) where the range of the input integers is assumed to be limited; that limitation allows much faster algorithms. I also tried to optimize Armin Rigo's code. See my benchmarking results at the end.


    Here is an idea of an algorithm using O(N) memory. Time complexity is O(N^2 log N), but it may be decreased to O(N^2).

    Algorithm uses the following data structures:

    1. prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
    2. hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
    3. pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.

    Algorithm:

    1. Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences".
    2. Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
    3. In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
    4. Remove hash map record found on step #2.
    5. Continue from step #2 while pq is not empty.

    This algorithm updates O(N) elements of prev O(N) times each, and each of these updates may require adding a new "difference" to pq. All this means a time complexity of O(N^2 log N) if we use a simple heap implementation for pq. To decrease it to O(N^2) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.
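    Step 2's "get (and remove) the smallest difference" is exactly what Python's heapq module provides; a minimal sketch of just that part, with made-up difference values:

```python
import heapq

pq = []  # priority queue of candidate "difference" values
for diff in (7, 3, 12, 5):
    heapq.heappush(pq, diff)

smallest = heapq.heappop(pq)  # the smallest "difference" is processed first
```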

    See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).

    And the same code after a little optimization. Here search is terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range.


    Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. Search may be terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range:

    def findLESS(A):
      # A is a sorted list of distinct integers
      Aset = set(A)
      lmax = 2
      d = 1
      minStep = 0
    
      # stop once even the smallest step seen cannot fit a longer
      # sequence into the range of A
      while (lmax - 1) * minStep <= A[-1] - A[0]:
        minStep = A[-1] - A[0] + 1
        for j, b in enumerate(A):
          if j+d < len(A):
            a = A[j+d]
            step = a - b
            minStep = min(minStep, step)
            # only extend from the true start of a progression
            if a + step in Aset and b - step not in Aset:
              c = a + step
              count = 3
              while c + step in Aset:
                c += step
                count += 1
              if count > lmax:
                lmax = count
        d += 1
    
      return lmax
    
    print(findLESS([1, 4, 5, 7, 8, 12]))
    

    If the range of integers in the source data (M) is small, a simple algorithm is possible with O(M^2) time and O(M) space:

    def findLESS(src):
      # r[x] is True if value x is present in the input
      r = [False for i in range(src[-1]+1)]
      for x in src:
        r[x] = True
    
      d = 1
      best = 1
    
      # a longer sequence with step d could not fit into the value range
      while best * d < len(r):
        for s in range(d):
          l = 0
    
          # walk the lattice s, s+d, s+2*d, ... counting runs of present values
          for i in range(s, len(r), d):
            if r[i]:
              l += 1
              best = max(best, l)
            else:
              l = 0
    
        d += 1
    
      return best
    
    
    print(findLESS([1, 4, 5, 7, 8, 12]))
    

    It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I assume the source data has no duplicates, and (to keep the code simple) that the minimum input value is non-negative and close to zero.
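    Both assumptions are easy to establish in a preprocessing pass, since equal spacing is unaffected by shifting every value by a constant (helper name is mine):

```python
def normalize(src):
    # Deduplicate, sort, and shift so the minimum value becomes 0.
    s = sorted(set(src))
    lo = s[0]
    return [x - lo for x in s]
```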


    The previous algorithm may be improved if, instead of the array of booleans, we use a bitset data structure with bitwise operations to process the data in parallel. The code shown below implements the bitset as a built-in Python integer. It has the same assumptions: no duplicates, and the minimum input value is non-negative and close to zero. Time complexity is O(M^2 * log L), where L is the length of the optimal subsequence; space complexity is O(M):
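    The core trick can be seen in isolation: ANDing the bitset with a shifted copy of itself keeps only the values that start a longer progression. A small demonstration (my own, for the list 1, 4, 5, 7, 8, 12 and step d = 4):

```python
r = 0
for x in [1, 4, 5, 7, 8, 12]:
    r |= 1 << x  # one bit per value

d = 4
rr = r & (r >> d)   # bits mark x with both x and x+4 present: 1, 4, 8
rr &= rr >> d       # bits mark x starting a 3-term run: only 4 (4, 8, 12)
starts = [i for i in range(13) if (rr >> i) & 1]
```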

    def findLESS(src):
      # pack the input into one big integer, one bit per value
      r = 0
      for x in src:
        r |= 1 << x
    
      d = 1
      best = 1
    
      while best * d < src[-1] + 1:
        c = best
        rr = r
    
        # AND together shifted copies of r in O(log best) steps, so that
        # the set bits of rr mark starts of best-term progressions with step d
        while c & (c-1):
          cc = c & -c
          rr &= rr >> (cc * d)
          c &= c-1
    
        while c != 1:
          c = c >> 1
          rr &= rr >> (c * d)
    
        rr &= rr >> d
    
        # each further AND extends all surviving progressions by one term
        while rr:
          rr &= rr >> d
          best += 1
    
        d += 1
    
      return best
    

    Benchmarks:

    Input data (about 100000 integers) is generated this way:

    random.seed(42)
    s = sorted(list(set([random.randint(0,200000) for r in range(140000)])))
    

    And for fastest algorithms I also used the following data (about 1000000 integers):

    s = sorted(list(set([random.randint(0,2000000) for r in range(1400000)])))
    

    All results show time in seconds:

    Size:                         100000   1000000
    Second answer by Armin Rigo:     634         ?
    By Armin Rigo, optimized:         64     >5000
    O(M^2) algorithm:                 53      2940
    O(M^2*L) algorithm:                7       711
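    These timings can be reproduced with a small harness (the harness is mine; plug in any of the findLESS variants above):

```python
import random
import timeit

def bench(fn, data):
    # Run fn once on data and return (result, elapsed seconds).
    start = timeit.default_timer()
    result = fn(data)
    return result, timeit.default_timer() - start

# the ~100000-integer data set described above
random.seed(42)
s = sorted(set(random.randint(0, 200000) for r in range(140000)))
```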
    