Finding the minimum length RLE

感动是毒 2021-02-04 17:05

The classical RLE algorithm compresses data by using numbers to represent how many times the character following a number appears in the text at that position. For example:
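A minimal sketch of this classical scheme in Python, using `itertools.groupby` to collect runs of identical characters. The function name `classical_rle` is chosen here for illustration, and the sample string is borrowed from the comment in the answer's code:

```python
from itertools import groupby


def classical_rle(text):
    """Encode each run of identical characters as <count><character>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


if __name__ == "__main__":
    sample = "AAABBAAABBCECE"
    encoded = classical_rle(sample)
    print(encoded)                     # 3A2B3A2B1C1E1C1E
    print(len(sample), len(encoded))   # 14 16
```

On this input the "compressed" form is longer than the original, which is why allowing a count to apply to a whole substring rather than a single character can pay off.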

4 answers
  •  说谎 (original poster)
    2021-02-04 17:28

    It can be done in quadratic time via dynamic programming.

    Here is some Python code:

    import sys
    import numpy as np

    bignum = 10000  # acts as infinity; assumes len(S) is well below 10000

    S = sys.argv[1]  # e.g. 'AAABBAAABBCECE'
    N = len(S)

    # maxmatch[i,j] = length of the longest common prefix of S[i:] and S[j:]
    maxmatch = np.zeros((N+1, N+1), dtype=int)

    for i in range(N-1, -1, -1):
      for j in range(i+1, N):
        if S[i] == S[j]:
          maxmatch[i,j] = maxmatch[i+1,j+1] + 1

    # P[n,k] = cost of encoding first n characters given that last k are a block
    P = np.zeros((N+1, N+1), dtype=int) + bignum
    # Q[n] = cost of encoding first n characters
    Q = np.zeros(N+1, dtype=int) + bignum

    # base case: no cost for empty string
    P[0,0] = 0
    Q[0] = 0

    for n in range(1, N+1):
      for k in range(1, n+1):
        if n-2*k >= 0:
          # equivalent to S[n-k:n] == S[n-2*k:n-k], but checked in O(1)
          if maxmatch[n-2*k,n-k] >= k:
            # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k
            P[n,k] = min(P[n,k], P[n-k,k])
            print('P[%d,%d] = %d' % (n, k, P[n,k]))
        # Here we are starting a new block: 1 x_1...x_k
        P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
        print('P[%d,%d] = %d' % (n, k, P[n,k]))
      for k in range(1, n+1):
        Q[n] = min(Q[n], P[n,k])

      print()

    print(Q[N])
    

    You can reconstruct the actual encoding by remembering your choices along the way.
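One way to remember the choices is sketched below. It is a plain-Python rewrite of the same recurrence, not the answer's own code: the function name `min_rle` and the helper tables `extended` and `best_k` are made up for this illustration. It records, for every `p[i][k]`, whether the cheaper option was to extend the previous block, and for every `q[i]` which block length won, then walks those records backwards.

```python
def min_rle(s):
    """Return (cost, blocks); blocks is a list of (count, substring) pairs."""
    n = len(s)
    inf = float("inf")

    # match[i][j] = length of the longest common prefix of s[i:] and s[j:]
    match = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                match[i][j] = match[i + 1][j + 1] + 1

    # p[i][k] = cost of the first i characters given that the last k are a block
    # q[i]    = cost of the first i characters
    p = [[inf] * (n + 1) for _ in range(n + 1)]
    q = [inf] * (n + 1)
    p[0][0] = 0
    q[0] = 0
    # extended[i][k]: p[i][k] was obtained by incrementing a count
    extended = [[False] * (n + 1) for _ in range(n + 1)]
    # best_k[i]: block length that achieves q[i]
    best_k = [0] * (n + 1)

    for i in range(1, n + 1):
        for k in range(1, i + 1):
            p[i][k] = q[i - k] + 1 + k  # start a new block "1 x_1...x_k"
            repeats = i - 2 * k >= 0 and match[i - 2 * k][i - k] >= k
            if repeats and p[i - k][k] < p[i][k]:
                p[i][k] = p[i - k][k]   # increment the previous block's count
                extended[i][k] = True
            if p[i][k] < q[i]:
                q[i] = p[i][k]
                best_k[i] = k

    # Walk the recorded choices from the end of the string to the start.
    blocks = []
    i = n
    while i > 0:
        k = best_k[i]
        count = 1
        while extended[i][k]:
            i -= k
            count += 1
        blocks.append((count, s[i - k:i]))
        i -= k
    blocks.reverse()
    return q[n], blocks


if __name__ == "__main__":
    cost, blocks = min_rle("ABCDABCDABCDBCD")
    print(cost, "".join(f"{c}{sub}" for c, sub in blocks))  # 9 3ABCD1BCD
```

Like the original program, this sketch charges one character for every count, whatever its size.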

    I have left out a minor wrinkle, which is that we might have to use an extra byte to hold C+1 if C is large. If you are using 32-bit ints, this will not come up in any context where this algorithm's runtime is feasible. If you are sometimes using shorter ints to save space then you will have to think about it, and maybe add another dimension to your table based on the size of the latest C. In theory, this might add a log(N) factor, but I don't think this will be apparent in practice.
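To make the wrinkle concrete, here is a hedged sketch of one possible variant. It assumes counts are written in decimal and that each digit costs one character, which is only one of several reasonable cost models. Rather than adding a dimension to the table, it uses a different formulation: for every end position and block length it tries each repeat count directly. The function name `min_rle_cost_with_digits` is hypothetical.

```python
def min_rle_cost_with_digits(s):
    """Minimum encoded length when a count of c costs len(str(c)) characters."""
    n = len(s)
    inf = float("inf")

    # match[i][j] = length of the longest common prefix of s[i:] and s[j:]
    match = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                match[i][j] = match[i + 1][j + 1] + 1

    # q[end] = cost of encoding the first `end` characters
    q = [inf] * (n + 1)
    q[0] = 0
    for end in range(1, n + 1):
        for k in range(1, end + 1):
            # The last block is s[end-k:end] repeated `count` times,
            # covering s[start:end].
            start = end - k
            count = 1
            while True:
                q[end] = min(q[end], q[start] + len(str(count)) + k)
                prev = start - k
                # Stop when there is no further identical copy to the left.
                if prev < 0 or match[prev][start] < k:
                    break
                start = prev
                count += 1
    return q[n]


if __name__ == "__main__":
    print(min_rle_cost_with_digits("A" * 9))    # 2, encoded as 9A
    print(min_rle_cost_with_digits("A" * 10))   # 3, e.g. 10A
```

For each end position, the number of (block length, count) pairs tried is at most the sum of end/k over k, which grows like end·log(end), so the whole loop does roughly N² log N work. That is in line with the extra log(N) factor mentioned above.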

    Edit: For the benefit of @Moron, here is the first program again with more print statements, so that you can more easily see what the algorithm is doing:

    import sys
    import numpy as np

    bignum = 10000  # acts as infinity; assumes len(S) is well below 10000

    S = sys.argv[1]  # e.g. 'AAABBAAABBCECE'
    N = len(S)

    # maxmatch[i,j] = length of the longest common prefix of S[i:] and S[j:]
    maxmatch = np.zeros((N+1, N+1), dtype=int)

    for i in range(N-1, -1, -1):
      for j in range(i+1, N):
        if S[i] == S[j]:
          maxmatch[i,j] = maxmatch[i+1,j+1] + 1

    # P[n,k] = cost of encoding first n characters given that last k are a block
    P = np.zeros((N+1, N+1), dtype=int) + bignum
    # Q[n] = cost of encoding first n characters
    Q = np.zeros(N+1, dtype=int) + bignum

    # base case: no cost for empty string
    P[0,0] = 0
    Q[0] = 0

    for n in range(1, N+1):
      for k in range(1, n+1):
        if n-2*k >= 0:
          # equivalent to S[n-k:n] == S[n-2*k:n-k], but checked in O(1)
          if maxmatch[n-2*k,n-k] >= k:
            # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k
            P[n,k] = min(P[n,k], P[n-k,k])
            print("P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters "
                  "if I use my solution for P[%d,%d] with %s's count incremented"
                  % (n, k, P[n,k], n, P[n-k,k], n-k, k, S[n-k:n]))
        # Here we are starting a new block: 1 x_1...x_k
        P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
        print('P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters '
              'if I use my solution for Q[%d] with a new block 1%s'
              % (n, k, P[n,k], n, Q[n-k]+1+k, n-k, S[n-k:n]))
      for k in range(1, n+1):
        Q[n] = min(Q[n], P[n,k])

      print()
      print('Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n, Q[n], n, Q[n]))
      print()


    print(Q[N])
    

    The last few lines of its output on ABCDABCDABCDBCD are like so:

    Q[13] = 7        I can encode first 13 characters of S in only 7 characters!
    
    P[14,1] = 9      I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C
    P[14,2] = 8      I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC
    P[14,3] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC
    P[14,4] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC
    P[14,5] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC
    P[14,6] = 12     I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC
    P[14,7] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC
    P[14,8] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC
    P[14,9] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC
    P[14,10] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC
    P[14,11] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC
    P[14,12] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC
    P[14,13] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC
    P[14,14] = 15    I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC
    
    Q[14] = 8        I can encode first 14 characters of S in only 8 characters!
    
    P[15,1] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D
    P[15,2] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD
    P[15,3] = 11     I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD's count incremented
    P[15,3] = 9      I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD
    P[15,4] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD
    P[15,5] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD
    P[15,6] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD
    P[15,7] = 13     I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD
    P[15,8] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD
    P[15,9] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD
    P[15,10] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD
    P[15,11] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD
    P[15,12] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD
    P[15,13] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD
    P[15,14] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD
    P[15,15] = 16    I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD
    
    Q[15] = 9        I can encode first 15 characters of S in only 9 characters!
    
