The classical RLE algorithm compresses data by writing, before each character, a number that says how many times that character repeats at that position. For example, AAABBAAABBCECE becomes 3A2B3A2B1C1E1C1E.
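As a point of reference, the classical scheme can be sketched in a few lines of Python. The function name `rle_encode` is just an illustrative choice, and counts are written as decimal digits in front of each character.

```python
from itertools import groupby

def rle_encode(text):
    """Classical RLE: each run of one repeated character becomes <count><char>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

print(rle_encode("AAABBAAABBCECE"))  # 3A2B3A2B1C1E1C1E
```

Note that for this input the "compressed" form is longer than the original, which is what motivates counting repeated substrings instead of single characters.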
It can be done in quadratic time via dynamic programming.
Here is some Python code:
import sys
import numpy as np

bignum = 10000  # stand-in for "infinity"; larger than any real cost for short inputs
S = sys.argv[1]  # e.g. 'AAABBAAABBCECE'
N = len(S)

# maxmatch[i,j] = length of longest substring match between S[i:] and S[j:]
maxmatch = np.zeros((N+1, N+1), dtype=int)
for i in range(N-1, -1, -1):
    for j in range(i+1, N):
        if S[i] == S[j]:
            maxmatch[i,j] = maxmatch[i+1,j+1] + 1

# P[n,k] = cost of encoding first n characters given that last k are a block
P = np.zeros((N+1, N+1), dtype=int) + bignum
# Q[n] = cost of encoding first n characters
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string
P[0,0] = 0
Q[0] = 0

for n in range(1, N+1):
    for k in range(1, n+1):
        if n - 2*k >= 0:
            # equivalent to checking S[n-k:n] == S[n-2*k:n-k]
            if maxmatch[n-2*k, n-k] >= k:
                # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k
                P[n,k] = min(P[n,k], P[n-k,k])
                print('P[%d,%d] = %d' % (n, k, P[n,k]))
        # Here we are starting a new block: 1 x_1...x_k
        P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
        print('P[%d,%d] = %d' % (n, k, P[n,k]))
    for k in range(1, n+1):
        Q[n] = min(Q[n], P[n,k])
    print()

print(Q[N])
You can reconstruct the actual encoding by remembering your choices along the way.
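One way to do that reconstruction is sketched below. It is the same recurrence written with plain lists, plus two bookkeeping tables that are my own additions for illustration: `extended[n][k]` records whether P[n,k] was reached by incrementing a count, and `best_k[n]` records which block length gave Q[n]. The function name `best_encoding` is likewise hypothetical. As in the code above, every count is assumed to cost one character.

```python
def best_encoding(s):
    """Return (cost, encoding) for s, where each block is written as <count><substring>."""
    n_chars = len(s)
    inf = float("inf")

    # match[i][j] = length of the longest common prefix of s[i:] and s[j:]
    match = [[0] * (n_chars + 1) for _ in range(n_chars + 1)]
    for i in range(n_chars - 1, -1, -1):
        for j in range(i + 1, n_chars):
            if s[i] == s[j]:
                match[i][j] = match[i + 1][j + 1] + 1

    P = [[inf] * (n_chars + 1) for _ in range(n_chars + 1)]
    Q = [inf] * (n_chars + 1)
    Q[0] = 0
    extended = [[False] * (n_chars + 1) for _ in range(n_chars + 1)]
    best_k = [0] * (n_chars + 1)

    for n in range(1, n_chars + 1):
        for k in range(1, n + 1):
            # Option 1: start a new block "1 x_1...x_k".
            P[n][k] = Q[n - k] + 1 + k
            # Option 2: increment the count of an identical block just before it.
            if (n - 2 * k >= 0 and match[n - 2 * k][n - k] >= k
                    and P[n - k][k] <= P[n][k]):
                P[n][k] = P[n - k][k]
                extended[n][k] = True
            if P[n][k] < Q[n]:
                Q[n] = P[n][k]
                best_k[n] = k

    # Walk the recorded choices backwards, from the end of the string.
    blocks = []
    n = n_chars
    while n > 0:
        k = best_k[n]
        count = 1
        while extended[n][k]:
            count += 1
            n -= k
        blocks.append((count, s[n - k:n]))
        n -= k
    blocks.reverse()
    return Q[n_chars], "".join(f"{count}{block}" for count, block in blocks)

print(best_encoding("AAABBAAABBCECE"))  # (9, '2AAABB2CE')
```

When several encodings have the same cost, this sketch simply returns whichever one the tie-breaking in the loops happens to pick.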
I have left out a minor wrinkle, which is that we might have to use an extra byte to hold C+1 if C is large. If you are using 32-bit ints, this will not come up in any context where this algorithm's runtime is feasible. If you are sometimes using shorter ints to save space then you will have to think about it, and maybe add another dimension to your table based on the size of the latest C. In theory, this might add a log(N) factor, but I don't think this will be apparent in practice.
Edit: For the benefit of @Moron, here is the same code with more print statements, so that you can more easily see what the algorithm is thinking:
import sys
import numpy as np

bignum = 10000  # stand-in for "infinity"; larger than any real cost for short inputs
S = sys.argv[1]  # e.g. 'AAABBAAABBCECE'
N = len(S)

# maxmatch[i,j] = length of longest substring match between S[i:] and S[j:]
maxmatch = np.zeros((N+1, N+1), dtype=int)
for i in range(N-1, -1, -1):
    for j in range(i+1, N):
        if S[i] == S[j]:
            maxmatch[i,j] = maxmatch[i+1,j+1] + 1

# P[n,k] = cost of encoding first n characters given that last k are a block
P = np.zeros((N+1, N+1), dtype=int) + bignum
# Q[n] = cost of encoding first n characters
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string
P[0,0] = 0
Q[0] = 0

for n in range(1, N+1):
    for k in range(1, n+1):
        if n - 2*k >= 0:
            # equivalent to checking S[n-k:n] == S[n-2*k:n-k]
            if maxmatch[n-2*k, n-k] >= k:
                # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k
                P[n,k] = min(P[n,k], P[n-k,k])
                print("P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters "
                      "if I use my solution for P[%d,%d] with %s's count incremented"
                      % (n, k, P[n,k], n, P[n-k,k], n-k, k, S[n-k:n]))
        # Here we are starting a new block: 1 x_1...x_k
        P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
        print("P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters "
              "if I use my solution for Q[%d] with a new block 1%s"
              % (n, k, P[n,k], n, Q[n-k]+1+k, n-k, S[n-k:n]))
    for k in range(1, n+1):
        Q[n] = min(Q[n], P[n,k])
    print()
    print('Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n, Q[n], n, Q[n]))
    print()

print(Q[N])
The last few lines of its output on ABCDABCDABCDBCD are like so:
Q[13] = 7 I can encode first 13 characters of S in only 7 characters!
P[14,1] = 9 I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C
P[14,2] = 8 I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC
P[14,3] = 13 I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC
P[14,4] = 13 I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC
P[14,5] = 13 I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC
P[14,6] = 12 I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC
P[14,7] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC
P[14,8] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC
P[14,9] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC
P[14,10] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC
P[14,11] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC
P[14,12] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC
P[14,13] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC
P[14,14] = 15 I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC
Q[14] = 8 I can encode first 14 characters of S in only 8 characters!
P[15,1] = 10 I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D
P[15,2] = 10 I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD
P[15,3] = 11 I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD's count incremented
P[15,3] = 9 I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD
P[15,4] = 14 I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD
P[15,5] = 14 I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD
P[15,6] = 14 I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD
P[15,7] = 13 I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD
P[15,8] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD
P[15,9] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD
P[15,10] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD
P[15,11] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD
P[15,12] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD
P[15,13] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD
P[15,14] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD
P[15,15] = 16 I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD
Q[15] = 9 I can encode first 15 characters of S in only 9 characters!
I do not believe dynamic programming will work here, as you could have substrings about half the length of the full string in the solution. Looks like you need to use brute force. For a related problem, check out the Lempel-Ziv-Welch algorithm. It is an efficient dictionary-based algorithm that encodes repeated substrings compactly, although it does not guarantee a minimal encoding.
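For reference, the core of LZW compression can be sketched as follows. This is a simplified illustration rather than a production codec: the function name `lzw_compress` is hypothetical, the dictionary is allowed to grow without bound, and input characters are assumed to have code points below 256.

```python
def lzw_compress(text):
    """Return the list of LZW codes for text (characters assumed to be below code point 256)."""
    # Start with one dictionary entry per single character.
    dictionary = {chr(i): i for i in range(256)}
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the current match while it is still known.
            current = candidate
        else:
            # Emit the longest known match and learn the new, longer substring.
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

print(lzw_compress("ABABABA"))  # [65, 66, 256, 258]
```

Each new code above 255 stands for a substring seen earlier, so repeated substrings get cheaper the more often they occur.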
A very common way to encode RLE compressed data is to designate a special byte as the "DLE" (Data Link Escape, a name borrowed from the ASCII control characters), which means "what follows is a count and then a byte".
This way, only repeating sequences need to be encoded. Typically the DLE symbol is chosen to minimize the chance of it occurring naturally in the uncompressed data.
For your original example, let's set the full stop (or dot) as the DLE. This would encode your example as follows:
AAABBAAABBCECE => 3A2B3A2B1C1E1C1E <-- your encoding
AAABBAAABBCECE => .3ABB.3ABBCECE <-- my encoding
You would only encode a sequence if doing so actually saves space. If you limit the length of sequences to 255, so that the count fits in a byte, an encoded sequence takes 3 bytes: the DLE, the count, and the byte to repeat. You would probably not encode 3-byte sequences either, because decoding those carries slightly more overhead than a non-encoded sequence.
In your trivial example, the saving is nonexistent, but if you try to compress a bitmap containing a screenshot of a mostly white program, like Notepad or a browser, then you'll see real space savings.
If you encounter the DLE character naturally, just emit it followed by a count of 0. Since we would never encode a 0-length sequence, a DLE followed by a 0 byte decodes as a single DLE byte.
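The scheme described above can be sketched roughly like this. Several details are assumptions on my part: the escape byte is 0x10 (the ASCII DLE control character), the count is stored as a raw byte rather than a printed digit, and only runs of at least `min_run` bytes (4 by default) are encoded. The function names are illustrative.

```python
DLE = 0x10  # assumed escape byte; any byte that is rare in the data would do

def dle_encode(data: bytes, min_run: int = 4) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        # Measure the run starting at i, capped at 255 so the count fits in one byte.
        run = 1
        while i + run < len(data) and data[i + run] == byte and run < 255:
            run += 1
        if run >= min_run:
            out += bytes([DLE, run, byte])   # DLE, count, byte to repeat
        elif byte == DLE:
            out += bytes([DLE, 0]) * run     # literal DLE: DLE followed by a 0 count
        else:
            out += bytes([byte]) * run       # short run: copy through unchanged
        i += run
    return bytes(out)

def dle_decode(encoded: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] != DLE:
            out.append(encoded[i])
            i += 1
        elif encoded[i + 1] == 0:
            out.append(DLE)                  # DLE, 0 means a single literal DLE
            i += 2
        else:
            out += bytes([encoded[i + 2]]) * encoded[i + 1]
            i += 3
    return bytes(out)

raw = b"A" * 10 + b"BC"
print(list(dle_encode(raw)))  # [16, 10, 65, 66, 67]
assert dle_decode(dle_encode(raw)) == raw
```

The decoder does no validation, so truncated or malformed input will raise an `IndexError` in this sketch.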
Very clever ways of finding matching substrings may lead to considering suffix trees and suffix arrays. Thinking about suffix arrays and compression may lead you to the Burrows-Wheeler transform (http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform). That may be the most elegant way of souping up run-length encoding.
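To give a feel for the transform, here is a deliberately naive sketch that sorts all rotations of the input instead of building a suffix array, so it is only suitable for short strings. The function name `bwt` and the `$` end marker are assumptions for illustration; the marker must not occur in the input and must sort before every other character.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert sentinel not in text, "sentinel must not occur in the input"
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))  # annb$aa
```

The transform tends to place equal characters next to each other, which gives an ordinary run-length encoder longer runs to work with.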