I have an sqlite database full of huge number of URLs and it\'s taking huge amount of diskspace, and accessing it causes many disk seeks and is slow. Average URL path length is
I've tried this using the following strategy. It's using a shared dictionary, but working around the way python's zlib doesn't give you access to the dictionary itself.
First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.
Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.
Here my the python code (apologies for the ugly testing):
import zlib
class Trained_short_string_compressor(object):
def __init__(self,
training_set,
bits = -zlib.MAX_WBITS,
compression = zlib.Z_DEFAULT_COMPRESSION,
scheme = zlib.DEFLATED):
# Use a negative number of bits, so the checksum is not included.
compressor = zlib.compressobj(compression,scheme,bits)
decompressor = zlib.decompressobj(bits)
junk_offset = 0
for line in training_set:
junk_offset += len(line)
# run the training line through the compressor and decompressor
junk_offset -= len(decompressor.decompress(compressor.compress(line)))
# use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and
# not flushing wastes space.
junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))
self.junk_offset = junk_offset
self.compressor = compressor
self.decompressor = decompressor
def compress(self,s):
compressor = self.compressor.copy()
return compressor.compress(s)+compressor.flush()
def decompress(self,s):
decompressor = self.decompressor.copy()
return (decompressor.decompress(s)+decompressor.flush())[self.junk_offset:]
Testing it, I saved over 30% on a bunch of 10,000 shortish (50 -> 300 char) unicode strings. It also took about 6 seconds to compress and decompress them (compared to about 2 seconds using simple zlib compression / decompression). On the other hand, the simple zlib compression saved about 5%, not 30%.
def test_compress_small_strings():
lines =[l for l in gzip.open(fname)]
compressor=Trained_short_string_compressor(lines[:500])
import time
t = time.time()
s = 0.0
sc = 0.
for i in range(10000):
line = lines[1000+i] # use an offset, so you don't cheat and compress the training set
cl = compressor.compress(line)
ucl = compressor.decompress(cl)
s += len(line)
sc+=len(cl)
assert line == ucl
print 'compressed',i,'small strings in',time.time()-t,'with a ratio of',s0/s
print 'now, compare it ot a naive compression '
t = time.time()
for i in range(10000):
line = lines[1000+i]
cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,zlib.DEFLATED,-zlib.MAX_WBITS)
cl=cr.compress(line)+cr.flush()
ucl = zlib.decompress(cl,-zlib.MAX_WBITS)
sc += len(cl)
assert line == ucl
print 'naive zlib compressed',i,'small strings in',time.time()-t, 'with a ratio of ',sc/s
Note, it's not persistent if you delete it. If you wanted persistence, you would have to remember the training set.