How to compress small strings

没有蜡笔的小新 2021-02-01 09:22

I have an SQLite database full of a huge number of URLs. It takes up a huge amount of disk space, and accessing it causes many disk seeks and is slow. Average URL path length is

7 answers

借酒劲吻你 (original poster)

2021-02-01 10:17

I've tried this using the following strategy. It uses a shared dictionary, but works around the fact that Python's zlib doesn't give you access to the dictionary itself.

First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.

Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.

Here is my Python code (apologies for the ugly testing):

import zlib


class Trained_short_string_compressor(object):
    """Compress short strings with a zlib stream primed on training data.

    Under Python 3, the training lines and all inputs must be bytes.
    """

    def __init__(self,
                 training_set,
                 bits=-zlib.MAX_WBITS,
                 compression=zlib.Z_DEFAULT_COMPRESSION,
                 scheme=zlib.DEFLATED):
        # Use a negative number of bits (raw deflate), so that no header or
        # checksum is included.
        compressor = zlib.compressobj(compression, scheme, bits)
        decompressor = zlib.decompressobj(bits)
        junk_offset = 0
        for line in training_set:
            junk_offset += len(line)
            # Run the training line through the compressor and decompressor.
            junk_offset -= len(decompressor.decompress(compressor.compress(line)))

        # Use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and
        # not flushing wastes space.
        junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))

        # Training bytes fed in but not yet emitted by the decompressor; they
        # would show up at the start of every decompressed string.
        self.junk_offset = junk_offset
        self.compressor = compressor
        self.decompressor = decompressor

    def compress(self, s):
        # Work on a copy, so the trained state is reused for every string.
        compressor = self.compressor.copy()
        return compressor.compress(s) + compressor.flush()

    def decompress(self, s):
        # Same idea: a fresh copy of the trained decompressor per string.
        decompressor = self.decompressor.copy()
        return (decompressor.decompress(s) + decompressor.flush())[self.junk_offset:]

Testing it, I saved over 30% on a batch of 10,000 shortish (50 to 300 character) Unicode strings. Compressing and decompressing them took about 6 seconds, compared to about 2 seconds using simple zlib compression and decompression. On the other hand, simple zlib compression saved only about 5%, not 30%.

import gzip
import time
import zlib


def test_compress_small_strings(fname):
    # fname: path to a gzipped file with one string per line
    # (it was left undefined in the original snippet).
    lines = [l for l in gzip.open(fname)]
    compressor = Trained_short_string_compressor(lines[:500])

    t = time.time()
    s = 0.0   # total uncompressed size
    sc = 0.0  # total compressed size
    for i in range(10000):
        line = lines[1000 + i]  # use an offset, so you don't cheat and compress the training set
        cl = compressor.compress(line)
        ucl = compressor.decompress(cl)
        s += len(line)
        sc += len(cl)
        assert line == ucl

    print('compressed', i + 1, 'small strings in', time.time() - t, 'with a ratio of', sc / s)
    print('now, compare it to a naive compression')
    t = time.time()
    sc = 0.0  # reset, so the naive ratio does not include the trained run
    for i in range(10000):
        line = lines[1000 + i]
        cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS)
        cl = cr.compress(line) + cr.flush()
        ucl = zlib.decompress(cl, -zlib.MAX_WBITS)
        sc += len(cl)
        assert line == ucl

    print('naive zlib compressed', i + 1, 'small strings in', time.time() - t, 'with a ratio of', sc / s)

Note that the trained compressor is not persistent: once the object is deleted, its state is gone. If you want persistence, you have to keep the training set so that the compressor can be rebuilt from it.
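For what it's worth, Python 3.3 and later accept a zdict argument in zlib.compressobj and zlib.decompressobj, which may be a simpler way to get a shared dictionary that survives restarts. The following is only a minimal sketch under some assumptions: the sample URLs, the ZDICT constant, and the compress/decompress helpers are hypothetical names made up for illustration, and raw deflate (negative wbits) combined with zdict is assumed to be supported by the Python version in use (reasonably recent Python 3 releases).

```python
import zlib

# Hypothetical training samples; real ones would be representative URLs.
TRAINING_SET = [
    b"http://www.example.com/users/profile?id=12345",
    b"http://www.example.com/articles/2021/01/31/index.html",
    b"http://www.example.com/articles/2021/02/01/index.html",
]

# The dictionary is plain bytes, so it can be stored in a file or a table
# and reloaded later. Deflate can reference at most roughly the last
# 32 KiB of it, so only the tail is kept.
ZDICT = b"".join(TRAINING_SET)[-32768:]


def compress(data, zdict=ZDICT):
    # Negative wbits selects raw deflate (no header, no checksum).
    c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED,
                         -zlib.MAX_WBITS, zdict=zdict)
    return c.compress(data) + c.flush()


def decompress(blob, zdict=ZDICT):
    # The decompressor must be given exactly the same dictionary bytes.
    d = zlib.decompressobj(-zlib.MAX_WBITS, zdict=zdict)
    return d.decompress(blob) + d.flush()


url = b"http://www.example.com/articles/2021/02/02/index.html"
packed = compress(url)
assert decompress(packed) == url
print(len(url), len(packed))
```

With this variant, persistence comes down to saving the dictionary bytes next to the data. Anything compressed with one dictionary can only be read back with that same dictionary, so it should not be changed once data has been written.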
