How to compress small strings

没有蜡笔的小新 2021-02-01 09:22

I have an SQLite database full of a huge number of URLs, and it's taking a huge amount of disk space; accessing it causes many disk seeks and is slow. The average URL path length is

7 answers
  •  借酒劲吻你
    2021-02-01 10:17

    I've tried this using the following strategy. It uses a shared dictionary, but works around the fact that Python's zlib doesn't give you access to the dictionary itself.

    First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.

    Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.

    Here is my Python code (apologies for the ugly testing):

    import zlib

    class Trained_short_string_compressor(object):
        def __init__(self,
                     training_set,
                     bits=-zlib.MAX_WBITS,
                     compression=zlib.Z_DEFAULT_COMPRESSION,
                     scheme=zlib.DEFLATED):
            # Use a negative number of bits, so the checksum is not included.
            compressor = zlib.compressobj(compression, scheme, bits)
            decompressor = zlib.decompressobj(bits)
            junk_offset = 0
            for line in training_set:
                junk_offset += len(line)
                # Run the training line through the compressor and decompressor.
                junk_offset -= len(decompressor.decompress(compressor.compress(line)))

            # Use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and
            # not flushing wastes space.
            junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))

            self.junk_offset = junk_offset
            self.compressor = compressor
            self.decompressor = decompressor

        def compress(self, s):
            compressor = self.compressor.copy()
            return compressor.compress(s) + compressor.flush()

        def decompress(self, s):
            decompressor = self.decompressor.copy()
            return (decompressor.decompress(s) + decompressor.flush())[self.junk_offset:]
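
    For clarity, a minimal usage sketch of the class above (the training URLs here are invented; in practice you would feed in a sample of rows from the database):

    training = [b'http://example.com/images/logo.png',
                b'http://example.com/images/banner.png',
                b'http://example.com/css/site.css']
    c = Trained_short_string_compressor(training)
    blob = c.compress(b'http://example.com/images/footer.png')
    assert c.decompress(blob) == b'http://example.com/images/footer.png'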
    

    Testing it, I saved over 30% on a bunch of 10,000 shortish (50 -> 300 char) unicode strings. It also took about 6 seconds to compress and decompress them (compared to about 2 seconds using simple zlib compression / decompression). On the other hand, the simple zlib compression saved about 5%, not 30%.

    import gzip

    def test_compress_small_strings():
        # fname is assumed to point at a gzip file with one URL per line.
        lines = [l for l in gzip.open(fname)]
        compressor = Trained_short_string_compressor(lines[:500])

        import time
        t = time.time()
        s = 0.0
        sc = 0.0
        for i in range(10000):
            line = lines[1000 + i]  # use an offset, so you don't cheat and compress the training set
            cl = compressor.compress(line)
            ucl = compressor.decompress(cl)
            s += len(line)
            sc += len(cl)
            assert line == ucl

        print('compressed', i + 1, 'small strings in', time.time() - t, 'with a ratio of', sc / s)

        print('now, compare it to a naive compression')
        t = time.time()
        sc = 0.0
        for i in range(10000):
            line = lines[1000 + i]
            cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS)
            cl = cr.compress(line) + cr.flush()
            ucl = zlib.decompress(cl, -zlib.MAX_WBITS)
            sc += len(cl)
            assert line == ucl

        print('naive zlib compressed', i + 1, 'small strings in', time.time() - t, 'with a ratio of', sc / s)
    

    Note that the trained compressor isn't persistent: if you throw the object away, you lose the shared state. If you want persistence, you have to remember the training set and rebuild the compressor from it.
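
    On Python 3.3+, zlib.compressobj and zlib.decompressobj also accept a zdict argument, so you can persist the preset dictionary bytes alongside the database instead of re-running a training set. A minimal sketch of that approach (the dictionary contents and URLs below are invented):

    import zlib

    # Preset dictionary: common URL substrings, assumed to be built offline
    # from a sample of rows and stored next to the database.
    ZDICT = b'http://example.com/static/images//api/v1/users/'

    def compress_url(url):
        c = zlib.compressobj(zlib.Z_BEST_COMPRESSION, zlib.DEFLATED,
                             -zlib.MAX_WBITS, zdict=ZDICT)
        return c.compress(url) + c.flush()

    def decompress_url(blob):
        d = zlib.decompressobj(-zlib.MAX_WBITS, zdict=ZDICT)
        return d.decompress(blob) + d.flush()

    url = b'http://example.com/api/v1/users/42'
    assert decompress_url(compress_url(url)) == url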
