How to compress small strings

后端 未结 7 1291
没有蜡笔的小新
没有蜡笔的小新 2021-02-01 09:22

I have an sqlite database full of huge number of URLs and it\'s taking huge amount of diskspace, and accessing it causes many disk seeks and is slow. Average URL path length is

相关标签:
7条回答
  • 2021-02-01 09:56

    Have you considered using static Huffman coding?

    You could use your existing body of URLs to compute Huffmann codes over all bytes occuring in URLs according to their frequency. You can then store that set of codes once, and encode all URLs using it. I think it should give decent compression.

    0 讨论(0)
  • 2021-02-01 09:58

    How to you use the URL table?

    Do you usually do a "range scan" or unique id lookup only?

    If you won't do something like WHERE url like "/xxxx/question/%". You could use a hashed index rather then a b-tree index on varchar() to reduce the number of disk seeks.

    0 讨论(0)
  • 2021-02-01 09:59

    Is that 97 bytes, or 97 8-bit ASCII characters, or 97 16-bit Unicode characters?

    Assuming that all your URLs are legal URLs as per http://www.w3.org/Addressing/URL/url-spec.txt, then you should have only ASCII characters.

    If 97 16-bit Unicode characters simply storing the lower byte of each character will automatically give you 50% savings.

    If 97 8-bit characters, notice that you only need 7-bits. You can simply pass in 7 bits at a time into your bitstream and store that bitstream into your database; use some older 7-bit transmission protocol; or come up with your own adhoc way of storing every 8th character's bits in the high bits of the previous 7 characters.

    0 讨论(0)
  • 2021-02-01 10:08

    What's the format of your URLs?

    If any URL share one or more domain and you're sufficient with about 2 billion domain names you can create a pool for domain names. And if you have shared relative paths you can pool them to.

    For every URL in your database, split each URL into three parts. the scheme and domain e.g. http://mydomain.com the realtive url /my/path/ and then the rest mypage.html?id=4 (if you have query string parameters)

    This way you should reduce the overhead of every domain and relative path to just about 8 bytes. That should be better, and fast if you wanna lookup parts of URLs.

    Note: just the "http" scheme string itself is 4 bytes, you'll save anything beyond that on every domain entry. If every URL starts with "http://www." you'll save "://www." 7 bytes each time.

    Experiment a bit on how to split and structure URLs, I'm betting this is were you'll find your compression. Now, the remaining string that is not a common domain or relative path, what could you do with that?

    Compressing URLs

    General purpose compression such methods are derived from arithmetic encoding. Shannon the father of information theory wrote a paper about this in the 60's. I've been working with compression for a while and the one thing I've always found, is that general purpose compression never solves the actual problem.

    You're in luck, because URLs have structure and that structure you should utilize to better store your URLs.

    If you wanna apply a compression algorithm (I think the topic should be changed to reflect URL compression, because it's domain specific) you'll have to examine the entropy of your data. Because it will tell you something about the yield of storage. URLs are ASCII characters, any character not within the ASCII range 0x20-0x7E won't be occurring and throwing away case sensitivity, you're down to a mere 63 distinct states. !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ including white-space.

    You could create a frequency table of the remaining characters and perform arithmetic encoding. You know that you'll need at most 6-bits, which means for every character in your URL database you're wasting 2 bits right now, and if you just shifted things into place and used a lookup table, you'd get your 20% compression. Just like that ;)

    Because the data is so specific it's really not a good idea to just compress with general purpose methods. It's better to structure the information and split that into pieces of data you can store more efficiently. You know a lot about the domain, use that knowledge to compress your data.

    0 讨论(0)
  • 2021-02-01 10:08

    Abstract:

    A common problem of large scale search engines and web spiders is how to handle a huge number of encountered URLs. Traditional search engines and web spiders use hard disk to store URLs without any compression. This results in slow performance and more space requirement. This paper describes a simple URL compression algorithm allowing efficient compression and decompression. The compression algorithm is based on a delta encoding scheme to extract URLs sharing common prefixes and an AVL tree to get efficient search speed. Our results show that the 50 % of size reduction is achieved. 1.

    -- Kasom Koht-arsa Department of Computer Engineering.

    Resource

    0 讨论(0)
  • 2021-02-01 10:17

    I've tried this using the following strategy. It's using a shared dictionary, but working around the way python's zlib doesn't give you access to the dictionary itself.

    First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.

    Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.

    Here my the python code (apologies for the ugly testing):

    import zlib
    class Trained_short_string_compressor(object):
        def __init__(self,
                     training_set, 
                     bits = -zlib.MAX_WBITS,
                     compression = zlib.Z_DEFAULT_COMPRESSION,
                     scheme = zlib.DEFLATED):
            # Use a negative number of bits, so the checksum is not included.
            compressor = zlib.compressobj(compression,scheme,bits)
            decompressor = zlib.decompressobj(bits)
            junk_offset = 0
            for line in training_set:
                junk_offset += len(line)
                # run the training line through the compressor and decompressor
                junk_offset -= len(decompressor.decompress(compressor.compress(line)))
    
            # use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and 
            # not flushing wastes space.
            junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))
    
            self.junk_offset = junk_offset
            self.compressor = compressor
            self.decompressor = decompressor
    
        def compress(self,s):
            compressor = self.compressor.copy()
            return compressor.compress(s)+compressor.flush()
    
        def decompress(self,s):
            decompressor = self.decompressor.copy()
            return (decompressor.decompress(s)+decompressor.flush())[self.junk_offset:]
    

    Testing it, I saved over 30% on a bunch of 10,000 shortish (50 -> 300 char) unicode strings. It also took about 6 seconds to compress and decompress them (compared to about 2 seconds using simple zlib compression / decompression). On the other hand, the simple zlib compression saved about 5%, not 30%.

    def test_compress_small_strings():
        lines =[l for l in gzip.open(fname)]
        compressor=Trained_short_string_compressor(lines[:500])
    
        import time
        t = time.time()
        s = 0.0
        sc = 0.
        for i in range(10000):
            line = lines[1000+i] # use an offset, so you don't cheat and compress the training set
            cl = compressor.compress(line)
            ucl = compressor.decompress(cl)
            s += len(line)
            sc+=len(cl)
            assert line == ucl
    
        print 'compressed',i,'small strings in',time.time()-t,'with a ratio of',s0/s
        print 'now, compare it ot a naive compression '
        t = time.time()
        for i in range(10000):
            line = lines[1000+i]
            cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,zlib.DEFLATED,-zlib.MAX_WBITS)
            cl=cr.compress(line)+cr.flush()
            ucl = zlib.decompress(cl,-zlib.MAX_WBITS)
            sc += len(cl)
            assert line == ucl
    
    
        print 'naive zlib compressed',i,'small strings in',time.time()-t, 'with a ratio of ',sc/s 
    

    Note, it's not persistent if you delete it. If you wanted persistence, you would have to remember the training set.

    0 讨论(0)
提交回复
热议问题