How to compress small strings

没有蜡笔的小新 2021-02-01 09:22

I have an SQLite database full of a huge number of URLs. It takes up a huge amount of disk space, and accessing it causes many disk seeks and is slow. Average URL path length is

7 answers
  •  轻奢々 (original poster)
    2021-02-01 10:08

    What's the format of your URLs?

    If many URLs share one or more domains, and about 2 billion distinct domain names are enough for you, you can create a pool of domain names. If URLs also share relative paths, you can pool those too.

    For every URL in your database, split it into three parts: the scheme and domain, e.g. http://mydomain.com; the relative path, e.g. /my/path/; and the rest, e.g. mypage.html?id=4 (if you have query string parameters).

    This way you should reduce the overhead of every domain and relative path to just about 8 bytes (for instance, two 4-byte integer ids, one per pool). That should be smaller, and fast if you want to look up parts of URLs.

    Note: the "http" scheme string alone is 4 bytes; you save everything beyond that on every domain entry. If every URL starts with "http://www." you save the 7 bytes of "://www." each time.
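
    A minimal sketch of the pooling idea in Python, assuming in-memory pools; the names Pool, split_url, encode and decode are made up for illustration, and in SQLite the two pools would more likely be two lookup tables keyed by an integer id. It is not a full URL normalizer: odd cases such as an empty query string ("?" with nothing after it) will not round-trip exactly.

```python
from urllib.parse import urlsplit


class Pool:
    """Hypothetical helper: maps each distinct string to a small integer id."""

    def __init__(self):
        self.ids = {}     # string -> id
        self.values = []  # id -> string

    def intern(self, s):
        # Store the string once; every later occurrence reuses the same id.
        if s not in self.ids:
            self.ids[s] = len(self.values)
            self.values.append(s)
        return self.ids[s]


domains = Pool()  # pool of "scheme://domain" strings
paths = Pool()    # pool of shared relative paths


def split_url(url):
    """Split a URL into (scheme + domain, relative path, the rest)."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    # Everything up to and including the last "/" is the shared relative path.
    directory, sep, page = parts.path.rpartition("/")
    rest = page
    if parts.query:
        rest += "?" + parts.query
    if parts.fragment:
        rest += "#" + parts.fragment
    return origin, directory + sep, rest


def encode(url):
    """Return (domain id, path id, remaining string) for one URL."""
    origin, directory, rest = split_url(url)
    return domains.intern(origin), paths.intern(directory), rest


def decode(row):
    """Rebuild the URL from its pooled ids and remaining string."""
    domain_id, path_id, rest = row
    return domains.values[domain_id] + paths.values[path_id] + rest
```

    With this, encode("http://mydomain.com/my/path/mypage.html?id=4") stores only two ids plus "mypage.html?id=4", and any other page under the same domain and path reuses the same two ids.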

    Experiment a bit with how to split and structure your URLs; I'm betting this is where you'll find your compression. Now, what could you do with the remaining string that is not a common domain or relative path?

    Compressing URLs

    Many general-purpose compression methods build on entropy coding, such as arithmetic coding. Shannon, the father of information theory, laid the groundwork for this in his 1948 paper. I've been working with compression for a while, and the one thing I've always found is that general-purpose compression never solves the actual problem.

    You're in luck, because URLs have structure, and you should use that structure to store your URLs better.

    If you want to apply a compression algorithm (I think the topic should be changed to reflect URL compression, because it's domain specific), you'll have to examine the entropy of your data, because it tells you how much storage you can expect to save. URLs are ASCII characters: nothing outside the ASCII range 0x20-0x7E will occur, and if you throw away case sensitivity you're down to fewer than 64 distinct states: !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ including white-space.
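
    To examine the entropy, something along these lines would do. It is a rough sketch: entropy_bits_per_char is a made-up name, and it measures only the order-0 (single character) entropy after lowercasing, so it ignores any structure between neighbouring characters.

```python
import math
from collections import Counter


def entropy_bits_per_char(strings):
    """Order-0 Shannon entropy, in bits per character, of lowercased strings."""
    counts = Counter()
    for s in strings:
        counts.update(s.lower())  # count every character, ignoring case
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the observed character frequencies
    return -sum(n / total * math.log2(n / total) for n in counts.values())
```

    If the result comes out well below 6 bits per character, an entropy coder driven by a frequency table has room to beat a plain fixed-width encoding.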

    You could create a frequency table of the remaining characters and perform arithmetic encoding. You know that you'll need at most 6 bits per character, which means that for every character in your URL database you're wasting 2 bits right now. If you just shifted things into place and used a lookup table, you'd get your 20% compression (6 bits instead of 8 is a 25% saving). Just like that ;)
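
    The shift-and-lookup-table variant could look roughly like this. The names ALPHABET, CODE, pack and unpack are made up for the sketch, and the alphabet is copied from the list above as written, so a character outside it (for example "_") raises a KeyError until you add it; there is room for 64 symbols in total. Lowercasing is lossy, and the character count has to be stored next to the packed bytes.

```python
import string

# Lookup table: the characters listed above, with the space first.
ALPHABET = " !\"#%&'()*+,-./0123456789:;<=>?@" + string.ascii_lowercase + "{|}~"
assert len(ALPHABET) <= 64  # every symbol must fit in 6 bits
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def pack(text):
    """Pack lowercased text into 6 bits per character, zero-padded to whole bytes."""
    text = text.lower()
    bits = 0
    for ch in text:
        bits = (bits << 6) | CODE[ch]  # shift each 6-bit code into place
    nbits = 6 * len(text)
    pad = -nbits % 8  # padding bits needed to reach a byte boundary
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def unpack(data, length):
    """Inverse of pack; length is the number of characters that were packed."""
    bits = int.from_bytes(data, "big") >> (-6 * length % 8)  # drop the padding
    chars = []
    for i in range(length):
        shift = 6 * (length - 1 - i)
        chars.append(ALPHABET[(bits >> shift) & 0x3F])
    return "".join(chars)
```

    A 16-character string such as "mypage.html?id=4" packs into 12 bytes this way.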

    Because the data is so specific it's really not a good idea to just compress with general purpose methods. It's better to structure the information and split that into pieces of data you can store more efficiently. You know a lot about the domain, use that knowledge to compress your data.
