Really simple short string compression

Submitted by China☆狼群 on 2019-12-17 09:46:33

Question


Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?

I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.


Answer 1:


I think the key question here is "Why do you want to compress URLs?"

Trying to shorten long URLs for the address bar?

You're better off storing the original URL somewhere (a database, a text file, ...) alongside a hash of the non-domain part (MD5 is fine). You can then have a simple page (or an HttpModule if you're feeling flashy) that reads the hash and looks up the real URL. This is how TinyURL and others work.

For example:

http://mydomain.com/folder1/folder2/page1.aspx

could be shortened to:

http://mydomain.com/2d4f1c8a
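
A minimal sketch of that lookup scheme, assuming an in-memory store (the UrlShortener class and its dictionary are illustrative, not a specific library API; a real implementation would persist the mapping):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical shortener: maps a short hash-derived code to the original URL.
class UrlShortener
{
    private readonly Dictionary<string, string> _store = new Dictionary<string, string>();

    public string Shorten(string url)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            // First 4 bytes as hex gives an 8-char code like "2d4f1c8a" above.
            string code = BitConverter.ToString(hash, 0, 4).Replace("-", "").ToLowerInvariant();
            _store[code] = url; // collision handling omitted in this sketch
            return code;
        }
    }

    public string Lookup(string code)
    {
        return _store.TryGetValue(code, out var url) ? url : null;
    }
}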

Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting that back to a string which is valid as part of a URL (e.g. Base64) will negate any benefit you gained from compression, since Base64 expands binary data by a third (four characters for every three bytes).

Storing lots of URLs in memory or on disk?

Use the built-in compression support in the System.IO.Compression namespace, or the ZLib library, which is simple and very effective. Since you will be storing binary data, the compressed output will be fine as-is; you'll need to decompress it before using it as a URL.
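
For the storage case, a minimal sketch using System.IO.Compression.DeflateStream (the UrlStore name is illustrative):

using System.IO;
using System.IO.Compression;
using System.Text;

static class UrlStore
{
    // Compress a URL to raw DEFLATE bytes for storage.
    public static byte[] Compress(string url)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(url);
                deflate.Write(bytes, 0, bytes.Length);
            }
            return output.ToArray();
        }
    }

    // Decompress back to the original string before using it as a URL.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}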




Answer 2:


As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.

DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:

// Requires a reference to DotNetZip and:
//   using Ionic.Zlib;
string[] orig = {
    "folder1/folder2/page1.aspx",
    "folderBB/folderAA/page2.aspx",
};

public void Run()
{
    foreach (string s in orig)
    {
        System.Console.WriteLine("original    : {0}", s);
        byte[] compressed = DeflateStream.CompressString(s);
        System.Console.WriteLine("compressed  : {0}", ByteArrayToHexString(compressed));
        string uncompressed = DeflateStream.UncompressString(compressed);
        System.Console.WriteLine("uncompressed: {0}\n", uncompressed);
    }
}

// Helper used above: render the compressed bytes as lowercase hex.
static string ByteArrayToHexString(byte[] data)
{
    return BitConverter.ToString(data).Replace("-", "").ToLowerInvariant();
}

Using that code, here are my test results:

original    : folder1/folder2/page1.aspx
compressed  : 4bcbcf49492d32d44f03d346fa0589e9a9867a89c5051500
uncompressed: folder1/folder2/page1.aspx

original    : folderBB/folderAA/page2.aspx
compressed  : 4bcbcf49492d7272d24f03331c1df50b12d3538df4128b0b2a00
uncompressed: folderBB/folderAA/page2.aspx

So you can see the "compressed" byte array, when represented in hex, is longer than the original, about twice as long. The reason is that each byte becomes two hex characters.

You could compensate somewhat by using base-62 instead of base-16 (hex) to represent the bytes. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (26) + A-Z (26) = 62 digits in total. That would shorten the output significantly. I haven't tried that yet.
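
For illustration, a minimal base-62 encoder sketch built on System.Numerics.BigInteger (the answer doesn't show an encoder, so this is an assumption; it is one-way, for comparing lengths only):

using System;
using System.Numerics;
using System.Text;

static class Base62
{
    const string Alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string Encode(byte[] data)
    {
        // BigInteger reads the array as little-endian; the extra trailing
        // zero byte keeps the value non-negative.
        var bytes = new byte[data.Length + 1];
        Array.Copy(data, bytes, data.Length);
        var value = new BigInteger(bytes);

        var sb = new StringBuilder();
        while (value > 0)
        {
            sb.Insert(0, Alphabet[(int)(value % 62)]);
            value /= 62;
        }
        return sb.Length == 0 ? "0" : sb.ToString();
    }
}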


EDIT
Ok, I tested a base-62 encoder. It shortens the hex string by about half. I had figured it would cut it to 25% (62/16 ≈ 4), but that arithmetic is wrong: a digit's capacity grows with the logarithm of the base, so going from base-16 to base-62 can only shrink the string to about log(16)/log(62) ≈ 67% of its hex length. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach; you really want a hash value.




Answer 3:


I'd suggest looking in the System.IO.Compression namespace. There's an article on CodeProject that may help.




Answer 4:


What's your goal?

  • A shorter URL? Try URL shorteners like http://tinyurl.com/ or http://is.gd/
  • Storage space? Check out System.IO.Compression. (Or SharpZipLib)



Answer 5:


I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to the base64 representation of the original URL text).

See http://blog.alivate.com.au/packed-url/




Answer 6:


I would start by trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/

Zip should work well for text strings, and I am not sure it is worth implementing a compression algorithm yourself...




Answer 7:


Have you tried just using gzip?

No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
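
One quick way to check is to gzip a sample URL and compare lengths. Note that the gzip format adds a fixed header and trailer (about 18 bytes), which matters a lot at these sizes. A sketch (names illustrative):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipCheck
{
    static void Main()
    {
        string url = "http://mydomain.com/folder1/folder2/page1.aspx";
        byte[] input = Encoding.UTF8.GetBytes(url);

        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(input, 0, input.Length);
            }
            // For very short strings the ~18 bytes of gzip header/trailer
            // can make the "compressed" output larger than the input.
            Console.WriteLine("original:   {0} bytes", input.Length);
            Console.WriteLine("compressed: {0} bytes", output.ToArray().Length);
        }
    }
}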




Answer 8:


The open source library SharpZipLib is easy to use and will provide you with compression tools.




Answer 9:


You can use the DEFLATE algorithm directly, without any headers, checksums, or footers, as described in this question: Python: Inflate and Deflate implementations

In my test this cut a 4,100-character URL down to 1,270 Base64 characters, allowing it to fit inside Internet Explorer's 2,083-character URL limit.

And here's an example of a 4,000-character URL, which can't be solved with a hash table since the applet can exist on any server.
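
In .NET, System.IO.Compression.DeflateStream already emits raw DEFLATE (RFC 1951) with no zlib header or checksum, so a sketch of the same idea might look like this (the DeflateUrl name is illustrative; standard Base64 uses '+', '/', and '=', which you'd percent-encode or swap for a URL-safe alphabet):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class DeflateUrl
{
    // Raw DEFLATE, then Base64, to embed a long value in a URL.
    public static string Pack(string longValue)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(longValue);
                deflate.Write(bytes, 0, bytes.Length);
            }
            return Convert.ToBase64String(output.ToArray());
        }
    }

    public static string Unpack(string packed)
    {
        using (var input = new MemoryStream(Convert.FromBase64String(packed)))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}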



Source: https://stackoverflow.com/questions/1192732/really-simple-short-string-compression
