How to compress a String in Java?


I am using GZIPOutputStream or ZipOutputStream to compress a String (its string.length() is less than 20), but the compressed result is longer than the original string.

10 Answers

南笙

2020-11-28 06:09

Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.

Compressing a string that is only 20 characters long is hard, and it is not always possible. If the string contains repetition, Huffman coding or simple run-length encoding might be able to compress it, but probably not by very much.
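
To make the overhead concrete, here is a minimal sketch that gzips a 20-character string and prints both sizes. The class name `GzipOverheadDemo` and the sample string are made up for illustration. The gzip container alone adds a 10-byte header and an 8-byte trailer, so the output for an input this short should come out longer than the input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipOverheadDemo {
    /** Gzips the UTF-8 bytes of the given text and returns the compressed bytes. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the trailer, so read the buffer afterwards.
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "abcdefghijklmnopqrst"; // 20 characters, no repetition
        byte[] compressed = gzip(text);
        System.out.println("original bytes:   " + text.length());
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```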

误落风尘

2020-11-28 06:10
If you know that your strings are mostly ASCII, you could encode them as UTF-8 bytes.
```
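import java.nio.charset.StandardCharsets;
byte[] bytes = string.getBytes(StandardCharsets.UTF_8); // no checked UnsupportedEncodingException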
```
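Compared with a UTF-16 representation at two bytes per char, this may reduce the size by about 50%. (On Java 9 and later, compact strings already store Latin-1-only strings at one byte per character, so the in-memory saving may be smaller there.) However, you will get a byte array out and not a string. If you are writing it to a file, though, that should not be a problem.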

To convert back to a String:
```
private static final Charset UTF8_CHARSET = StandardCharsets.UTF_8; // both types are in java.nio.charset
...
String s = new String(bytes, UTF8_CHARSET);
```
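As a rough check of the "about 50%" figure, the following sketch compares the encoded size of the same string in UTF-8 and UTF-16. The class and method names are hypothetical. It measures encoded byte arrays, not the JVM's actual heap usage, so treat it as an approximation only.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies as UTF-16 (big-endian, no byte-order mark). */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        System.out.println("UTF-8:  " + utf8Length(ascii));  // 5 bytes: one per ASCII character
        System.out.println("UTF-16: " + utf16Length(ascii)); // 10 bytes: two per character
    }
}
```

Non-ASCII characters take two or more bytes in UTF-8, so the saving shrinks as the share of such characters grows.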
死守一世寂寞

2020-11-28 06:15

Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no saving in size.

However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
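
The JDK does not expose a way to plug in a fixed Huffman tree, but `java.util.zip` offers a related mechanism built on the same "both sides already know it" idea: a preset dictionary that sender and receiver agree on in advance. The sketch below uses that instead of a shared tree. The class name and the dictionary contents are made up for illustration, and the fixed 256-byte buffers assume short inputs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class SharedDictionaryDemo {
    // Hypothetical dictionary: substrings expected to be common in the inputs.
    // Sender and receiver must use exactly the same bytes.
    static final byte[] DICTIONARY =
            "the and ing tion hello world compress string ".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setDictionary(DICTIONARY);
            deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            byte[] buffer = new byte[256]; // assumption: short inputs only
            int n = deflater.deflate(buffer);
            return Arrays.copyOf(buffer, n);
        } finally {
            deflater.end();
        }
    }

    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buffer = new byte[256]; // assumption: short outputs only
            int n = inflater.inflate(buffer);
            // The stream records that a dictionary was used; supply it and retry.
            if (n == 0 && inflater.needsDictionary()) {
                inflater.setDictionary(DICTIONARY);
                n = inflater.inflate(buffer);
            }
            return new String(buffer, 0, n, StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid compressed data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] packed = compress("hello world");
        System.out.println(packed.length + " bytes -> " + decompress(packed));
    }
}
```

The zlib wrapper still adds a few bytes of header, dictionary ID and checksum, so whether this beats the raw string depends on how much of the input the dictionary covers.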

旧时难觅i

2020-11-28 06:15

Take a look at the Huffman algorithm.

https://codereview.stackexchange.com/questions/44473/huffman-code-implementation

The idea is that each character is replaced with a sequence of bits whose length depends on the character's frequency in the text (the more frequent the character, the shorter the sequence).

You can read your entire text and build a table of codes, for example:

| Symbol | Code |
|--------|------|
| a      | 0    |
| s      | 10   |
| e      | 110  |
| m      | 111  |

The algorithm builds a symbol tree based on the text input. The greater the variety of characters you have, the worse the compression will be.

But depending on your text, it could be effective.
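
A minimal sketch of building such a code table in Java follows. It is not the implementation from the linked review; the class and method names are hypothetical, and the "bits" are kept as a `String` of `0`/`1` characters for readability rather than packed into bytes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    /** A tree node: leaves carry a symbol, internal nodes carry two children. */
    static final class Node {
        final char symbol;
        final int weight;
        final Node left, right;

        Node(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight; this.left = null; this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }

        boolean isLeaf() { return left == null; }
    }

    /** Builds a symbol-to-code table from the character frequencies of the text. */
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        freq.forEach((c, w) -> queue.add(new Node(c, w)));
        // Repeatedly merge the two lightest subtrees until one tree remains.
        while (queue.size() > 1) {
            queue.add(new Node(queue.poll(), queue.poll()));
        }
        Map<Character, String> codes = new HashMap<>();
        if (!queue.isEmpty()) assign(queue.poll(), "", codes);
        return codes;
    }

    private static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            // A text with a single distinct symbol still needs a one-bit code.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    /** Encodes the text as a string of '0' and '1' characters using the table. */
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(codes.get(c));
        return bits.toString();
    }

    public static void main(String[] args) {
        String text = "aaaassem";
        Map<Character, String> codes = buildCodes(text);
        System.out.println(codes + " -> " + encode(text, codes).length() + " bits");
    }
}
```

For the input `aaaassem` the code lengths come out as 1, 2, 3 and 3 bits for `a`, `s`, `e` and `m`, the same shape as the table above, for 14 bits in total; the exact 0/1 assignment may differ. Keep in mind that the decoder also needs the table, which for a 20-character string can cost more than the encoding saves.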

被撕碎了的回忆

2020-11-28 06:16

Huffman coding might help, but only if a few characters occur very frequently in your small String.

悲&欢浪女

2020-11-28 06:18
Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general-purpose algorithm and is not intended for encoding small strings.

If you need this, a possible solution is a pair of custom lookup tables, one HashMap<String, String> for encoding and one for decoding. This allows a simple one-to-one mapping:
```
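// Both maps must be populated up front with every string you expect to see.
Map<String, String> toCompressed = new HashMap<>();
Map<String, String> toUncompressed = new HashMap<>();

String compressed = toCompressed.get(uncompressed); // null if the string is unknown
// ...
String uncompressed = toUncompressed.get(compressed);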
```
Clearly, this requires setup, and is only practical for a small number of strings.