Implementing in-memory compression for objects in Java

忘了有多久 2020-12-13 02:29

We have this use case where we would like to compress and store objects (in-memory) and decompress them as and when required.

The data we want to compress is quite

8 Answers
  • 2020-12-13 02:30

    The best compression technology I know of is ZIP. Java supports ZIP streams via java.util.zip (ZipOutputStream / ZipInputStream). All you need to do is serialize your object into a byte array and then zip it.

    Tips: use ByteArrayOutputStream, DataOutputStream, and ZipOutputStream.
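    A minimal sketch of that approach, assuming the object implements Serializable (the class and entry names are illustrative):

    import java.io.*;
    import java.util.zip.*;

    public class ZipCompressor {
        // Serialize an object and wrap it in a single ZIP entry.
        static byte[] zip(Serializable obj) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(baos)) {
                zos.putNextEntry(new ZipEntry("obj")); // ZIP needs at least one entry
                ObjectOutputStream oos = new ObjectOutputStream(zos);
                oos.writeObject(obj);
                oos.flush();                           // push the data into the entry
                zos.closeEntry();
            }
            return baos.toByteArray();
        }

        // Read the single entry back and deserialize it.
        static Object unzip(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(bytes))) {
                zis.getNextEntry();                    // position at the entry
                return new ObjectInputStream(zis).readObject();
            }
        }
    }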

  • 2020-12-13 02:35

    Similar to the previous answers, except I suggest you use DeflaterOutputStream and InflaterInputStream, as these are simpler, faster, and smaller than the alternatives. The reason the output is smaller is that they just do the compression, whereas the alternatives add file-format extras such as headers and CRC checks.
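    For byte arrays, a round-trip with these classes can look like this sketch (default compression level assumed):

    import java.io.*;
    import java.util.zip.*;

    public class RawDeflateStreams {
        static byte[] deflate(byte[] raw) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(baos)) {
                out.write(raw);             // raw DEFLATE data, no gzip/zip headers or CRC
            }
            return baos.toByteArray();
        }

        static byte[] inflate(byte[] compressed) throws IOException {
            try (InflaterInputStream in =
                     new InflaterInputStream(new ByteArrayInputStream(compressed))) {
                return in.readAllBytes();   // Java 9+
            }
        }
    }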

    If size is important, you might like to write a simple serialization of your own, because ObjectOutputStream has a significant overhead that makes small objects much larger. (It improves for larger objects, especially when compressed.)

    e.g. a serialized Integer takes 81 bytes, and compression won't help much for such a small number of bytes. It is possible to cut this significantly.
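    To see the overhead for yourself, compare the default serialized form of a single Integer with a hand-rolled 4-byte encoding (the exact byte count may vary slightly by JDK version):

    import java.io.*;

    public class OverheadDemo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream a = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(a)) {
                oos.writeObject(Integer.valueOf(42)); // stream header + class descriptor + value
            }
            ByteArrayOutputStream b = new ByteArrayOutputStream();
            try (DataOutputStream dos = new DataOutputStream(b)) {
                dos.writeInt(42);                     // exactly 4 bytes
            }
            System.out.println(a.size() + " bytes vs " + b.size() + " bytes");
        }
    }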

  • 2020-12-13 02:35

    One proposal could be to use a combination of the following streams:

    • ObjectOutputStream / ObjectInputStream for serializing/deserializing Java objects
    • GZIPOutputStream / GZIPInputStream for compressing/uncompressing. There are other options to be found in the java.util.zip package (see the sketch after this list).
    • ByteArrayOutputStream / ByteArrayInputStream for storing the data in memory as a byte array
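    As a sketch of one of those other options, the Deflater class can be used directly when you want to pick a compression level (names here are illustrative):

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class DeflaterDemo {
        static byte[] compress(byte[] raw) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();   // release native resources
            return out.toByteArray();
        }
    }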
  • 2020-12-13 02:41

    Compression of serialized objects in Java is usually... well, not so good.

    First of all, you need to understand that a Java object carries a lot of additional information that is not needed. If you have millions of objects, you pay this overhead millions of times.

    As an example, let's use a doubly linked list. Each element has a previous and a next pointer, plus you store a long value (the timestamp), a byte for the kind of interaction, and two integers for the user IDs. With pointer compression that is 6 bytes * 2 + 8 + 4 * 2 = 28 bytes. Java adds 8 bytes for the object header plus 12 bytes of padding. That makes 48 bytes per element.
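    As a sketch, the element described above would look something like this (field names invented for illustration; the byte counts in the comments follow the answer's accounting):

    class ClickEvent {
        ClickEvent previous, next; // 2 pointers, counted as 6 bytes each here
        long timestamp;            // 8 bytes
        byte kind;                 // kind of interaction (absorbed by the padding)
        int userIdA, userIdB;      // 2 * 4 bytes
    }                              // + object header and padding => ~48 bytes per element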

    Now we create 10 million lists with 20 elements each (time series of the click events of users over the last three years; we want to find patterns).

    So we have 200 million elements * 48 bytes = ~10GB of memory (OK, not that much).

    OK, besides the fact that garbage collection kills us and the overhead inside the JDK skyrockets, we end up with 10GB of memory.

    Now let's use our own memory/object storage. We store everything as a column-wise data table where each object is actually a single row. So we have 200 million rows spread across timestamp, previous, next, userIdA and userIdB columns.

    previous and next now point to row IDs and become 4 bytes (or 5 bytes if we were ever to exceed 4 billion entries, which is unlikely).

    So we have 8 + 4 + 4 + 4 + 4 = 24 bytes * 200 million = 4.8GB, and no GC problem.
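    As a sketch, such a column-wise table is just one array per column, with the row ID as the index (names are illustrative):

    class ClickEventTable {
        long[] timestamp; // 8 bytes per row (before the relative encoding below)
        int[]  previous;  // 4-byte row id instead of an object reference
        int[]  next;      // 4-byte row id
        int[]  userIdA;
        int[]  userIdB;
    }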

    Since the timestamp column stores the timestamps relative to their minimum (a min/max encoding), and our timestamps all lie within three years, we only need 5 bytes to store each timestamp. Since the pointers are now stored as relative offsets (+ and -), and the click series are closely related in time, we only need 2 bytes on average for both previous and next. And for the user IDs we use a dictionary: since the click series cover roughly 500k users, we only need 3 bytes each.

    So we now have 5 + 2 + 2 + 3 + 3 = 15 bytes * 200 million = 3GB, plus a dictionary of 4 * 500k * 4 bytes = 8MB. That makes 3GB + 8MB. Sounds different from 10GB, right?
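    A hypothetical sketch of the 5-byte timestamp encoding: three years of milliseconds is about 9.5e10, which fits in 40 bits, so each timestamp can be packed as a 5-byte offset from the column minimum:

    static void putTimestamp(byte[] column, int pos, long timestamp, long min) {
        long offset = timestamp - min;   // 0 .. ~9.5e10, fits in 40 bits
        for (int i = 4; i >= 0; i--) {   // big-endian, 5 bytes
            column[pos + i] = (byte) offset;
            offset >>>= 8;
        }
    }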

    But we are not finished yet. Since we now have rows and data instead of objects, we store each series as a table row and use special columns that are collections of arrays, each actually storing 5 values plus a pointer to the next five values and a pointer to the previous ones.

    So we have 10 million lists with 20 entries each. Including some overhead for partly filled chunks and the chunk pointers, that is per list 20 * (5 + 3 + 3) + 5 * 6 = 20 * 11 + 5 * 6 = 250 bytes, so 250 * 10 million = 2.5GB, and we can access the arrays faster than walking elements.
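    A sketch of such a chunk (the 11 bytes per entry are the 5 + 3 + 3 from above; shown as a class for readability, though in the actual layout these would themselves be table rows):

    class Chunk {
        static final int ENTRY_BYTES = 11;           // 5-byte timestamp + 2 * 3-byte user id
        byte[] packed = new byte[5 * ENTRY_BYTES];   // 5 entries per chunk
        int nextChunk, prevChunk;                    // row ids of the neighbouring chunks
        int used;                                    // how many of the 5 slots are filled
    }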

    But hey, it's not over yet: the timestamps are now stored relatively, requiring only 3 bytes per entry plus 5 bytes for the first entry. So we save a lot more: 20 * 9 + 2 + 5 * 6 = 212 bytes * 10 million = 2.12GB. And if we now gzip it all in memory, we end up at about 1GB, since we can store everything linearly, first storing the length of the array, then all timestamps, then all user IDs, making it very likely that there are compressible patterns in the bits. Since we use a dictionary, we just sort it according to the probability of each userId being part of a series.

    And since everything is now a table, you can deserialize it at almost raw read speed, so 1GB on a modern SSD costs about 2 seconds to load. Try that with Java serialization/deserialization and you can hear your inner user cry.

    So before you ever compress serialized data, store it in tables, and check for each column/property whether it can be logically compressed. And finally, have fun with it.

    And remember: 1TB of (ECC) RAM costs 10k today; it's nothing. And a 1TB SSD is 340 Euro. So do not waste your time on this issue unless you really have to.

  • 2020-12-13 02:43

    If you need to compress arbitrary objects, a possible approach is to serialize the object into a byte array, and then use e.g. the DEFLATE algorithm (the one used by GZIP) to compress it. When you need the object again, decompress and deserialize it. Not sure how efficient this would be, but it is completely general.
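    A sketch of that approach, combining Java serialization with the DEFLATE streams from java.util.zip (method names are illustrative):

    import java.io.*;
    import java.util.zip.*;

    public class DeflateSerializer {
        static byte[] compress(Serializable obj) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos =
                     new ObjectOutputStream(new DeflaterOutputStream(baos))) {
                oos.writeObject(obj);
            } // closing the chain finishes the DEFLATE stream
            return baos.toByteArray();
        }

        static Object decompress(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(
                     new InflaterInputStream(new ByteArrayInputStream(bytes)))) {
                return ois.readObject();
            }
        }
    }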

  • 2020-12-13 02:47

    If you want to compress instances of MyObject you could have it implement Serializable and then stream the objects into a compressed byte array, like so:

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gzipOut = new GZIPOutputStream(baos);
    ObjectOutputStream objectOut = new ObjectOutputStream(gzipOut);
    objectOut.writeObject(myObj1);
    objectOut.writeObject(myObj2);
    objectOut.close();
    byte[] bytes = baos.toByteArray();
    

    Then to uncompress your byte[] back into the objects:

    ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
    GZIPInputStream gzipIn = new GZIPInputStream(bais);
    ObjectInputStream objectIn = new ObjectInputStream(gzipIn);
    MyObject myObj1 = (MyObject) objectIn.readObject();
    MyObject myObj2 = (MyObject) objectIn.readObject();
    objectIn.close();
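
    Note that the objects come back in the order they were written, and that calling objectOut.close() is what finishes the GZIP stream, so it must happen before baos.toByteArray(). Both directions can throw IOException, and readObject() can also throw ClassNotFoundException.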
    