问题

By small byte arrays I mean arrays of bytes with length from 10 up to 30.

By store I mean storing them in the RAM, not serializing and persisting to the filesystem.

System macOS 10.12.6, Oracle jdk1.8.0_141 64bit, JVM args -Xmx1g

Example: Expected behavior for new byte[200 * 1024 * 1024] is ≈200mb of the heap space

public static final int TARGET_SIZE = 200 * 1024 * 1024;
public static void main(String[] args) throws InterruptedException {
    byte[] arr = new byte[TARGET_SIZE];
    System.gc();
    System.out.println("Array size: " + arr.length);
    System.out.println("HeapSize: " + Runtime.getRuntime().totalMemory());
    Thread.sleep(60000);
}

However for smaller arrays math is not so simple

public static final int TARGET_SIZE = 200 * 1024 * 1024;
public static void main(String[] args) throws InterruptedException {
    final int oneArraySize = 20;
    final int numberOfArrays = TARGET_SIZE / oneArraySize;
    byte[][] arrays = new byte[numberOfArrays][];
    for (int i = 0; i < numberOfArrays; i++) {
        arrays[i] = new byte[oneArraySize];
    }
    System.gc();
    System.out.println("Arrays size: " + arrays.length);
    System.out.println("HeapSize: " + Runtime.getRuntime().totalMemory());
    Thread.sleep(60000);
}

And even worse

Question is

~~From where this overhead is coming?~~ How to efficiently store and work with small byte arrays (chunks of data)?

Update 1

for new byte[200*1024*1024][1] it eats

Basic math says that new byte[1] weights 24 bytes.

Update 2

According to What is the memory consumption of an object in Java? the minimum size of an object in Java is 16 bytes. From my previous "measurements" 24 bytes -4 bytes for int length -1 actual byte of my data = 3 bytes of some ~~other garbage~~ padding.

回答1:

The answer by Eugene explains the reason of why you are observing such an increase in memory consumption for a large number of arrays. The question in the title, "How to efficiently store small byte arrays in Java?", may then be answered with: Not at all. ¹

However, there probably are ways to achieve your goals. As usual, the "best" solution here will depend on how this data is going to be used. A very pragmatic approach would be: Define an interface for your data structure.

In the simplest case, this interface could just be

interface ByteArray2D 
{
    int getNumRows();
    int getNumColumns();
    byte get(int r, int c);
    void set(int r, int c, byte b);
}

Offering a basic abstraction of a "2D byte array". Depending on the application case, it may be beneficial to offer additional methods here. The patterns that could be employed here are frequently relevant for Matrix libraries, which handle "2D matrices" (usually of float values), and they often offer methods like these:

interface Matrix {
    Vector getRow(int row);
    Vector getColumn(int column);
    ...
}

However, when the main purpose here is to handle a set of byte[] arrays, methods for accessing each array (that is, each row of the 2D array) could be sufficient:

ByteBuffer getRow(int row);

Given this interface, it is simple to create different implementations. For example, you could create a simple implementation that just stores a 2D byte[][] array internally:

class SimpleByteArray2D implements ByteArray2D 
{
    private final byte array[][];
    ...
}

Alternatively, you could create an implementation that stores a 1D byte[] array, or analogously, a ByteBuffer internally:

class CompactByteArray2D implements ByteArray2D
{
    private final ByteBuffer buffer;
    ...
}

This implementation then just has to compute the (1D) index when calling one of the methods for accessing a certain row/column of the 2D array.

Below you will find a MCVE that shows this interface and the two implementations, the basic usage of the interface, and that does a memory footprint analysis using JOL.

The output of this program is:

For 10 rows and 1000 columns:
Total size for SimpleByteArray2D : 10240
Total size for CompactByteArray2D: 10088

For 100 rows and 100 columns:
Total size for SimpleByteArray2D : 12440
Total size for CompactByteArray2D: 10088

For 1000 rows and 10 columns:
Total size for SimpleByteArray2D : 36040
Total size for CompactByteArray2D: 10088

Showing that

the SimpleByteArray2D implementation that is based on a simple 2D byte[][] array requires more memory when the number of rows increases (even if the total size of the array remains constant)
the memory consumption of the CompactByteArray2D is independent of the structure of the array

The whole program:

package stackoverflow;

import java.nio.ByteBuffer;

import org.openjdk.jol.info.GraphLayout;

public class EfficientByteArrayStorage
{
    public static void main(String[] args)
    {
        showExampleUsage();
        anaylyzeMemoryFootprint();
    }

    private static void anaylyzeMemoryFootprint()
    {
        testMemoryFootprint(10, 1000);
        testMemoryFootprint(100, 100);
        testMemoryFootprint(1000, 10);
    }

    private static void testMemoryFootprint(int rows, int cols)
    {
        System.out.println("For " + rows + " rows and " + cols + " columns:");

        ByteArray2D b0 = new SimpleByteArray2D(rows, cols);
        GraphLayout g0 = GraphLayout.parseInstance(b0);
        System.out.println("Total size for SimpleByteArray2D : " + g0.totalSize());
        //System.out.println(g0.toFootprint());

        ByteArray2D b1 = new CompactByteArray2D(rows, cols);
        GraphLayout g1 = GraphLayout.parseInstance(b1);
        System.out.println("Total size for CompactByteArray2D: " + g1.totalSize());
        //System.out.println(g1.toFootprint());
    }

    // Shows an example of how to use the different implementations
    private static void showExampleUsage()
    {
        System.out.println("Using a SimpleByteArray2D");
        ByteArray2D b0 = new SimpleByteArray2D(10, 10);
        exampleUsage(b0);

        System.out.println("Using a CompactByteArray2D");
        ByteArray2D b1 = new CompactByteArray2D(10, 10);
        exampleUsage(b1);
    }

    private static void exampleUsage(ByteArray2D byteArray2D)
    {
        // Reading elements of the array
        System.out.println(byteArray2D.get(2, 4));

        // Writing elements of the array
        byteArray2D.set(2, 4, (byte)123);
        System.out.println(byteArray2D.get(2, 4));

        // Bulk access to rows
        ByteBuffer row = byteArray2D.getRow(2);
        for (int c = 0; c < row.capacity(); c++)
        {
            System.out.println(row.get(c));
        }

        // (Commented out for this MCVE: Writing one row to a file)
        /*/
        try (FileChannel fileChannel = 
            new FileOutputStream(new File("example.dat")).getChannel())
        {
            fileChannel.write(byteArray2D.getRow(2));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        //*/
    }

}


interface ByteArray2D 
{
    int getNumRows();
    int getNumColumns();
    byte get(int r, int c);
    void set(int r, int c, byte b);

    // Bulk access to rows, for convenience and efficiency
    ByteBuffer getRow(int row);
}

class SimpleByteArray2D implements ByteArray2D 
{
    private final int rows;
    private final int cols;
    private final byte array[][];

    public SimpleByteArray2D(int rows, int cols)
    {
        this.rows = rows;
        this.cols = cols;
        this.array = new byte[rows][cols];
    }

    @Override
    public int getNumRows()
    {
        return rows;
    }

    @Override
    public int getNumColumns()
    {
        return cols;
    }

    @Override
    public byte get(int r, int c)
    {
        return array[r][c];
    }

    @Override
    public void set(int r, int c, byte b)
    {
        array[r][c] = b;
    }

    @Override
    public ByteBuffer getRow(int row)
    {
        return ByteBuffer.wrap(array[row]);
    }
}

class CompactByteArray2D implements ByteArray2D
{
    private final int rows;
    private final int cols;
    private final ByteBuffer buffer;

    public CompactByteArray2D(int rows, int cols)
    {
        this.rows = rows;
        this.cols = cols;
        this.buffer = ByteBuffer.allocate(rows * cols);
    }

    @Override
    public int getNumRows()
    {
        return rows;
    }

    @Override
    public int getNumColumns()
    {
        return cols;
    }

    @Override
    public byte get(int r, int c)
    {
        return buffer.get(r * cols + c);
    }

    @Override
    public void set(int r, int c, byte b)
    {
        buffer.put(r * cols + c, b);
    }

    @Override
    public ByteBuffer getRow(int row)
    {
        ByteBuffer r = buffer.slice();
        r.position(row * cols);
        r.limit(row * cols + cols);
        return r.slice();
    }
}

Again, this is mainly intended as a sketch, to show one possible approach. The details of the interface will depend on the intended application pattern.

¹ A side note:

The problem of the memory overhead is similar in other languages. For example, in C/C++, the structure that most closely resembles a "2D Java array" would be an array of manually allocated pointers:

char** array;
array = new (char*)[numRows];
array[0] = new char[numCols];
...

In this case, you also have an overhead that is proportional to the number of rows - namely, one (usually 4 byte) pointer for each row.

回答2:

OK, so if I understood correctly (please ask if not - will try to answer), there are a couple of things here. First is that you need the right tool for measurements and JOL is the only one I trust.

Let' start simple:

byte[] two = new byte[1];
System.out.println(GraphLayout.parseInstance(one).toFootprint());

This will show 24 bytes (12 for mark and class words - or Object headers + 4 bytes padding), 1 byte for the actual value and 7 bytes for padding (memory is 8 bytes aligned).

Taking this into consideration, this should be a predictable output:

byte[] eight = new byte[8];
System.out.println(GraphLayout.parseInstance(eight).toFootprint()); // 24 bytes

byte[] nine = new byte[9];
System.out.println(GraphLayout.parseInstance(nine).toFootprint()); // 32 bytes

Now let's move to two dimensional arrays:

byte[][] ninenine = new byte[9][9];    
System.out.println(GraphLayout.parseInstance(ninenine).toFootprint()); // 344 bytes

System.out.println(ClassLayout.parseInstance(ninenine).toPrintable());

Since java does not have true two dimensional arrays; every nested array is itself an Object (byte[]) that has headers and content. Thus a single byte[9] has 32 bytes (12 headers + 4 padding) and 16 bytes for content (9 bytes for actual content + 7 bytes padding).

The ninenine object has 56 bytes total: 16 headers + 36 for keeping the references to the nine objects + 4 bytes for padding.

Look at the produced sample here:

byte[][] left = new byte[10000][10];
System.out.println(GraphLayout.parseInstance(left).toFootprint()); // 360016 bytes

byte[][] right = new byte[10][10000];
System.out.println(GraphLayout.parseInstance(right).toFootprint()); // 100216 bytes

That's a 260% increase; so by simply changing to work the other way around you can save a lot of space.

But the deeper problem is that every single Object in Java has those headers, there are no headerless objects yet. They might appear and are called Value Types. May be when that is implemented - arrays of primitives at least would not have this overhead.

来源：https://stackoverflow.com/questions/45829943/how-to-efficiently-store-small-byte-arrays-in-java

标签

java

arrays

memory

memory-efficient