Java Performance: multiple array vs single array of custom object [closed]

Submitted by 允我心安 on 2021-02-16 15:37:18

Question


Which of these two code solutions gives the better performance?

class Meme {
  private int[] a = new int[100];
  private int[] b = new int[100];
  private int[] c = new int[100];
}

vs

class Meme {
   private MemeProp[] prop = new MemeProp[100];
   class MemeProp {
     int a;
     int b;
     int c;
   }
}

Consider continuous (sequential) access for reading and writing the properties a, b, and c.

I need to write code optimized for fast execution, not for memory usage. My benchmark metric is therefore execution time.


Answer 1:


It depends a lot on your memory access patterns.

The first one is definitely more compact.

User-defined types in Java carry per-object overhead: every instance has a header, on the order of 8 bytes on a 64-bit JVM. A boxed Integer, for example, can take 16 bytes (8 bytes of header + 4 bytes for the int + 4 bytes of alignment padding), while a primitive int takes a mere 4 bytes. It's analogous to a class with virtual functions in C++ storing a vptr.

Given this, if we look at the memory usage of MemeProp, we have something like this:

  class MemeProp {
     // invisible 8 byte pointer with 8-byte alignment requirements
     int a; // 4 bytes
     int b; // 4 bytes
     int c; // 4 bytes
     // 4 bytes of padding for alignment of invisible field
   }

The resulting memory size is 24 bytes per MemeProp instance. When we take a hundred of those, we end up with a total memory usage of 2400 bytes, and that's before counting the MemeProp[] array itself, which holds another 100 references to those objects.

Meanwhile, your 3 arrays, each containing a hundred ints, will only require slightly over 1200 bytes (a tiny bit extra per array for the header storing the length). That's very close to half the size of your second version.

Sequential Access

When you process data sequentially, speed and size often tend to go hand-in-hand. If more data can fit into a page and cache line, your code will typically consume it much faster in cases where the machine instructions don't vary much between the larger vs. tighter representation.

So from a sequential access standpoint, your first version which requires half the memory is likely to go quite a bit faster (maybe almost twice as fast in some cases).
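As a sketch of that sequential case (the class name MemeArrays and its methods are illustrative, not from the original question), a single streaming pass reads all three parallel arrays in index order, so the hardware prefetcher sees three plain contiguous streams:

```java
// Sequential pass over the first layout (three parallel primitive arrays).
// Each array is read front to back, which is the friendliest pattern for
// caches and prefetching.
class MemeArrays {
    private final int[] a = new int[100];
    private final int[] b = new int[100];
    private final int[] c = new int[100];

    void fill(int v) {
        java.util.Arrays.fill(a, v);
        java.util.Arrays.fill(b, v);
        java.util.Arrays.fill(c, v);
    }

    long sumAll() {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] + b[i] + c[i]; // three sequential streams
        }
        return sum;
    }
}
```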

Random Access

Yet random access is a different case. Let's say a, b, and c are equally hot fields, always accessed together in your tight loops which have a random access pattern into this structure.

In that case, your second version may actually fare better. It's because it offers a contiguous layout for a MemeProp object where a, b, and c will end up being right next to each other in memory, always (no matter how the garbage collector ends up rearranging the memory layout for a MemeProp instance).

With your first version, your a, b, and c arrays are spread out in memory. The stride between them can never be smaller than 400 bytes. This equates to potentially a lot more cache misses if you end up accessing some random element, say the 66th, when we access a[65], b[65], and c[65]. If that's the first time we're accessing those fields, we'll end up with 3 cache misses. Then we might access a[7], b[7], and c[7], which sit 232 bytes away from a[65], b[65], and c[65] respectively, and we might end up with 3 more cache misses.

Possibly Better than Both

Let's say you need random AoS-style access and all fields are always accessed together. In that case, the best representation is likely this:

class Meme {
    private int[] abc = new int[100 * 3];
}
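The index math for that interleaved layout might look like this (the accessor names are illustrative, not part of the original answer): element i's fields live at abc[i*3], abc[i*3 + 1], and abc[i*3 + 2], so one cache line fetch tends to bring in all three.

```java
// Interleaved single-array layout: a, b, c for element i are adjacent.
class MemeInterleaved {
    private final int[] abc = new int[100 * 3];

    // Fields of element i sit at abc[i*3], abc[i*3+1], abc[i*3+2].
    int a(int i) { return abc[i * 3]; }
    int b(int i) { return abc[i * 3 + 1]; }
    int c(int i) { return abc[i * 3 + 2]; }

    void set(int i, int a, int b, int c) {
        abc[i * 3] = a;
        abc[i * 3 + 1] = b;
        abc[i * 3 + 2] = c;
    }
}
```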

This ends up taking the minimum amount of memory of all three solutions and guarantees that the a, b, and c fields of a single logical element are right next to each other.

Of course, there might be some cases where your mileage may vary, but this might be the strongest candidate among these three if you need both random and sequential access.

Hot/Cold Field Splitting

For completeness, let's consider a case where your memory access patterns are sequential but not all fields (a/b/c) are accessed together. Instead, you have one performance-critical loop which accesses a and b together, and some non-performance-critical code which accesses only c. In that case, you might get the best results from a representation like this:

class Meme {
    private int[] ab = new int[100 * 2];
    private int[] c = new int[100];
}
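A sketch of what the hot loop looks like under that split (class and method names are illustrative): the performance-critical pass touches only the interleaved ab array, so the cold c data never occupies space in the cache lines it pulls in.

```java
// Hot/cold field split: a and b are interleaved in one array for the hot
// loop; the rarely-used c field lives in its own array.
class MemeSplit {
    private final int[] ab = new int[100 * 2];
    private final int[] c = new int[100];

    void setAB(int i, int a, int b) {
        ab[i * 2] = a;
        ab[i * 2 + 1] = b;
    }

    // Performance-critical pass: reads only the a/b data, never touching c.
    long sumAB() {
        long sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += ab[i * 2] + ab[i * 2 + 1];
        }
        return sum;
    }
}
```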

This makes it so our data looks like this:

abababab...
ccccc...

... as opposed to this:

abcabcabcabc...

In this case, by hoisting out c and putting it into a separate array, it is no longer interleaved with a and b fields, allowing the computer to "consume" this relevant data (a and b in those performance-critical loops) at a faster rate as it's moving contiguous chunks of this memory into faster but smaller forms of memory (physically-mapped page, CPU cache line).

SoA Access Pattern

Finally, let's say you are accessing all fields separately. Each critical loop only accesses a, only accesses b, or only accesses c. In that case, your first representation is likely to be the fastest, especially if your compiler can emit efficient SIMD instructions which can vectorize processing of multiple such fields in parallel.
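In that SoA case, each critical loop reduces to a plain scan over one primitive array, which is exactly the shape the JIT's auto-vectorizer handles best. A minimal sketch (names illustrative):

```java
// SoA access: the loop touches a single contiguous int array, giving the
// JIT a straightforward candidate for SIMD auto-vectorization.
class MemeSoA {
    private final int[] a = new int[100];

    void fillA(int value) {
        java.util.Arrays.fill(a, value);
    }

    long sumA() {
        long sum = 0;
        for (int x : a) { // contiguous reads over one field only
            sum += x;
        }
        return sum;
    }
}
```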

Relevant Data in Cache Lines

If you find all of this confusing, I don't blame you, but there was something harold, a computer architecture wizard, told me once on this site. He summed it all up elegantly: the goal is to avoid loading irrelevant data into a cache line, only to have it evicted without ever being used. As much intuition as I had developed over all my profiling sessions, I never found such a concise way to put it that made sense of all those cache misses.

Our hardware and operating systems want to move memory from bigger but slower forms of memory to smaller but faster forms of memory, and when they do, they tend to "grab memory by the handful". If you are trying to grab M&Ms by the handful from a bowl but you're only interested in eating green M&Ms, it becomes very inefficient to grab a handful of mixed M&Ms only to pick out the green ones and return the rest to the bowl. It's far more efficient if the bowl is filled with only green M&Ms, and that's roughly the goal when settling on an efficient memory layout, to use a crude but hopefully helpful analogy. If all you want to access in a critical loop are the analogical green M&Ms, don't interleave that data with the red, blue, and yellow ones. Instead, keep the green ones right next to each other in memory, so that when you grab things by the handful, you only get what you want.

Data-Oriented Design

One thing you are doing correctly, if you anticipate a large-input, loop-heavy scenario for these MemeProps, is designing your public interface at the collection level (the Meme level) and making the MemeProp fields a private detail.

That's probably the most effective strategy before you measure: identify the places where you process data in bulk (though 100 elements isn't exactly bulk; I'm hoping your actual scenario is much bigger) and design your public interfaces accordingly.

For example, if you are designing an Image class and performance is a key goal, then you want to avoid exposing a Pixel object which provides operations on a pixel-by-pixel basis. Far better is designing this interface at the bulk Image or Scanline level, allowing a bunch of pixels to be processed in bulk.

That leaves you a lot more wiggle room to measure and tune data representations than a design where ten thousand client dependencies hang off some granular Pixel interface representing a single pixel.

So anyway, the safest bet is to measure, but it's good that you are designing your interfaces at an appropriately coarse level.
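If you do measure, a harness like JMH is the right tool since it handles warmup and JIT effects. A very rough timing sketch (illustrative only, not a substitute for a real benchmark) looks like this:

```java
// Crude timing harness: runs a workload once and reports elapsed nanoseconds.
// For trustworthy numbers, prefer JMH, which deals with warmup, dead-code
// elimination, and JIT compilation; this only illustrates "measure first".
class LayoutTimer {
    static long timeNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }
}
```

You would time each candidate layout's hot loop with realistic data sizes and access patterns, since as discussed above, the winner depends heavily on both.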



Source: https://stackoverflow.com/questions/34486450/java-performance-multiple-array-vs-single-array-of-custom-object
