Is my understanding of AoS vs SoA advantages/disadvantages correct?

Posted by 删除回忆录丶 on 2019-12-02 18:23:31

"traversing" just means looping over the data.

And yes, you're right about cache ways and collisions. 64B (cache line size) blocks of memory that are offset from each other by a large power of 2 map to the same set, and thus compete with each other for ways in that set, instead of being cached in different sets. (e.g. Intel's L1 data caches are 32kiB, 8-way associative, with 64B lines. There are 32kiB / 64B/line = 512 lines, grouped into 512 lines / 8 ways/set = 64 sets.)

Loading 9 items offset from each other by 4kiB (64B/line * 64 sets, not coincidentally the page size) will evict the first one.

L2 and L3 caches are more highly associative, like 16 or 24 way, but still susceptible to "aliasing" like this, just like a hash table, where there's lots of demand for some sets (buckets) and no demand for other sets (buckets). For CPU caches, the "hash function" is almost always to use some of the address bits as an index, and to ignore the other bits. (The high bits of an address are used as the tag, to determine if any way in the set is actually caching the requested block, and the low bits are used to select bytes within the cache line.)
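As a toy model of that index function (using the 32kiB / 8-way / 64B-line L1d numbers from above; a sketch of the indexing math, not any particular CPU's exact replacement policy):

```cpp
#include <cstdint>

// 32 KiB, 8-way, 64 B lines => 64 sets. The set "hash" is just
// address bits [11:6]; higher bits become the tag, and bits [5:0]
// select a byte within the cache line.
constexpr unsigned kLineBits = 6;   // log2(64 B line)
constexpr unsigned kSets     = 64;  // 32 KiB / 64 B / 8 ways

constexpr unsigned set_index(uintptr_t addr) {
    return (addr >> kLineBits) & (kSets - 1);
}
// Blocks 4 KiB apart (64 B/line * 64 sets) all hash to the same set,
// so a 9th such block must evict one of the 8 ways in that set.
```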


I think the SoA benefit is mostly from SIMD (auto-vectorization or manual), but also if you tend to loop over your data looking at only one or two fields from most structs, and only accessing the rest in rare cases where you find an interesting one based on one member.

A hybrid approach with separate arrays for each thing (or group of things) you look at together could make sense, with the rest of the data for each object in an array of structs. I'm imagining a linear search loop where most objects are rejected based on looking at one int field, but for the few objects that pass that test, you look at all the fields.

Grouping together the fields that are mostly accessed together gives you the benefit of spatial locality for these accesses, while still letting search loops that check the key field loop over contiguous memory (rather than with a big stride).
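A minimal sketch of that hybrid layout (the struct and field names here are made up for illustration): the key being scanned gets its own packed array, and everything else stays as an array of structs indexed by the same position:

```cpp
#include <vector>
#include <cstddef>

// Hot field split out so the search loop streams over packed keys;
// cold fields stay as an array of structs, read only on a match.
struct Cold { float a, b, c, d; };     // the rarely-touched fields

struct Hybrid {
    std::vector<int>  key;    // scanned every query: full cache lines of keys
    std::vector<Cold> rest;   // rest[i] belongs to key[i]
};

// Linear search over the packed keys; returns -1 if not found.
inline ptrdiff_t find_key(const Hybrid& h, int wanted) {
    for (size_t i = 0; i < h.key.size(); ++i)
        if (h.key[i] == wanted)
            return (ptrdiff_t)i;       // caller then reads h.rest[i]
    return -1;
}
```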


I'm currently experimenting with a layout that interleaves in SIMD vector sized groups. Most of the code that traverses the data needs all fields from every object, and doing it this way means the loop only needs one pointer, and all the memory is allocated as a single block.

This is for collision-detection masks (in a 2D space game (Endless Sky) where it's all collision between a line segment and a ship outline (traced automatically from the sprite), not between two polygons). Here's the original which looped over a vector of double x,y pairs (and used some (non-inline!) functions to operate on them as a 16B SIMD vector, often with slow SSE3 horizontal-add instructions and stuff like that :( ).

SSE2/SSE3 on XY pairs is probably better than nothing if you can't change the data layout, but changing the layout removes all the shuffling for doing 4 cross products in parallel. See the slides from this SIMD (SSE) intro at Insomniac Games (GDC 2015). It starts off with very basic stuff for people who haven't done anything with SIMD before, and explains exactly how structs of arrays are helpful. By the end, it gets to intermediate/advanced SSE techniques, so it's worth flipping through even if you already know some SIMD stuff. See also the tag wiki for some other links.


Anyway, this is the interleave data structure I came up with:

class Mask {
...

struct xy_interleave {
    static constexpr unsigned vecSize = 4;
    static constexpr unsigned alignMask = vecSize-1;
    alignas(64) float x[vecSize];
    float y[vecSize];
    // TODO: reduce cache footprint by calculating this on the fly, maybe with an unaligned load?
    float dx[vecSize]; // next - current;   next.x = x+dx
    float dy[vecSize];
};
std::vector<xy_interleave> outline_simd;

};

Then I can loop over it with stuff like (real code here: this is my work-in-progress not-cleaned-up code that's not ready to be sent upstream)

__m128 minus_point_ps = _mm_cvtpd_ps(-point);    // + is commutative, which helps the compiler with AVX
const __m128 minus_px = _mm_set1_ps(minus_point_ps[0]);
const __m128 minus_py = _mm_set1_ps(minus_point_ps[1]);
const __m128 range2 = _mm_set1_ps(float(range*range));

for(const xy_interleave &curr : outline_simd)
{
    // load the x and y fields from this group of 4 objects, all of which come from the same cache line
    __m128 dx = _mm_load_ps(curr.x) + minus_px;
    __m128 dy = _mm_load_ps(curr.y) + minus_py;
    // using GNU Vector Extensions for + and *, instead of _mm_add_ps and _mm_mul_ps, since GNU C++ defines __m128 in terms of __v4sf
    __m128 cmp = _mm_cmplt_ps(dx*dx, range2 - dy*dy);  // dx^2 + dy^2 < range^2, rearranged for more ILP

    if(_mm_movemask_ps(cmp))
        return true;
}

This compiles to really nice-looking asm loops, with only one pointer looping over the std::vector, and vector loads from constant offsets relative to that loop pointer.

However, scalar fallback loops over the same data are less pretty. (And actually I use loops like this (with j+=4) in the manually-vectorized parts, too, so I can change the interleave without breaking the code. It compiles away entirely, or turns into an unroll).

// TODO: write an iterator or something to make this suck less
for(const xy_interleave &curr : outline_simd)
    for (unsigned j = 0; j < curr.vecSize; ++j)
    {
        float dx = curr.x[j] - px;
        float dy = curr.y[j] - py;
        if(dx*dx + dy*dy < range2)
            return true;
    }

Unfortunately I've had no luck getting gcc or clang to auto-vectorize this, even for easy cases with no conditionals (e.g. just finding the min range from a query x,y to any point in the collision-mask, instead of checking if a point is within range).
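One possible shape for that TODO: a flat-index accessor that hides the interleave, sketched here against a simplified copy of the struct (dx/dy omitted). Since vecSize is a power of 2, the div/mod compile down to a shift and a mask:

```cpp
#include <vector>
#include <cstddef>

// Simplified copy of the xy_interleave layout above (dx/dy omitted).
struct xy_interleave {
    static constexpr unsigned vecSize = 4;
    alignas(64) float x[vecSize];
    float y[vecSize];
};

// Treat the vector of 4-wide groups as one flat array of vertices:
// index i maps to group i/vecSize, lane i%vecSize.
inline float x_at(const std::vector<xy_interleave>& v, size_t i) {
    return v[i / xy_interleave::vecSize].x[i % xy_interleave::vecSize];
}
inline float y_at(const std::vector<xy_interleave>& v, size_t i) {
    return v[i / xy_interleave::vecSize].y[i % xy_interleave::vecSize];
}
```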


I might discard this idea and go with separate x and y arrays. (Maybe packed head to tail in the same std::vector<float> (with an aligned allocator) to keep it part of one allocation, but that would still mean that loops would need separate x and y pointers because the offset between the x and y for a given vertex would be a runtime variable, not a compile-time constant.)

Having all the xs contiguous would be a big help if I want to stop storing the x[i+1]-x[i], and compute it on the fly. With my layout, I would need to shuffle between vectors, instead of just doing an unaligned offset by 1 float.
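With a contiguous x array, the delta is just a shifted read; a scalar sketch of the idea (the SIMD version would be an unaligned load from &x[i+1] minus an aligned load from &x[i], no shuffles). The wrap-around for the last vertex of a closed outline is my assumption here:

```cpp
#include <vector>
#include <cstddef>

// With all xs contiguous, next-minus-current is just x[i+1] - x[i].
std::vector<float> deltas(const std::vector<float>& x) {
    std::vector<float> dx(x.size());
    for (size_t i = 0; i + 1 < x.size(); ++i)
        dx[i] = x[i + 1] - x[i];
    if (!x.empty())
        dx.back() = x.front() - x.back();  // closed outline wraps around
    return dx;
}
```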

It would hopefully also allow the compiler to auto-vectorize some of the functions (e.g. for ARM, or for AVX/AVX2 with wider vectors).

Of course, manual vectorization is going to win here, since I'm doing stuff like XORing floats together because I only care about their sign bit as a truth value, instead of doing a compare and then XORing the compare result. (My testing so far has shown that treating negative 0 as negative still gives correct results for Mask::Intersect, but any way to express it in C is going to follow the IEEE rules where x >= 0 is true for x=-0.).


"if you almost exclusively need to access a single field at a time on a large amount of data AoS is likely to be more performant while if you often need to access multiple fields from the same object or deal with single objects rather than many at once, SoA will be more performant."

You have this exactly backwards. Was this a typo? Grouping all the foo[i].key fields into a foo.key[i] array means they're all packed together in the cache, so accessing just that one field in many objects means you're using all 64 bytes of every cache line that you touch.

You got it correct earlier when you wrote

"When working with only some of the data from many 'Person' records, only that data needs to be loaded into memory."

(except I think you mean "from" memory (into cache), unless you're talking about a memory-mapped file and faulting pages from disk into memory.)


Indexed addressing modes:

In a situation where you're looking at two or three fields in each object, an SoA layout is going to tie up more registers holding separate base addresses for each separate array you're looping over.

With multiple pointers, you're going to want to either use addressing modes like [reg1 + 4*reg2] on x86, or you're going to need to separately increment a bunch of different pointers inside your loop. Indexed addressing modes are potentially slightly slower on Intel SnB-family, because they can't stay micro-fused with ALU uops in the out-of-order core (only in the decoders and uop cache). Skylake can keep them micro-fused, but further testing is needed to find out when Intel made this change. Perhaps with Broadwell when three-input instructions beyond FMA (like CMOV and ADC) decoded to a single uop, but that's a pure guess. Testing on Haswell and Broadwell is needed.
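For example, a two-array SoA loop like this leaves the compiler a choice between one index register used in two indexed addressing modes, or two pointers incremented separately inside the loop:

```cpp
#include <cstddef>

// Two separate SoA base addresses (x and y) plus one induction
// variable: on x86 this can compile to indexed addressing like
// [rdi + 4*rcx] and [rsi + 4*rcx], or to two incremented pointers.
inline float dot(const float* x, const float* y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```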
