How to structure data for optimal speed in a CUDA app

孤独总比滥情好 2021-02-08 18:27

I am attempting to write a simple particle system that leverages CUDA to update the particle positions. Right now I am defining a particle as an object with a posi

1 answer
  •  盖世英雄少女心
    2021-02-08 18:35

    It's common in data parallel programming to talk about "Struct of Arrays" (SOA) versus "Array of Structs" (AOS), where the first of your two examples is AOS and the second is SOA. Many parallel programming paradigms, in particular SIMD-style paradigms, will prefer SOA.
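
    To make the two layouts concrete, here is a minimal sketch. The question's own definitions are not shown here, so the type names `Vec3` and `Vec3Arrays` and the two stride helpers are assumptions chosen for illustration; they only mirror the `position[i].x` and `position.x[i]` access patterns used further down.

```cuda
#include <cstddef>

// AOS element: the x, y and z of one particle sit next to each other
// in memory (hypothetical name).
struct Vec3 {
    float x, y, z;
};

// SOA container: one array per component, indexed by particle
// (hypothetical name). All x values are contiguous, then all y, then all z.
struct Vec3Arrays {
    float *x;
    float *y;
    float *z;
};

// Distance in bytes between the x components of particle i and particle i+1.
constexpr std::size_t aosStrideBytes() { return sizeof(Vec3); }
constexpr std::size_t soaStrideBytes() { return sizeof(float); }

static_assert(aosStrideBytes() == 12, "three packed 4-byte floats");
static_assert(soaStrideBytes() == 4, "one 4-byte float");
```

    With AOS, `Vec3 *position` is indexed as `position[i].x`; with SOA, a `Vec3Arrays position` is indexed as `position.x[i]`. The stride is what decides how many bytes of each memory transaction a group of neighbouring threads actually uses.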

    In GPU programming, the reason that SOA is typically preferred is to optimise the accesses to the global memory. You can view the recorded presentation on Advanced CUDA C from GTC last year for a detailed description of how the GPU accesses memory.

    The main point is that memory transactions have a minimum size of 32 bytes and you want to maximise the efficiency of each transaction.

    With AOS:

    position[base + tid].x = position[base + tid].x + velocity[base + tid].x * dt;
    //  ^ write to every third address                    ^ read from every third address
    //                           ^ read from every third address
    

    With SOA:

    position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
    //  ^ write to consecutive addresses                  ^ read from consecutive addresses
    //                           ^ read from consecutive addresses
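
    For context, the SOA line above could sit in a kernel like the following sketch. The kernel name `integrateSoA`, the `Vec3Arrays` type, and the host helper `stepSoAOnce` are hypothetical, and packing all six component arrays into a single allocation is just a convenience for the example.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical SOA container: one device array per component.
struct Vec3Arrays { float *x, *y, *z; };

// One thread per particle. Threads with consecutive tid touch
// consecutive floats in each component array.
__global__ void integrateSoA(Vec3Arrays position, Vec3Arrays velocity,
                             float dt, int n)
{
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    if (base + tid < n) {
        position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
        position.y[base + tid] = position.y[base + tid] + velocity.y[base + tid] * dt;
        position.z[base + tid] = position.z[base + tid] + velocity.z[base + tid] * dt;
    }
}

// Hypothetical host helper: n particles (n > 0), every position component
// set to p0 and every velocity component to v0; runs one step and returns
// the 3*n updated position floats (x block, then y, then z).
// Returns an empty vector if the device allocation fails.
std::vector<float> stepSoAOnce(int n, float p0, float v0, float dt)
{
    std::vector<float> host(6 * static_cast<std::size_t>(n), p0);
    for (std::size_t i = 3 * static_cast<std::size_t>(n); i < host.size(); ++i)
        host[i] = v0;

    const std::size_t bytes = host.size() * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return {};
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // Slice the single buffer into six component arrays. If n is not a
    // multiple of 8, the slices are not 32-byte aligned, which is still
    // correct but costs a little efficiency.
    Vec3Arrays pos{dev,         dev + n,     dev + 2 * n};
    Vec3Arrays vel{dev + 3 * n, dev + 4 * n, dev + 5 * n};

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    integrateSoA<<<blocks, threads>>>(pos, vel, dt, n);

    host.resize(3 * static_cast<std::size_t>(n));
    cudaMemcpy(host.data(), dev, host.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

    Calling `stepSoAOnce(1000, 1.0f, 2.0f, 0.5f)` should give 3000 floats, each equal to `1.0f + 2.0f * 0.5f`.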
    

    In the second case, reading from consecutive addresses means that you have 100% efficiency, versus 33% in the first case, where each thread uses only one 4-byte float out of every 12 bytes fetched. Note that on older GPUs (compute capability 1.0 and 1.1) the situation is much worse (13% efficiency).
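
    The 100% and 33% figures can be reproduced with some host-side arithmetic. The sketch below is a simplified model, not a description of the actual hardware: it assumes a warp of 32 threads, one 4-byte read per thread, 32-byte transactions, and a base address aligned to a transaction boundary. The function name `warpLoadEfficiency` is hypothetical. It does not model the older compute capability 1.0/1.1 rules.

```cuda
#include <cstddef>
#include <set>

// Fraction of the fetched bytes that a warp actually uses when each of its
// 32 threads reads one 4-byte float, with consecutive threads `strideBytes`
// apart. Simplified model: 32-byte transactions, aligned base address.
double warpLoadEfficiency(std::size_t strideBytes)
{
    const std::size_t warpSize     = 32;
    const std::size_t wordBytes    = 4;
    const std::size_t segmentBytes = 32;

    // Collect every 32-byte segment touched by any thread in the warp.
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < warpSize; ++lane) {
        const std::size_t first = lane * strideBytes;
        segments.insert(first / segmentBytes);
        segments.insert((first + wordBytes - 1) / segmentBytes);
    }

    const double used    = static_cast<double>(warpSize * wordBytes);
    const double fetched = static_cast<double>(segments.size() * segmentBytes);
    return used / fetched;
}
```

    With a stride of 4 bytes (SOA) the warp touches 4 segments and uses all 128 bytes fetched. With a stride of 12 bytes (AOS with three floats) it touches 12 segments, so only 128 of 384 fetched bytes are used, which is one third.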

    There is one other possibility - if you had two or four floats in the struct then you could read the AOS with 100% efficiency:

    float4 lpos;
    float4 lvel;
    lpos = position[base + tid];
    lvel = velocity[base + tid];
    lpos.x += lvel.x * dt;
    //...
    position[base + tid] = lpos;
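
    Fleshed out, the `float4` variant might look like the sketch below. The kernel name `integrateFloat4` and the host helper `stepFloat4Once` are hypothetical, and using the `w` component purely as padding is an assumption; a real particle system might store mass or lifetime there.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// AOS with 16-byte elements: each thread moves its whole particle with a
// single float4 load and a single float4 store.
__global__ void integrateFloat4(float4 *position, const float4 *velocity,
                                float dt, int n)
{
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    if (base + tid < n) {
        float4 lpos = position[base + tid];
        float4 lvel = velocity[base + tid];
        lpos.x += lvel.x * dt;
        lpos.y += lvel.y * dt;
        lpos.z += lvel.z * dt;
        // lpos.w is left untouched (treated as padding in this sketch).
        position[base + tid] = lpos;
    }
}

// Hypothetical host helper: n particles (n > 0), position (p0, p0, p0, 0)
// and velocity (v0, v0, v0, 0); runs one step and returns the positions.
// Returns an empty vector if a device allocation fails.
std::vector<float4> stepFloat4Once(int n, float p0, float v0, float dt)
{
    std::vector<float4> pos(n, make_float4(p0, p0, p0, 0.0f));
    std::vector<float4> vel(n, make_float4(v0, v0, v0, 0.0f));
    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float4);

    float4 *dPos = nullptr;
    float4 *dVel = nullptr;
    if (cudaMalloc(&dPos, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&dVel, bytes) != cudaSuccess) { cudaFree(dPos); return {}; }
    cudaMemcpy(dPos, pos.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dVel, vel.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    integrateFloat4<<<blocks, threads>>>(dPos, dVel, dt, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(pos.data(), dPos, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dPos);
    cudaFree(dVel);
    return pos;
}
```

    The trade-off against SOA is that a kernel which needs only one component still pays for loading all four.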
    

    Again, check out the Advanced CUDA C presentation for the details.
