Access CUDA global device variable from host

Asked by 情话喂你 on 2021-02-14 16:31

I was wondering if there is an official source, why the following works:

#include <cuda_runtime.h>

struct Array{
    int el[10000];
};

__device__ Array devAr;
1 Answer

    一向, answered 2021-02-14 16:59

    What you are doing is invalid and you should listen to the warning:

    a __device__ variable devAr cannot be directly read in a host function

    First, let me simplify your code a bit, to only the size necessary to show the issue:

    #include <iostream>
    
    __device__ int devAr[1];
    
    int main()
    {
        devAr[0] = 4;
        std::cout << devAr[0] << std::endl;
    }
    

    Now what is happening:

    1. __device__ int devAr[1]; allocates a fixed-size array in device memory and stores the pointer to this device memory inside the devAr variable (hence the warning).
    2. The devAr address points to a valid piece of device memory; such an address can be used even in host code, because host and device memory addresses share the same format. However, in host code devAr points to some random, uninitialized piece of host memory.
    3. Based on the above, one can say that devAr[0] = 4; just writes 4 into some random, uninitialized location in host memory.
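    For completeness, the supported way to set a __device__ variable from host code is through the symbol API rather than a direct assignment; a minimal sketch:

    #include "cuda_runtime.h"
    #include <iostream>

    __device__ int devAr[1];

    int main()
    {
        // Write through the symbol API instead of assigning devAr[0] directly.
        int value = 4;
        cudaMemcpyToSymbol(devAr, &value, sizeof(int));

        // Read it back the same way; readBack will be 4.
        int readBack = 0;
        cudaMemcpyFromSymbol(&readBack, devAr, sizeof(int));
        std::cout << readBack << std::endl;
    }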

    Try running the following code, perhaps it will help you understand what is happening under the hood:

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <iostream>
    
    using namespace std;
    
    __device__ int devAr[1];
    
    __global__ void foo()
    {
        printf("dev: %d \n", devAr[0]);
        devAr[0] = 5;
        printf("dev: %d \n", devAr[0]);
    }
    
    int main()
    {
        cout << "host: " << devAr[0] << endl;
        devAr[0] = 4;
        cout << "host: " << devAr[0] << endl;
    
        foo<<<1, 1>>>();
        cudaDeviceSynchronize();
        cout << "host: " << devAr[0] << endl;
    }
    

    Output is going to be:

    host: 0
    host: 4
    dev: 0
    dev: 5
    host: 4
    
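    Note how the final host: 4 shows that the kernel's write of 5 never reaches the host copy of devAr. To actually observe the device value on the host, copy it back through the symbol API; a sketch:

    #include "cuda_runtime.h"
    #include <iostream>

    __device__ int devAr[1];

    __global__ void foo()
    {
        devAr[0] = 5;  // writes the real device memory
    }

    int main()
    {
        foo<<<1, 1>>>();
        cudaDeviceSynchronize();

        // Copy from the device symbol into a host variable; prints "host: 5".
        int hostVal = 0;
        cudaMemcpyFromSymbol(&hostVal, devAr, sizeof(int));
        std::cout << "host: " << hostVal << std::endl;
    }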

    UPDATE:

    After clarifying what you are asking in the comments below, I started digging into the issue and found a couple of related SO threads. Most of the quotations come from comments below the answers; here they are:

    1. cudaMemcpy() vs cudaMemcpyFromSymbol():

      any statically defined device symbol (__device__, __constant__, even textures) results in the toolchain emitting two symbols, one in the device module, the other in the host object. The CUDA runtime sets up and maintains a dynamic mapping between these two symbols. The symbol API calls are the way of retrieving this mapping for __constant__ and __device__ symbols. The texture APIs retrieve the mapping for the texture symbols, etc.

    2. Usage of global vs. constant memory in CUDA:

      *PNT is a __device__ variable, not a host variable containing the address of a device variable. (Confusing, I know.) Therefore if you try to access it on the host as with (void**)&PNT you are trying to read a device variable from the host, which is not permitted. From the host code point of view it's just a symbol, so you need to use cudaGetSymbolAddress() to store the device address in a host variable that you can then pass to cudaMemcpyToSymbol(), as @talonmies shows.

    3. CUDA Constant Memory Error:

      Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel. If you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.

    4. cudaMemcpyToSymbol vs. cudaMemcpy why is it still around (cudaMemcpyToSymbol):

      A copy to that address via the CUDA API would fail with an invalid argument error because it isn't an address in GPU memory space that the API had previously allocated. And yes, this applies to generic __device__ pointers and statically declared device symbols as well.

    5. cudaMemcpyFromSymbol on a __device__ variable:

      The root of the problem is that you are not allowed to take the address of a device variable in ordinary host code: ... Although this seems to compile correctly, the actual address passed is garbage. To take the address of a device variable in host code, we can use cudaGetSymbolAddress.

    Based on this evidence, let me try to update my original 3-step explanation from above:

    1. __device__ int devAr[1]; allocates a fixed-size array in device memory and stores "hooks into a runtime device symbol lookup" in the host version of the devAr variable (see linked resources 1 and 3).
    2. The devAr address is just garbage from the host's point of view and should only be used with the symbol API calls, such as cudaGetSymbolAddress (all of the linked resources appear to support this theory), because it maps to the device version of the devAr variable.
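    A sketch of the cudaGetSymbolAddress route mentioned in the quotes, which resolves the host-side symbol to the real device address so it can be used with plain cudaMemcpy (or passed to a kernel):

    #include "cuda_runtime.h"
    #include <iostream>

    __device__ int devAr[1];

    int main()
    {
        // Resolve the host symbol to the actual device address.
        int* devPtr = nullptr;
        cudaGetSymbolAddress(reinterpret_cast<void**>(&devPtr), devAr);

        // devPtr is now a valid device pointer usable with ordinary cudaMemcpy.
        int value = 7;
        cudaMemcpy(devPtr, &value, sizeof(int), cudaMemcpyHostToDevice);

        int readBack = 0;
        cudaMemcpy(&readBack, devPtr, sizeof(int), cudaMemcpyDeviceToHost);
        std::cout << readBack << std::endl;  // 7
    }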

    I was not able to come up with anything "more concrete", such as a link to the CUDA documentation, but I hope this is now clear enough. All in all, it seems you do have a guarantee for the behavior described above (i.e. there are host and device versions of the devAr variable), but to me it rather appears to be an implementation detail which you should not rely on; do not use the host version of the devAr variable for purposes other than symbol API calls.
