I was wondering if there is an official source, why the following works:
#include
struct Array{
int el[10000];
};
__device__ Array devAr;
What you are doing is invalid and you should listen to the warning:
a
__device__
variabledevAr
cannot be directly read in a host function
First let me simplify your code a bit to only size necessary to show the issue:
#include
__device__ int devAr[1];
int main()
{
devAr[0] = 4;
std::cout << devAr[0] << std::endl;
}
Now what is happening:
__device__ int devAr[1];
allocates fixed size array in device memory and stores the pointer to this device memory inside the devAr
variable (hence the warning).devAr
address points to valid piece of device memory, however, such address can be used even in host code, because host and device memory use the addresses in the same format. However, in host code devAr
points to some random uninitialized piece of host memory.devAr[0] = 4;
just writes 4
into some random uninitialized location in host memory.Try running the following code, perhaps it will help you understand what is happening under the hood:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include
using namespace std;
__device__ int devAr[1];
__global__ void foo()
{
printf("dev: %d \n", devAr[0]);
devAr[0] = 5;
printf("dev: %d \n", devAr[0]);
}
int main()
{
cout << "host: " << devAr[0] << endl;
devAr[0] = 4;
cout << "host: " << devAr[0] << endl;
foo << <1, 1 >> >();
cudaDeviceSynchronize();
cout << "host: " << devAr[0] << endl;
}
Output is going to be:
host: 0
host: 4
dev: 0
dev: 5
host: 4
UPDATE:
After clarifying what you are asking in the below comments I started digging in the issue and found couple of related SO threads, most of the quotations come from the comments below the answers, here they are:
cudaMemcpy() vs cudaMemcpyFromSymbol():
any statically defined device symbol (
__device__
,__constant__
, even textures) results in the toolchain emitting two symbols, one in the device module, the other in the host object. The CUDA runtime sets up and maintains a dynamic mapping between these two symbols. The symbol API calls are the way of retrieving this mapping for__constant__
and__device__
symbols. The texture APIs retrieve the mapping for the texture symbols, etc.
Usage of global vs. constant memory in CUDA:
*PNT
is a__device__
variable, not a host variable containing the address of a device variable. (Confusing, I know.) Therefore if you try to access it on the host as with(void**)&PNT
you are trying to read a device variable from the host which is not permitted. From the host code point of view it's just a symbol, so you need to usecudaGetSympolAddress()
to store the device address in a host variable that you can then pass tocudaMemcpyToSymbol()
, as @talonmies shows.
CUDA Constant Memory Error:
Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel- If you want their device memory address, you must use
cudaGetSymbolAddress
to retrieve it at runtime.
cudaMemcpyToSymbol vs. cudaMemcpy why is it still around (cudaMemcpyToSymbol):
A copy to that address via the CUDA API would fail with an invalid argument error because it isn't an address in GPU memory space that the API had previously allocated. And yes, this applies to generic
__device__
pointers and statically declared device symbols as well.
cudaMemcpyFromSymbol on a __device__ variable:
The root of the problem is that you are not allowed to take the address of a device variable in ordinary host code: ... Although this seems to compile correctly, the actual address passed is garbage. To take the address of a device variable in host code, we can use
cudaGetSymbolAddress
Based on this evidence let me try to update my original 3step explanation from above:
__device__ int devAr[1];
allocates fixed size array in device memory and stores "hooks into a runtime device symbol lookup" into the host version of devAr
variable (see linked resources 1 and 3).devAr
address is just a garbage from host's point of view and should only be used with the symbol API calls, such as cudaGetSymbolAddress
(all of the linked resources appear to support this theory) because it maps to the device version of devAr
variable.I was not able to come up with anything "more concrete" such as link to CUDA documentation but I hope this is now clear enough. All in all it seems like you now have a guarantee for the behavior described above (i.e. there is a host and device version of devAr
variable) but to me it rather appears as an implementation detail which you should not rely on and should not use host version of devAr
variable for purposes other than symbol API calls.