As indicated in the comments, there are several problems with your approach.
- Especially as a beginner, you should always do error checking on your CUDA calls (including kernel calls). The code below includes an example, or refer to this question/answer.
- Creating a pointer-to-pointer arrangement in CUDA is sometimes not intuitive, because the obvious approach of cudaMalloc'ing the top-level pointer and then cudaMalloc'ing the pointers underneath it will not work. To cudaMalloc the pointers underneath it, we would have to pass the address of an element of the top-level array to cudaMalloc, but after the first cudaMalloc that array lives in device memory. cudaMalloc runs on the host and stores the new device address in the location you pass it, so that location must be in host memory. To address this, it's usually necessary to create a shadow (parallel) array of pointers on the host, pass each of its elements to cudaMalloc in turn, and then copy those pointers to the device. See my code below for an example.
- You also wanted to test for the validity of a device pointer on the host to see if you needed to cudaMalloc it. This won't work as it leads to dereferencing a device pointer on the host. Specifically at this line:
if(_devStackImagesCuda[i] == NULL)
Here you are trying to see whether _devStackImagesCuda[i] is valid, but in order to do this you must dereference _devStackImagesCuda. However, you have previously done a cudaMalloc on this pointer (to a pointer), so it is now a device pointer, which you are not allowed to dereference on the host. I suggest you keep track of whether you need to cudaMalloc these pointers some other way.
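One way to do that tracking is to keep the host-side shadow array around and test its entries instead of the device array. The following is only a sketch under that assumption: `ensureAllocated`, `check`, `d_stack` and `h_shadow` are hypothetical names I made up for illustration, not anything from your code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

// Hypothetical helper: allocate image slot i only if it has not been
// allocated yet. Returns true if an allocation was made.
//   d_stack  : device array of pointers (from cudaMalloc)
//   h_shadow : host array holding copies of those pointers, initially all null
bool ensureAllocated(unsigned char **d_stack, unsigned char **h_shadow,
                     int i, size_t bytes)
{
    // h_shadow lives in host memory, so reading h_shadow[i] here is legal.
    if (h_shadow[i] != nullptr)
        return false;

    // cudaMalloc writes the new device address into the host shadow entry.
    check(cudaMalloc((void **)&h_shadow[i], bytes), "cudaMalloc");

    // Publish that address into slot i of the device-side pointer array.
    // d_stack + i is only pointer arithmetic; nothing is dereferenced.
    check(cudaMemcpy(d_stack + i, &h_shadow[i], sizeof(unsigned char *),
                     cudaMemcpyHostToDevice), "cudaMemcpy");
    return true;
}
```

The host array must start out zeroed (for example `unsigned char *h_shadow[5] = {0};`) so that a null entry reliably means "not allocated yet".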
I believe something like this will work:
#include <stdio.h>
#include <stdlib.h>
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
int main(){
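unsigned char ** _devStackImagesCuda=0;
const int stackSize = 5;
const int imageSize = 4;
unsigned char *temp[stackSize]; // host-side shadow copies of the child device pointers
unsigned char dummy_image[imageSize] = {0}; // placeholder image data, zero-initialized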
// first create top level pointer
if ( _devStackImagesCuda == 0) //allocate array of pointers on the device
{
cudaMalloc(&_devStackImagesCuda, sizeof(unsigned char*) * stackSize);
cudaCheckErrors("cm 1");
}
// then create child pointers on host, and copy to device, then copy image
for(int i = 0; i < stackSize; i++)
{
cudaMalloc(&temp[i], imageSize * sizeof(unsigned char));
cudaCheckErrors("cm 2");
cudaMemcpy(&(_devStackImagesCuda[i]), &(temp[i]), sizeof(unsigned char *), cudaMemcpyHostToDevice);//copy child pointer to device
cudaCheckErrors("cudamemcopy1");
cudaMemcpy(temp[i], dummy_image, imageSize*sizeof(unsigned char), cudaMemcpyHostToDevice); // copy image to device
cudaCheckErrors("cudamemcpy2");
}
return 0;
}
By the way, you could simplify things quite a bit if you can treat your array of images as a contiguous region, like so:
unsigned char images[NUM_IMAGES*IMAGE_SIZE]; // or you could malloc this
unsigned char *d_images;
cudaMalloc((void **) &d_images, NUM_IMAGES*IMAGE_SIZE*sizeof(unsigned char)); // pass the address of the pointer
cudaMemcpy(d_images, images, NUM_IMAGES*IMAGE_SIZE*sizeof(unsigned char), cudaMemcpyHostToDevice);
and access individual image elements by:
unsigned char mypixel = images[i + (IMAGE_SIZE * j)]; // to access element i in image j
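The same index arithmetic works on the device copy. As a rough sketch (the kernel `invert`, the helper `pixelIndex`, and the sizes 3 and 4 are just made up for illustration), a kernel could address the flat buffer like this:

```cuda
#include <cuda_runtime.h>

#define NUM_IMAGES 3   // assumed value, for illustration only
#define IMAGE_SIZE 4   // assumed value, for illustration only

// Offset of element i of image j in the flat buffer (usable on host and device).
__host__ __device__ inline size_t pixelIndex(int i, int j)
{
    return (size_t)i + (size_t)IMAGE_SIZE * (size_t)j;
}

// Hypothetical kernel: one block per image, one thread per element.
__global__ void invert(unsigned char *d_images)
{
    int i = threadIdx.x;   // element within the image
    int j = blockIdx.x;    // which image
    if (i < IMAGE_SIZE && j < NUM_IMAGES) {
        size_t idx = pixelIndex(i, j);
        d_images[idx] = 255 - d_images[idx];
    }
}
```

With d_images allocated and filled by cudaMemcpy, it would be launched as `invert<<<NUM_IMAGES, IMAGE_SIZE>>>(d_images);`, followed by the usual error check. Only a single device pointer is passed, so no pointer-to-pointer bookkeeping is needed.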