Building a tiny R package with CUDA and Rcpp

Question

I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/rcppcuda.cu to src/rcppcuda.cpp, and comment out the code that defines and calls kernels). But as is, the compilation fails.

I also would like to know how to compile with a Makevars or Makevars.in instead of a Makefile, and in general, try to make this as platform independent as is realistic. I've read about Makevars in the R extensions manual, but I still haven't been able to make it work.

Some of you may suggest rCUDA, but what I'm really after here is improving a big package I've already been developing for some time, and I'm not sure that switching is worth starting again from scratch.

Anyway, here's what happens when I do an R CMD build and R CMD INSTALL on this one (master branch, commit entitled "fixed namespace").

* installing to library ‘/home/landau/.R/library’
* installing *source* package ‘rcppcuda’ ...
** libs
** arch - 
/usr/local/cuda/bin/nvcc -c rcppcuda.cu -o rcppcuda.o --shared -Xcompiler "-fPIC" -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -I/apps/R-3.2.0/include -I/usr/local/cuda/include 
rcppcuda.cu:1:18: error: Rcpp.h: No such file or directory
make: *** [rcppcuda.o] Error 1
ERROR: compilation failed for package ‘rcppcuda’
* removing ‘/home/landau/.R/library/rcppcuda’

...which is strange, because I do include Rcpp.h, and Rcpp is installed.

$ R

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

...

> library(Rcpp)
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rcpp_0.11.6
>

I'm using CentOS,

$ cat /etc/*-release
CentOS release 6.6 (Final)
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
CentOS release 6.6 (Final)
CentOS release 6.6 (Final)

CUDA version 6,

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Thu_Mar_13_11:58:58_PDT_2014
Cuda compilation tools, release 6.0, V6.0.1

and I have access to 4 GPUs of the same make and model.

$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery 
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla M2070"
  CUDA Driver Version / Runtime Version          6.0 / 6.0
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 5375 MBytes (5636554752 bytes)
  (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
  GPU Clock rate:                                1147 MHz (1.15 GHz)
  Memory Clock rate:                             1566 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           11 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

...

> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU3) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU3) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU2) : No
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU3) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU0) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU0) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU2) : No
> Peer access from Tesla M2070 (GPU3) -> Tesla M2070 (GPU0) : Yes
> Peer access from Tesla M2070 (GPU3) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU3) -> Tesla M2070 (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 4, Device0 = Tesla M2070, Device1 = Tesla M2070, Device2 = Tesla M2070, Device3 = Tesla M2070
Result = PASS

Edit: it compiles for any commit after "fixed namespace" on either branch, but there are still problems with combining Rcpp and CUDA

To make the package compile, it turns out that I just needed to separate my C++ and CUDA code into separate *.cpp and *.cu files. However, when I try the "compiling cpp and cu separately" commit on the master branch, I get

> library(rcppcuda)
> hello()
An object of class "MyClass"
Slot "x":
 [1]  1  2  3  4  5  6  7  8  9 10

Slot "y":
 [1]  1  2  3  4  5  6  7  8  9 10

Error in .Call("someCPPcode", r) : 
  "someCPPcode" not resolved from current namespace (rcppcuda)
>

The error goes away in the withoutCUDA branch in the commit entitled "adding branch withoutCUDA".

> library(rcppcuda)
> hello()
An object of class "MyClass"
Slot "x":
 [1]  1  2  3  4  5  6  7  8  9 10

Slot "y":
 [1]  1  2  3  4  5  6  7  8  9 10

[1] "Object changed."
An object of class "MyClass"
Slot "x":
 [1] 500   2   3   4   5   6   7   8   9  10

Slot "y":
 [1]    1 1000    3    4    5    6    7    8    9   10

>

The only differences between the "compiling cpp and cu separately" commit on master and the "adding branch withoutCUDA" commit on withoutCUDA are

The Makefile and someCUDAcode.cu are gone from withoutCUDA.
In withoutCUDA, all references to someCUDAcode() are gone from someCPPcode.cpp.

Also, it would still be convenient to be able to use CUDA and Rcpp in the same *.cu file. I would really like to know how to fix the "fixed namespace" commit on the master branch.

Answer 1:

Going through your package, there are several things that need to change.

Don't use a 'Makefile'; use a 'Makevars' file instead, which improves compatibility for multiple-architecture builds.
Follow the standard variable names (e.g. CPPC should be CXX). This makes everything play together much better.
Don't try to compile the shared object yourself. R's base makefile provides macros that make this much simpler (e.g. PKG_LIBS, OBJECTS, etc.).
With multiple compilers, you will want to set the OBJECTS macro. This overrides R's default list of object files to link, so make sure you include them all.
You also need (AFAIK) to make the CUDA functions available with extern "C". Put that prefix both on the function definition in the .cu file and on its declaration at the start of your cpp file.

The following Makevars worked for me after I modified CUDA_HOME, R_HOME, and RCPP_INC for my machine (switched back for you here). Note that this is where a configure script is recommended to make the package as portable as possible.

CUDA_HOME = /usr/local/cuda
R_HOME = /apps/R-3.2.0
CXX = /usr/bin/g++

# Libraries to link into the package's shared object
PKG_LIBS= -L$(CUDA_HOME)/lib64 -Wl,-rpath,$(CUDA_HOME)/lib64 -lcudart -d


#########################################

R_INC = $(R_HOME)/include
RCPP_INC = $(R_HOME)/library/Rcpp/include

NVCC = $(CUDA_HOME)/bin/nvcc
CUDA_INC = $(CUDA_HOME)/include 
CUDA_LIB = $(CUDA_HOME)/lib64

LIBS = -lcudart -d
NVCC_FLAGS = -Xcompiler "-fPIC" -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -I$(R_INC)

### Define objects
cu_sources := $(wildcard *.cu)
cu_sharedlibs := $(patsubst %.cu, %.o,$(cu_sources))

cpp_sources := $(wildcard *.cpp)
cpp_sharedlibs := $(patsubst %.cpp, %.o, $(cpp_sources))

OBJECTS = $(cu_sharedlibs) $(cpp_sharedlibs)

all : rcppcuda.so

rcppcuda.so: $(OBJECTS)

# Recipe lines must begin with a tab character, not spaces
%.o: %.cpp $(cpp_sources)
	$(CXX) $< -c -fPIC -I$(R_INC) -I$(RCPP_INC)

%.o: %.cu $(cu_sources)
	$(NVCC) $(NVCC_FLAGS) -I$(CUDA_INC) $< -c

A follow-up point (as you say this is a learning exercise):

A. You aren't using one of the parts of Rcpp that make it such a wonderful package, namely 'attributes'. Here is how your cpp file should look:

#include <Rcpp.h>
using namespace Rcpp;

extern "C"
void someCUDAcode();

//[[Rcpp::export]]
SEXP someCPPcode(SEXP r) {
  S4 c(r);
  double *x = REAL(c.slot("x"));
  int *y = INTEGER(c.slot("y"));
  x[0] = 500.0;
  y[1] = 1000;
  someCUDAcode();
  return R_NilValue;
}

Once Rcpp::compileAttributes() has been run on the package, this automatically generates the corresponding RcppExports.cpp and RcppExports.R files, so you no longer need to write the .Call yourself. You just call the function: .Call('someCPPcode', r) becomes someCPPcode(r) :)

For completeness, here is the updated someCUDAcode.cu file:

__global__ void mykernel(int a){
  int id = threadIdx.x;
  int b = a;
  b++;
  id++;
}


extern "C"
void someCUDAcode() {
  mykernel<<<1, 1>>>(1);
}
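
If you later want the kernel to work on the slot data, one way to keep Rcpp.h out of the .cu file is to pass only plain C types across the extern "C" boundary. The following is a minimal sketch of that idea, not code from your package: the names scaleOnDevice and scaledCopy are hypothetical, and a plain CPU loop stands in for the kernel launch so the sketch builds with g++ alone.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: only a raw pointer, a length, and a scalar cross
// the boundary, so the .cu side never has to include Rcpp.h.
extern "C" void scaleOnDevice(double *x, std::size_t n, double factor);

// Stand-in for the .cu side. A real version would copy x to the device,
// launch a kernel, and copy the result back; here a CPU loop does the work.
extern "C" void scaleOnDevice(double *x, std::size_t n, double factor) {
  assert(x != nullptr || n == 0);  // an empty input may carry a null pointer
  for (std::size_t i = 0; i < n; ++i) {
    x[i] *= factor;
  }
}

// Stand-in for the .cpp side. In the package, the pointer would come from
// something like REAL(c.slot("x")) instead of a std::vector.
std::vector<double> scaledCopy(std::vector<double> x, double factor) {
  if (!x.empty()) {
    scaleOnDevice(x.data(), x.size(), factor);
  }
  return x;
}
```

With this split, only the .cpp file is compiled against the Rcpp headers, and nvcc only ever sees the .cu file, which uses nothing beyond the pointer and length it is handed.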

With respect to a configure file (using autoconf), you are welcome to check out my gpuRcuda package using Rcpp, CUDA, and ViennaCL (a C++ GPU computing library).

Answer 2:

Several packages on CRAN use GPUs via CUDA:

cudaBayesreg
gmatrix
gputools
iFes
permGPU
rpud

I would start with these.

Source: https://stackoverflow.com/questions/30628333/building-a-tiny-r-package-with-cuda-and-rcpp

Tags

cuda

rcpp