问题
I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton()
. I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/rcppcuda.cu to src/rcppcuda.cpp, and comment out the code that defines and calls kernels). But as is, the compilation fails.
I also would like to know how to compile with a Makevars or Makevars.in instead of a Makefile, and in general, try to make this as platform independent as is realistic. I've read about Makevars in the R extensions manual, but I still haven't been able to make it work.
Some of you may suggest rCUDA, but what I'm really after here is improving a big package I've already been developing for some time, and I'm not sure that switching is worth starting again from scratch.
Anyway, here's what happens when I do an R CMD build
and R CMD INSTALL
on this one (master branch, commit entitled "fixed namespace").
* installing to library ‘/home/landau/.R/library’
* installing *source* package ‘rcppcuda’ ...
** libs
** arch -
/usr/local/cuda/bin/nvcc -c rcppcuda.cu -o rcppcuda.o --shared -Xcompiler "-fPIC" -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -I/apps/R-3.2.0/include -I/usr/local/cuda/include
rcppcuda.cu:1:18: error: Rcpp.h: No such file or directory
make: *** [rcppcuda.o] Error 1
ERROR: compilation failed for package ‘rcppcuda’
* removing ‘/home/landau/.R/library/rcppcuda’
...which is strange, because I do include Rcpp.h, and Rcpp is installed.
$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
...
> library(Rcpp)
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rcpp_0.11.6
>
I'm using CentOS,
$ cat /etc/*-release
CentOS release 6.6 (Final)
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
CentOS release 6.6 (Final)
CentOS release 6.6 (Final)
CUDA version 6,
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Thu_Mar_13_11:58:58_PDT_2014
Cuda compilation tools, release 6.0, V6.0.1
and I have access to 4 GPUs of the same make and model.
$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "Tesla M2070"
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 11 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
...
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU3) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU3) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU2) : No
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU3) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU0) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU0) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU2) : No
> Peer access from Tesla M2070 (GPU3) -> Tesla M2070 (GPU0) : Yes
> Peer access from Tesla M2070 (GPU3) -> Tesla M2070 (GPU1) : Yes
> Peer access from Tesla M2070 (GPU3) -> Tesla M2070 (GPU2) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 4, Device0 = Tesla M2070, Device1 = Tesla M2070, Device2 = Tesla M2070, Device3 = Tesla M2070
Result = PASS
Edit: it compiles for any commit after "fixed namespace" on either branch, but there are still problems with combining Rcpp and CUDA
To make the package compile, it turns out that I just needed to separate my C++ and CUDA code into separate *.cpp
and *.cu
files. However, when I try the "compiling cpp and cu separately" commit on the master branch, I get
> library(rcppcuda)
> hello()
An object of class "MyClass"
Slot "x":
[1] 1 2 3 4 5 6 7 8 9 10
Slot "y":
[1] 1 2 3 4 5 6 7 8 9 10
Error in .Call("someCPPcode", r) :
"someCPPcode" not resolved from current namespace (rcppcuda)
>
The error goes away in the withoutCUDA
branch in the commit entitled "adding branch withoutCUDA".
> library(rcppcuda)
> hello()
An object of class "MyClass"
Slot "x":
[1] 1 2 3 4 5 6 7 8 9 10
Slot "y":
[1] 1 2 3 4 5 6 7 8 9 10
[1] "Object changed."
An object of class "MyClass"
Slot "x":
[1] 500 2 3 4 5 6 7 8 9 10
Slot "y":
[1] 1 1000 3 4 5 6 7 8 9 10
>
The only differences between the "compiling cpp and cu separately" commit on master
and the "adding branch withoutCUDA" commit on withoutCUDA
are
- The Makefile and someCUDAcode.cu are gone from
withoutCUDA
. - In
withoutCUDA
, all references tosomeCUDAcode()
are gone from someCPPcode.cpp.
Also, it would still be convenient be able to use CUDA and Rcpp in the same *.cu
file. I would really like to know how to fix the "fixed namespace" commit on the master branch.
回答1:
Going through your package there are multiple aspects that need to be changed.
- You shouldn't use a 'Makefile' but a 'Makevars' file instead to improve compatibility for multiple architecture builds.
- Try to follow the standard variable names (e.g. CPPC should be CXX), this makes everything play together much better.
- Don't try and compile the shared object yourself, there are good macros within the base R makefile that make this much simpler (e.g. PKG_LIBS, OBJECTS, etc.)
- With multiple compilers, you will want to use the OBJECTS macro. Here you will override R's base attempt to set the object files to be linked (make sure you include them all).
- You also need (AFAIK) to make CUDA functions available with
extern "C"
. You will prefix both the function in the.cu
file and when you declare it at the start of yourcpp
file.
The following Makevars
worked for me whereby I modified my CUDA_HOME, R_HOME, and RCPP_INC (switched back for you). Note, this is where a configure
file is recommended to make the package as portable as possible.
CUDA_HOME = /usr/local/cuda
R_HOME = /apps/R-3.2.0
CXX = /usr/bin/g++
# This defines what the shared object libraries will be
PKG_LIBS= -L/usr/local/cuda-7.0/lib64 -Wl,-rpath,/usr/local/cuda-7.0/lib64 -lcudart -d
#########################################
R_INC = /usr/share/R/include
RCPP_INC = $(R_HOME)/library/Rcpp/include
NVCC = $(CUDA_HOME)/bin/nvcc
CUDA_INC = $(CUDA_HOME)/include
CUDA_LIB = $(CUDA_HOME)/lib64
LIBS = -lcudart -d
NVCC_FLAGS = -Xcompiler "-fPIC" -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -I$(R_INC)
### Define objects
cu_sources := $(wildcard *cu)
cu_sharedlibs := $(patsubst %.cu, %.o,$(cu_sources))
cpp_sources := $(wildcard *.cpp)
cpp_sharedlibs := $(patsubst %.cpp, %.o, $(cpp_sources))
OBJECTS = $(cu_sharedlibs) $(cpp_sharedlibs)
all : rcppcuda.so
rcppcuda.so: $(OBJECTS)
%.o: %.cpp $(cpp_sources)
$(CXX) $< -c -fPIC -I$(R_INC) -I$(RCPP_INC)
%.o: %.cu $(cu_sources)
$(NVCC) $(NVCC_FLAGS) -I$(CUDA_INC) $< -c
A follow-up point (as you say this is a learning exercise):
A. You aren't using one of the parts of Rcpp that make it such a wonderful package, namely 'attributes'. Here is how your cpp
file should look:
#include <Rcpp.h>
using namespace Rcpp;
extern "C"
void someCUDAcode();
//[[Rcpp::export]]
SEXP someCPPcode(SEXP r) {
S4 c(r);
double *x = REAL(c.slot("x"));
int *y = INTEGER(c.slot("y"));
x[0] = 500.0;
y[1] = 1000;
someCUDAcode();
return R_NilValue;
}
This will automatically generate the corresponding RcppExports.cpp
and RcppExports.R
files and you no longer need a .Call
function yourself. You just call the function. Now .Call('someCPPcode', r)
becomes someCPPcode(r)
:)
For completeness, here is the updated someCUDAcode.cu
file:
__global__ void mykernel(int a){
int id = threadIdx.x;
int b = a;
b++;
id++;
}
extern "C"
void someCUDAcode() {
mykernel<<<1, 1>>>(1);
}
With respect to a configure file (using autoconf), you are welcome to check out my gpuRcuda package using Rcpp, CUDA, and ViennaCL (a C++ GPU computing library).
回答2:
Several packages on CRAN use GPUs via CUDA:
- cudaBayesreg
- gmatrix
- gputools
- iFes
- permGPU
- rpud
I would start with these.
来源:https://stackoverflow.com/questions/30628333/building-a-tiny-r-package-with-cuda-and-rcpp