tesla

Using CUDA compiled for compute capability 3.7 on Maxwell GPUs?

生来就可爱ヽ(ⅴ<●) · Submitted on 2021-02-16 09:12:13
Question: My development workstations currently have an NVIDIA Quadro K2200 and a K620, both of which have CUDA compute capability 5.0. However, the final production system has a Tesla K80, which has compute capability 3.7. Is it possible to install and develop CUDA programs for compute capability 3.7 on my Quadro GPUs and then move them to the K80 without having to make significant changes? Answer 1: Yes, it's possible. Be sure not to use any compute capability 5.0+ specific features in your code, and you …
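
A minimal build sketch along those lines (the file and output names below are made up; the flags are standard nvcc options): compile a fat binary that carries machine code for both the Kepler K80 (sm_37) and the Maxwell development cards (sm_50), so the same executable runs unchanged on either system.

nvcc -gencode arch=compute_37,code=sm_37 \
     -gencode arch=compute_50,code=sm_50 \
     -o my_app my_kernels.cu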

Concurrent Kernel Launch Example - CUDA

扶醉桌前 · Submitted on 2019-12-25 07:11:08
Question: I'm attempting to implement concurrent kernel launches for a very complex CUDA kernel, so I thought I'd start out with a simple example. It just launches a kernel which does a sum reduction. Simple enough. Here it is: #include <stdlib.h> #include <stdio.h> #include <time.h> #include <cuda.h> extern __shared__ char dsmem[]; __device__ double *scratch_space; __device__ double NDreduceSum(double *a, unsigned short length) { const int tid = threadIdx.x; unsigned short k = length; double *b; b = …
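
The code excerpt above is cut off in this listing, so as a separate, minimal sketch of the usual pattern for concurrent kernel launches (the kernel and buffer names below are invented, not the poster's): create several CUDA streams, launch an independent kernel into each, and synchronize at the end. Kernels issued into different streams may overlap on the device when resources allow, which is visible in a profiler timeline.

#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    // Each thread does a little arithmetic so the kernels take long enough to overlap.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 0.999f + 0.001f;
        data[i] = v;
    }
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 20;

    cudaStream_t streams[nStreams];
    float *buffers[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    // Kernels launched into different streams are allowed to run concurrently.
    for (int s = 0; s < nStreams; ++s)
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}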

CUDA unknown error

好久不见. · Submitted on 2019-12-13 08:28:40
Question: I'm trying to run mainSift.cpp from CudaSift on an Nvidia Tesla M2090. First of all, as explained in this question, I had to change sm_35 to sm_20 in the CMakeLists.txt. Unfortunately, this error is now returned: checkMsg() CUDA error: LaplaceMulti() execution failed in file </ghome/rzhengac/Downloads/CudaSift/cudaSiftH.cu>, line 318 : unknown error. And this is the LaplaceMulti code: double LaplaceMulti(cudaTextureObject_t texObj, CudaImage *results, float baseBlur, float diffScale, float …
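
For "unknown error" reports like the one above, a common first step is to check the return status of every CUDA runtime call and of each kernel launch, so the call that actually fails can be pinpointed. A minimal sketch (the CUDA_CHECK macro and the dummy kernel are illustrative, not part of CudaSift):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print file/line and abort on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void dummyKernel(float *out) { out[threadIdx.x] = (float)threadIdx.x; }

int main()
{
    float *d_out;
    CUDA_CHECK(cudaMalloc(&d_out, 32 * sizeof(float)));

    dummyKernel<<<1, 32>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}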

Python: How do we parallelize a python program to take advantage of a GPU server?

社会主义新天地 · Submitted on 2019-11-29 14:43:38
Question: In our lab, we have an NVIDIA Tesla K80 GPU accelerator server with the following characteristics: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz, 48 CPU processors, 128 GB RAM, and 12 CPU cores, running under 64-bit Linux. I am running the following code, which vertically appends different sets of dataframes and then runs GridSearchCV on a RandomForestRegressor model. The two sample datasets I am considering are found in this link. import sys import imp import glob import os import pandas as pd import math from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature …
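
The code in the listing is truncated, but as a rough sketch of the usual first step for this workload: plain scikit-learn does not execute on the GPU, so the readily available parallelism on this server is across its CPU cores via n_jobs. The synthetic data and the parameter grid below are placeholders, not the poster's:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the question this would be built from the appended dataframes.
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.rand(500)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

# n_jobs=-1 fans the grid-search fits out over all available CPU cores via joblib;
# the Tesla K80 itself is not used by scikit-learn.
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5, n_jobs=-1, verbose=1)
search.fit(X, y)
print(search.best_params_)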

slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice

江枫思渺然 · Submitted on 2019-11-28 02:17:07
Question: I understand CUDA does initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice. The test program: the same binary, built with CUDA 7.0 (compute_35) + Visual Studio 2012 + NSight 4.5, was run on two separate machines (no rebuilding), and cudaSetDevice was called before the first cudaMalloc. On my PC (Win7 + Tesla K20), the first cudaMalloc takes 150 ms; on my server (Win2012 + Tesla K40), it takes 1100 ms. On both machines, subsequent cudaMalloc calls are much faster. My questions are: 1. Why does the K40 take so much longer (1100 ms vs 150 ms) for the first …
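
A small sketch of how that startup cost could be isolated (this is not the poster's test program; names are illustrative): force context creation with cudaFree(0) right after cudaSetDevice and time it separately from the first real cudaMalloc.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Wall-clock time of a single call, in milliseconds.
template <typename F>
static double timeMs(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    cudaSetDevice(0);

    // cudaFree(0) is a common idiom to force CUDA context creation;
    // most of the "slow first call" cost should show up here.
    double initMs = timeMs([] { cudaFree(0); });

    float *d_buf = nullptr;
    double mallocMs = timeMs([&] { cudaMalloc(&d_buf, 1 << 20); });

    printf("context init: %.1f ms, first cudaMalloc: %.1f ms\n", initMs, mallocMs);

    cudaFree(d_buf);
    return 0;
}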
