multi-gpu

MPI Receive/Gather Dynamic Vector Length

白昼怎懂夜的黑 submitted on 2019-12-05 10:07:55
I have an application that stores a vector of structs. These structs hold information about each GPU on a system, such as memory and gigaflop/s. There is a different number of GPUs on each system. I have a program that runs on multiple machines at once, and I need to collect this data. I am very new to MPI, but I am able to use MPI_Gather() for the most part; however, I would like to know how to gather/receive these dynamically sized vectors.

class MachineData {
    unsigned long hostMemory;
    long cpuCores;
    int cudaDevices;
public:
    std::vector<NviInfo> nviVec;
    std::vector<AmdInfo> amdVec;
    ...
};
struct
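The usual pattern for this (sketched below, not taken from the thread) is to gather in two steps: first MPI_Gather one integer per rank saying how many bytes that rank will contribute, then build displacements on the root and call MPI_Gatherv. The GpuInfo struct and sizes are hypothetical placeholders, and sending the records as MPI_BYTE assumes they are plain-old-data with no pointers; otherwise you would describe them with MPI_Type_create_struct instead.

#include <mpi.h>
#include <cstdio>
#include <vector>

struct GpuInfo {                 // hypothetical flattened record (POD, no pointers)
    unsigned long memory;
    double gflops;
};

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Pretend each machine detected (rank + 1) GPUs.
    std::vector<GpuInfo> local(rank + 1, GpuInfo{4096ul, 1000.0});

    // Step 1: every rank tells the root how many bytes it will contribute.
    int myBytes = static_cast<int>(local.size() * sizeof(GpuInfo));
    std::vector<int> counts(size), displs(size);
    MPI_Gather(&myBytes, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Step 2: the root builds displacements and receives the variable-length data.
    int total = 0;
    if (rank == 0)
        for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }

    std::vector<char> recvBytes(total);
    MPI_Gatherv(local.data(), myBytes, MPI_BYTE,
                recvBytes.data(), counts.data(), displs.data(), MPI_BYTE,
                0, MPI_COMM_WORLD);

    if (rank == 0) {
        const GpuInfo *all = reinterpret_cast<const GpuInfo *>(recvBytes.data());
        printf("root received %zu GPU records\n", recvBytes.size() / sizeof(GpuInfo));
        (void)all;               // iterate over the flattened records here
    }

    MPI_Finalize();
    return 0;
}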

Poor performance when calling cudaMalloc with 2 GPUs simultaneously

风格不统一 submitted on 2019-12-05 07:57:42
I have an application where I split the processing load among the GPUs on a user's system. Basically, there is one CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread. Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval; here the application is using a single GPU. As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, and I am using the Thrust library for this (thrust::sort_by_key). Also, it looks like thrust::sort_by_key
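A common mitigation (a generic sketch, not code from the thread) is to keep cudaMalloc and cudaFree off the periodic hot path entirely: each per-GPU worker thread binds its device once and allocates its working buffers a single time at start-up, then reuses them on every interval. The buffer size, interval count and loop body below are placeholders.

#include <cuda_runtime.h>
#include <thread>
#include <vector>
#include <cstdio>

static void worker(int device, size_t bytes, int intervals) {
    cudaSetDevice(device);                    // bind this thread to its GPU
    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);       // one allocation at start-up

    for (int i = 0; i < intervals; ++i) {
        // ... launch the kernels / Thrust calls for this interval, reusing d_buf ...
        cudaDeviceSynchronize();
    }
    cudaFree(d_buf);                          // one free at shutdown
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> threads;
    for (int dev = 0; dev < n; ++dev)
        threads.emplace_back(worker, dev, size_t(64) << 20, 10);
    for (auto &t : threads) t.join();
    printf("done\n");
    return 0;
}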

Multiple monitors in .NET

梦想的初衷 submitted on 2019-12-04 20:49:33
Question: Are all displays returned from .NET's Screen.AllScreens regardless of hardware configuration? For example, on a single PC you can have: one video card out to two displays = 2 displays total; two video cards each out to one display = 2 displays total; three video cards each out to two displays = 6 displays total; one Eyefinity card out to six displays (on DisplayPorts). In all these cases, if I use Screen.AllScreens, can I access each display individually? Also, what if I have a card in extended mode, meaning 2 displays
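For reference, Screen.AllScreens reports the monitors Windows currently has attached to the desktop, regardless of how many video cards drive them; it is effectively a wrapper over the Win32 monitor-enumeration API. Below is a minimal native C++ sketch of that same enumeration (hypothetical example code, not from the question), which lists each active display individually:

#include <windows.h>
#include <cstdio>

// Called once per active display attached to the desktop.
static BOOL CALLBACK OnMonitor(HMONITOR hMon, HDC, LPRECT rc, LPARAM count) {
    MONITORINFOEXA info;
    info.cbSize = sizeof(info);
    GetMonitorInfoA(hMon, reinterpret_cast<LPMONITORINFO>(&info));
    printf("%s  %ldx%ld%s\n", info.szDevice,
           rc->right - rc->left, rc->bottom - rc->top,
           (info.dwFlags & MONITORINFOF_PRIMARY) ? "  (primary)" : "");
    ++*reinterpret_cast<int *>(count);
    return TRUE;                              // keep enumerating
}

int main() {
    int count = 0;
    EnumDisplayMonitors(nullptr, nullptr, OnMonitor, reinterpret_cast<LPARAM>(&count));
    printf("%d active display(s)\n", count);
    return 0;
}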

CUDA: Memory copy to GPU 1 is slower in multi-GPU

我的梦境 submitted on 2019-12-04 20:43:36
My company has a setup of two GTX 295s, so a total of 4 GPUs in a server, and we have several servers. GPU 1 specifically was slow in comparison to GPUs 0, 2 and 3, so I wrote a little speed test to help find the cause of the problem.

//#include <stdio.h>
//#include <stdlib.h>
//#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cutil.h>

__global__ void test_kernel(float *d_data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < 10000; ++i) {
        d_data[tid] = float(i * 2.2);
        d_data[tid] += 3.3;
    }
}

int main(int argc, char*
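Since the excerpt cuts off before the timing code, here is a self-contained sketch (not the poster's program) that measures host-to-device copy bandwidth on every GPU in the machine using CUDA events, which is enough to make an unusually slow device such as GPU 1 stand out. The 64 MB transfer size is arbitrary.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = size_t(64) << 20;    // 64 MB per transfer
    int n = 0;
    cudaGetDeviceCount(&n);

    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        float *h = nullptr, *d = nullptr;
        cudaMallocHost((void **)&h, bytes);   // pinned host buffer
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("GPU %d: %.2f GB/s host-to-device\n", dev,
               (bytes / 1.0e9) / (ms / 1.0e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        cudaFreeHost(h);
    }
    return 0;
}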

Tensorflow Java Multi-GPU inference

半腔热情 submitted on 2019-12-04 19:05:22
Question: I have a server with multiple GPUs and want to make full use of them during model inference inside a Java app. By default TensorFlow seizes all available GPUs, but uses only the first one. I can think of three options to overcome this issue: Restrict device visibility at the process level, namely using the CUDA_VISIBLE_DEVICES environment variable. That would require me to run several instances of the Java app and distribute traffic among them. Not a very tempting idea. Launch several sessions inside a

How'd multi-GPU programming work with Vulkan?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-04 10:09:30
Question: Would using multiple GPUs in Vulkan be something like making many command queues and then dividing command buffers between them? There are two problems: In OpenGL, we use GLEW to get functions, and with more than one GPU, each GPU has its own driver. How would we use Vulkan? Would part of the frame be generated with one GPU and the rest with other GPUs, for example using the Intel GPU to render the UI and the AMD or Nvidia GPU to render the game scene on laptops? Or would a frame be generated on one GPU and the next frame on an
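As a point of reference (a minimal sketch, not from the question): in Vulkan there is no GLEW-style per-driver loader problem, because a single VkInstance enumerates every installed GPU as a VkPhysicalDevice, and you then create an independent VkDevice, queues and command buffers for each one you want to use. Explicit schemes such as alternate-frame or split-frame rendering are built on top of that (and, since Vulkan 1.1, on device groups via VK_KHR_device_group). Error handling is omitted below.

#include <vulkan/vulkan.h>
#include <vector>
#include <cstdio>

int main() {
    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ici{};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ici.pApplicationInfo = &app;

    VkInstance instance;
    vkCreateInstance(&ici, nullptr, &instance);

    // One instance sees every GPU, regardless of vendor driver.
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> gpus(count);
    vkEnumeratePhysicalDevices(instance, &count, gpus.data());

    for (VkPhysicalDevice gpu : gpus) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(gpu, &props);
        printf("found GPU: %s\n", props.deviceName);
        // From here you would pick queue families, vkCreateDevice(), and record
        // command buffers independently for this GPU.
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}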

How to do multi GPU training with Keras?

寵の児 submitted on 2019-12-04 01:54:45
I want my model to run on multiple GPUs, sharing parameters but with different batches of data. Can I do something like that with model.fit()? Is there any other alternative? Keras now has (as of v2.0.9) built-in support for device parallelism across multiple GPUs, using keras.utils.multi_gpu_model. Currently it only supports the TensorFlow back-end. Good example here (docs): https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus Also covered here: https://datascience.stackexchange.com/a/25737 Try the make_parallel function in: https://github.com/kuza55/keras

Can not save model using model.save following multi_gpu_model in Keras

时光怂恿深爱的人放手 submitted on 2019-12-03 16:02:27
Following the upgrade to Keras 2.0.9, I have been using the multi_gpu_model utility, but I can't save my models or best weights using model.save('path'). The error I get is TypeError: can't pickle module objects. I suspect there is some problem gaining access to the model object. Is there a workaround for this issue? Workaround: Here's a patched version that doesn't fail while saving:

from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num