Mxnet - slow array copy to GPU

Submitted by 。_饼干妹妹 on 2019-12-22 13:52:20

Question


My problem: How should I perform fast matrix multiplication in mxnet?

My concrete problem: array copy to GPU is slow. What can be done about it?

I create random arrays, copy them to the context, and then multiply.

import mxnet as mx
import mxnet.ndarray as nd

from mxnet import profiler

profiler.set_config(aggregate_stats=True)

ctx = mx.cpu()

# create arrays on CPU
profiler.set_state('run')
a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=mx.cpu())
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=mx.cpu())
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))

# copy arrays to the context
profiler.set_state('run')
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))

# multiply arrays
profiler.set_state('run')
c = nd.dot(a_ctx, b_ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))

In this code everything runs on the CPU, so my times are (in seconds):

 0.246
 ~=0
 1.727

When I use ctx=mx.gpu(), the times are:

 0.247
22.059
 0.828

So the bottleneck is the copy from CPU to GPU. It's just ridiculously slow. What can be done about it?

Here is detailed profiler information about this stage:

Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: gpu/0                           2      400000.0000      400000.0000      800000.0000      200000.0000

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXImperativeInvokeEx                    2       22059.0703           0.0360       22059.0352       11029.5352
MXNDArrayGetShape                       2           0.0030           0.0000           0.0030           0.0015
MXNDArrayWaitAll                        1         105.9830         105.9830         105.9830         105.9830
MXNDArrayCreateEx                       2           0.0150           0.0060           0.0090           0.0075
MXNDArrayGetContext                     2           0.0020           0.0000           0.0020           0.0010
MXNet C API Concurrency                22           0.0000           0.0000           0.0010           0.0005
MXNDArrayGetDType                       2           0.0010           0.0000           0.0010           0.0005
MXNet C API Calls                      11           0.0140           0.0040           0.0140           0.0050

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
CopyCPU2GPU                             4         318.4930          53.3060         105.9400          79.6233

Please tell me if more information is needed.


Answer 1:


You can see from your profiling results that CopyCPU2GPU takes only 318 ms. The extra overhead of roughly 22 seconds comes from GPU-context initialization and memory allocation. If you simply run the GPU-copy code a second time in the same script, you should see a much faster result. You can modify your code like this:

# copy arrays to the context
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('run')
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
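
Alternatively (a sketch, not from the original answer), you can pay the one-time initialization cost up front by warming the GPU context with a tiny dummy operation before any timed section; this assumes a CUDA-enabled build with a GPU at index 0 and the same imports as the script above:

# hypothetical warm-up: the first GPU op triggers one-time context
# initialization, so run a tiny throwaway op before anything you time
_ = nd.zeros((1,), ctx=mx.gpu(0))
nd.waitall()  # block until the dummy op (and context init) completes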

Another thing to consider is minimizing CPU->GPU memory copies. In your specific case, you can create the random arrays directly on the GPU instead of the CPU:

a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
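
Putting this together, a minimal sketch (assuming a CUDA-enabled build with a single GPU at mx.gpu(0)) keeps the whole pipeline on the device, so no host-to-device copy is needed:

import mxnet as mx
import mxnet.ndarray as nd

ctx = mx.gpu(0)  # assumed single-GPU setup

# create the operands directly on the GPU; no CPU->GPU copy occurs
a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)

c = nd.dot(a, b)  # the matrix multiplication runs entirely on the GPU
nd.waitall()      # MXNet executes asynchronously; wait for the result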

CUDA memory allocation/deallocation requires some system synchronization, which makes it slow. All DL frameworks take memory management into their own hands by creating a buffer pool that reuses previously allocated buffers and performing memory allocation/deallocation only when absolutely necessary. For example, TensorFlow by default grabs the entire GPU memory in a single allocation and assigns it to tensors internally. MXNet and PyTorch allocate when necessary, but keep released buffers in a pool so they can be reused later.

This behavior of MXNet/PyTorch means that the very first call to create a tensor of a specific size will be slower. But if that tensor is released and a new tensor of similar size is created, the memory comes from the pre-allocated buffer pool rather than from cudaMalloc. You can read about PyTorch's memory management here (https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management), which is somewhat similar to MXNet's.
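
To observe this buffer-pool behavior, you can time two identically sized allocations back to back. This is a rough sketch; the first iteration also includes one-time context initialization, and the exact numbers depend on your hardware and driver:

import time
import mxnet as mx
import mxnet.ndarray as nd

ctx = mx.gpu(0)  # assumes a CUDA-enabled build

for attempt in range(2):
    start = time.time()
    x = nd.zeros((10000, 10000), ctx=ctx)  # ~400 MB of float32
    nd.waitall()  # force the allocation and fill to actually execute
    print('attempt %d: %.3f s' % (attempt, time.time() - start))
    del x  # release the array; MXNet keeps the buffer in its pool

The second iteration should be much faster, since the memory is served from the pool rather than by a fresh cudaMalloc.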



Source: https://stackoverflow.com/questions/57260730/mxnet-slow-array-copy-to-gpu
