Question
My general problem: how should I perform fast matrix multiplication in MXNet?
My concrete problem: copying arrays to the GPU is slow. What can be done about it?
I create random arrays, copy them to the context, and then multiply.
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import profiler
profiler.set_config(aggregate_stats=True)
ctx = mx.cpu()
# create arrays on CPU
profiler.set_state('run')
a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=mx.cpu())
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=mx.cpu())
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
# copy arrays to the context
profiler.set_state('run')
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
# multiply arrays
profiler.set_state('run')
c = nd.dot(a_ctx, b_ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
In this code everything runs on the CPU, so my times are (in seconds):
0.246
~=0
1.727
When I use ctx=mx.gpu(), the times are:
0.247
22.059
0.828
So the bottleneck is the copy from CPU to GPU. It is just ridiculously slow. What can be done about it?
Here is the detailed profiler output for this stage:
Device Storage
=================
Name                         Total Count       Time (ms)   Min Time (ms)   Max Time (ms)   Avg Time (ms)
----                         -----------       ---------   -------------   -------------   -------------
Memory: gpu/0                          2     400000.0000     400000.0000     800000.0000     200000.0000

MXNET_C_API
=================
Name                         Total Count       Time (ms)   Min Time (ms)   Max Time (ms)   Avg Time (ms)
----                         -----------       ---------   -------------   -------------   -------------
MXImperativeInvokeEx                   2      22059.0703          0.0360      22059.0352      11029.5352
MXNDArrayGetShape                      2          0.0030          0.0000          0.0030          0.0015
MXNDArrayWaitAll                       1        105.9830        105.9830        105.9830        105.9830
MXNDArrayCreateEx                      2          0.0150          0.0060          0.0090          0.0075
MXNDArrayGetContext                    2          0.0020          0.0000          0.0020          0.0010
MXNet C API Concurrency               22          0.0000          0.0000          0.0010          0.0005
MXNDArrayGetDType                      2          0.0010          0.0000          0.0010          0.0005
MXNet C API Calls                     11          0.0140          0.0040          0.0140          0.0050

operator
=================
Name                         Total Count       Time (ms)   Min Time (ms)   Max Time (ms)   Avg Time (ms)
----                         -----------       ---------   -------------   -------------   -------------
CopyCPU2GPU                            4        318.4930         53.3060        105.9400         79.6233
Please tell me if more information is needed.
Answer 1:
You can see from your profiling results that CopyCPU2GPU only takes 318 ms. The extra ~22 seconds of overhead comes from GPU-context initialization and memory allocation. If you simply run the GPU-copy code a second time in the same script, you should see a much faster result. You can modify your code like this:
# warm-up copy: the first CPU->GPU transfer also pays for GPU context
# initialization and memory allocation
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
# now profile only the second copy
profiler.set_state('run')
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
Another thing to consider is minimizing CPU->GPU memory copies altogether. In your specific example, you can create the random arrays directly on the GPU instead of on the CPU:
a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
CUDA memory allocation/deallocation requires some system synchronization, which makes it slow. All DL frameworks take memory management into their own hands by creating a buffer pool that reuses previously allocated buffers and performing actual memory allocation/deallocation only when absolutely necessary. For example, TensorFlow by default allocates the entire GPU memory in a single allocation and internally assigns it to tensors. MXNet and PyTorch allocate when necessary, but keep released buffers in a pool so that they can be reused later.
This behavior of MXNet/PyTorch means that the very first call to create a tensor of a given size is slower. But if that tensor is released and a new tensor of a similar size is created, the memory comes from the pre-allocated buffer pool rather than from cudaMalloc. You can read about PyTorch's memory management here (https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management), which is somewhat similar to MXNet's.
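As a rough illustration of this warm-up effect, here is a minimal sketch (assuming a CUDA-enabled MXNet build and an available GPU) that times two allocations of the same size; the first one pays for context initialization and cudaMalloc, while the second is typically served from MXNet's buffer pool:
import time
import mxnet as mx
import mxnet.ndarray as nd

ctx = mx.gpu()

def timed_alloc():
    # allocate a large array on the GPU and block until it is ready
    start = time.time()
    x = nd.zeros((10000, 10000), ctx=ctx)
    x.wait_to_read()
    return time.time() - start  # x is released here and returned to the pool

first = timed_alloc()   # pays for GPU context init + cudaMalloc
second = timed_alloc()  # typically reuses memory from the buffer pool
print('first alloc: %.3f s, second alloc: %.3f s' % (first, second))
The exact numbers depend on your hardware and driver, but the second allocation should be dramatically faster, which is the same effect you see when profiling the copy a second time.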
Source: https://stackoverflow.com/questions/57260730/mxnet-slow-array-copy-to-gpu