Question
I've seen some demos of @cupy.fuse, which is nothing short of a miracle for GPU programming using NumPy syntax. The major problem with CuPy is that each operation, such as an add, is a full kernel launch followed by a kernel free, so a series of adds and multiplies, for example, pays a lot of kernel overhead. (This is why one might be better off using numba @jit.)
@cupy.fuse() appears to fix this by merging all the operations inside the function into a single kernel, dramatically lowering the launch and free costs.
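For illustration, here is a minimal sketch of the pattern being described: the same elementwise expression written as a plain function (one kernel launch per operation) and wrapped in @cupy.fuse() (traced into a single kernel). The function names and array sizes are just placeholders.

```python
import cupy as cp

# Unfused: the multiply and the add each launch (and free) their own kernel.
def saxpy_unfused(a, x, y):
    return a * x + y

# Fused: the same elementwise expression is traced and compiled into one kernel.
@cp.fuse()
def saxpy_fused(a, x, y):
    return a * x + y

x = cp.arange(1024, dtype=cp.float32)
y = cp.ones_like(x)
print(saxpy_fused(2.0, x, y)[:4])   # [1. 3. 5. 7.]
```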
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
- Will cupy.fuse aggressively inline any Python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
  This enhancement log hints at it, but doesn't say whether composed functions end up in the same kernel or are simply allowed when the called functions are also decorated: https://github.com/cupy/cupy/pull/1350
- If so, do I need to decorate those functions with @fuse? I'm thinking that might impair the inlining rather than aid it, since it might render those functions into a non-fusable (maybe non-Python) form.
- If not, could I get automatic inlining by first decorating the function with @numba.jit and then decorating with @fuse? Or would the @jit again render the resulting Python in a non-fusable form?
- What breaks @fuse? What are the pitfalls? Is @fuse experimental and unlikely to be maintained?
References:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py
Answer 1:
(SOME) ANSWERS: I have found answers to some of these questions and am posting them here.
Questions:
1. Fusing kernels is such a huge advance that I don't understand when I would ever not want to use @fuse. Isn't it always better? When is it a bad idea?
Answer: Fuse does not support many useful operations yet. For example, z = cupy.empty_like(x) does not work, nor does referring to globals. Hence it simply cannot be applied universally.
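As a rough illustration of that limitation, the sketch below contrasts a purely elementwise function (which fuses) with the kind of array-allocating call the answer says is unsupported; the unsupported pattern is left commented out.

```python
import cupy as cp

# Pure elementwise arithmetic like this traces and fuses without trouble.
@cp.fuse()
def scaled_diff(x, y):
    return (x - y) * 0.5

# By contrast, per the answer above, an array-allocating call inside a fused
# function is not supported and is expected to fail when traced:
#
# @cp.fuse()
# def broken(x):
#     tmp = cp.empty_like(x)   # unsupported inside @fuse
#     return tmp
```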
I'm also wondering about its composability:
2. Will @fuse inline the functions it finds within the decorated function?
Answer: Looking at timings and NVVM markings, it looks like it does pull in subroutines and fuse them into the kernel. So dividing things into subroutines rather than writing monolithic code will work with fuse.
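A minimal sketch of that composition pattern (the helper name is arbitrary): a plain, undecorated helper whose operations are traced into the caller's fused kernel.

```python
import cupy as cp

# A plain, undecorated Python helper built from elementwise ufuncs.
def softplus(x):
    return cp.log1p(cp.exp(x))

# Per the observation above, the helper's operations are traced along with
# the caller's and end up in the same single fused kernel.
@cp.fuse()
def gated(x, y):
    return softplus(x) * y

x = cp.linspace(-3.0, 3.0, 1024, dtype=cp.float32)
print(gated(x, cp.ones_like(x))[:4])
```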
3. I see that a bug fix in the release notes says that it can now handle calling other functions decorated with @fuse. But this does not say whether their kernels are fused or remain separate.
ANSWER: Looking at NVVM output, it appears they are joined. It's hard to say whether there is some residual overhead, but the timing doesn't show significant overhead that would indicate two separate kernels. The key thing is that it now works. As of CuPy 4.1 you could not call a fused function from a fused function, as the return types were wrong; since 5.1 you can. However, you do not need to decorate those inner functions; it works whether you do or not.
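A small sketch of a fused function calling another fused function, per the answer above (names and values are placeholders):

```python
import cupy as cp

@cp.fuse()
def scale(x, a):
    return x * a

# Per the answer above (CuPy >= 5.1), a fused function can call another
# fused function; the decorator on the inner function is optional.
@cp.fuse()
def scale_and_shift(x, a, b):
    return scale(x, a) + b

x = cp.arange(8, dtype=cp.float32)
print(scale_and_shift(x, 2.0, 1.0))
```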
4. Why isn't it documented?
ANSWER: It appears to have some bugs and some incomplete functionality. The code also advises that the API is subject to change.
However, this is basically a miracle function when it can be used, easily improving speed by an order of magnitude on small to medium sized arrays. So it would be nice if even this alpha version were documented.
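To check that speedup claim on your own hardware, a rough timing sketch along these lines can be used (the array size, iteration count, and expression are arbitrary; results will vary by GPU):

```python
import time
import cupy as cp

def unfused(x, y):
    return cp.sin(x) * y + x

@cp.fuse()
def fused(x, y):
    return cp.sin(x) * y + x

x = cp.arange(100_000, dtype=cp.float32)
y = cp.ones_like(x)

for label, fn in (("unfused", unfused), ("fused", fused)):
    fn(x, y)                              # warm up / trigger compilation
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    for _ in range(1000):
        fn(x, y)
    cp.cuda.Stream.null.synchronize()     # wait for all queued kernels
    print(label, time.perf_counter() - t0, "s for 1000 calls")
```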
Source: https://stackoverflow.com/questions/53639723/where-is-cupy-fuse-cupy-python-decorator-documented