cupy

This One Trick Speeds Up NumPy 700x!!!

柔情痞子 submitted on 2021-02-11 17:21:34
As an extension library for the Python language, NumPy supports a large number of multidimensional-array and matrix operations and has brought great benefit to the Python community. With NumPy, data scientists, machine learning practitioners, and statisticians can process large amounts of matrix data in a simple, efficient way. So can NumPy get any faster? This article shows how to use the CuPy library to accelerate NumPy. On its own, NumPy is already a big speedup over plain Python: when you find Python code running slowly, especially code full of for-loops, you can usually move the data processing into NumPy and vectorize it for maximum speed. The catch is that this NumPy acceleration happens only on the CPU, and since consumer-grade CPUs typically have 8 cores or fewer, the available parallelism, and hence the achievable speedup, is limited. This is what motivated a new acceleration tool: the CuPy library.

What is CuPy? CuPy is a library that implements NumPy arrays on NVIDIA GPUs on top of the CUDA GPU libraries. Because the implementation is built on NumPy-style arrays, the GPU's many CUDA cores can deliver much better parallel speedups. CuPy's interface is a mirror of NumPy's, and in most cases it can be used as a direct drop-in replacement: simply swap NumPy code for compatible CuPy code and you get GPU acceleration. CuPy supports most NumPy array operations, including indexing, broadcasting…
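As a minimal sketch of that drop-in claim (the toy function, array size, and comparison are our own illustration, not from the article), the same code can run on CPU or GPU by swapping which module is passed in:

```python
import numpy as np
import cupy as cp

def normalize(xp, n=1_000_000):
    # Identical array code runs on CPU (xp=numpy) or GPU (xp=cupy);
    # only the module object changes.
    x = xp.arange(n, dtype=xp.float64)
    return (x - x.mean()) / x.std()

cpu_result = normalize(np)  # executes on the CPU
gpu_result = normalize(cp)  # executes on the GPU
# cupy.asnumpy copies the GPU result back to host memory.
assert np.allclose(cpu_result, cp.asnumpy(gpu_result))
```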

Cupy freeing unified memory

你说的曾经没有我的故事 submitted on 2021-02-10 14:30:07
Question: I have a problem with freeing allocated memory in CuPy. Due to memory constraints, I want to use unified memory. When I create a variable that is allocated in unified memory and then free it, it is labelled as freed and the pool is reported as empty and reusable, but when I look at a resource monitor, the memory is still not freed. When I create another variable, it also adds to the memory footprint (I thought that perhaps the memory labelled as taken would be reused as is…
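For context, a minimal sketch of the unified-memory pattern the question describes, using CuPy's documented pool APIs (the array shape is illustrative):

```python
import cupy as cp

# Route all CuPy allocations through a pool backed by CUDA
# unified (managed) memory instead of ordinary device memory.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

x = cp.ones((1024, 1024, 64), dtype=cp.float32)  # ~256 MiB, managed
del x                   # the block returns to the pool, not to the OS
pool.free_all_blocks()  # explicitly release the cached blocks

# used_bytes() should now report 0; whether an OS resource monitor
# agrees depends on how the driver accounts for managed memory.
print(pool.used_bytes(), pool.total_bytes())
```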

Calculate partitioned sum efficiently with CuPy or NumPy

强颜欢笑 submitted on 2021-02-10 14:16:24
Question: I have a very long array of length L (let's call it `values`) that I want to sum over, and a sorted 1D array of the same length L that contains N integers with which to partition the original array; let's call this array `labels`. What I'm currently doing is this (`module` being cupy or numpy): `result = module.empty(N)` followed by `for i in range(N): result[i] = values[labels == i].sum()`. But this can't be the most efficient way of doing it (it should be possible to get rid of the for loop, but how?)…
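One common vectorized replacement (our suggestion, not taken from the question) is `bincount` with weights, which both NumPy and CuPy provide under the same signature:

```python
import numpy as np
import cupy as cp

def partitioned_sum(module, values, labels, N):
    # bincount with weights adds values[i] into bucket labels[i],
    # replacing the per-label Python loop with a single call.
    return module.bincount(labels, weights=values, minlength=N)

L, N = 1_000_000, 100
rng = np.random.default_rng(0)
values = rng.random(L)
labels = np.sort(rng.integers(0, N, size=L))

# Check against the question's loop-based version on the CPU...
ref = np.array([values[labels == i].sum() for i in range(N)])
assert np.allclose(partitioned_sum(np, values, labels, N), ref)

# ...and run the identical call on the GPU.
gpu = partitioned_sum(cp, cp.asarray(values), cp.asarray(labels), N)
assert np.allclose(cp.asnumpy(gpu), ref)
```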

cupy.RawModule using name_expressions and nvcc and/or path

社会主义新天地 submitted on 2021-02-08 11:23:21
Question: I am using CuPy for testing CUDA kernels from a library; more specifically, I use cupy.RawModule to exploit the kernels from Python. However, the kernels are templated and enclosed in a namespace. Before the name_expressions parameter to RawModule arrived in CuPy 8.0.0, I had to copy the C++-mangled names manually into the RawModule's get_function() method. Using name_expressions, I thought this should be possible; nevertheless, this requires the code to be compiled from source using the…
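For reference, a minimal sketch of the name_expressions workflow for namespaced template kernels (the kernel, namespace, and instantiations here are invented for illustration):

```python
import cupy as cp

# A templated kernel inside a namespace; in the question's setting
# this code would be read from the library's .cu source file.
code = r'''
namespace mylib {
template<typename T>
__global__ void scale(T* x, T factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}
}  // namespace mylib
'''

# Listing the wanted instantiations in name_expressions lets NVRTC
# emit them, so get_function() accepts the unmangled C++ name.
mod = cp.RawModule(code=code, options=('--std=c++11',),
                   name_expressions=['mylib::scale<float>',
                                     'mylib::scale<double>'])
ker = mod.get_function('mylib::scale<float>')

x = cp.arange(8, dtype=cp.float32)
ker((1,), (8,), (x, cp.float32(3.0), cp.int32(8)))
print(x)  # [ 0.  3.  6. ... 21.]
```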

Using cupy to create a distance matrix from another matrix on GPU

北慕城南 submitted on 2021-01-29 09:40:52
Question: I have written code using numpy that takes an array of size (m x n)... the rows (m) are individual observations comprised of (n) features... and creates a square distance matrix of size (m x m). This distance matrix holds the distance of a given observation from all other observations; e.g., row 0, column 9 is the distance between observation 0 and observation 9. `import numpy as np #import cupy as np def l1_distance(arr): return np.linalg.norm(arr, 1) X = np.random.randint(low=0, high=255, size=`…
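The question's code is cut off above; a hedged reconstruction of the loop-free pairwise L1 distance-matrix pattern it describes (the shape and the broadcasting approach are our assumptions):

```python
import numpy as np
# import cupy as np  # swap the import to run the same code on the GPU

def l1_distance_matrix(X):
    # Broadcasting (m, 1, n) against (1, m, n) produces every pairwise
    # difference at once; summing |diff| over the feature axis yields
    # the (m, m) L1 distance matrix with no Python loops.
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

X = np.random.randint(low=0, high=255, size=(100, 64)).astype(np.float32)
D = l1_distance_matrix(X)
print(D.shape)  # (100, 100)
print(D[0, 9])  # distance between observation 0 and observation 9
```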

Where is @cupy.fuse cupy python decorator documented?

白昼怎懂夜的黑 submitted on 2021-01-27 23:31:05
Question: I've seen some demos of @cupy.fuse, which is nothing short of a miracle for GPU programming using NumPy syntax. The major problem with CuPy is that each operation, such as an add, is a full kernel launch followed by a kernel free, so a series of adds and multiplies, for example, pays a lot of kernel overhead. (This is why one might be better off using numba @jit.) @cupy.fuse() appears to fix this by merging all the operations inside the function into a single kernel, creating a dramatic lowering of the launch and…
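A minimal sketch of the fusion pattern the question refers to (the fused function body is our own example):

```python
import cupy as cp

@cp.fuse()
def fused_axpby(a, x, b, y):
    # Without fusion, each elementwise op below would be a separate
    # kernel launch; @cupy.fuse compiles the whole chain into one.
    return a * x + b * y

x = cp.arange(1_000_000, dtype=cp.float32)
y = cp.arange(1_000_000, dtype=cp.float32)
z = fused_axpby(2.0, x, 3.0, y)  # single fused kernel launch
```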

Check that `name_expressions` is iterable

情到浓时终转凉″ submitted on 2021-01-07 06:53:28
Question: When trying out the new jitify support planned for CuPy v9.x, I found that the name_expressions named argument to cupy.RawModule needs to be iterable for NVRTC not to fail when later calling get_function. This question stems out of "cupy.RawModule using name_expressions and nvcc and/or path" above. `def mykernel(): grid = (...) blocks = (...) args = (...) with open('my_cuda_cpp_code.cu') as f: code = f.read() kers = ('nameofkernel') mod = cp.RawModule(code=code, jitify=True, name_expressions=kers,`…
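The likely culprit in that snippet is a standard Python pitfall rather than anything CuPy-specific: `('nameofkernel')` is just a parenthesized string, not a one-element tuple, so name_expressions receives a plain str. A sketch of the fix:

```python
# Parentheses alone do not make a tuple:
kers = ('nameofkernel')   # type(kers) is str
# The trailing comma does:
kers = ('nameofkernel',)  # type(kers) is tuple
# A list works just as well:
kers = ['nameofkernel']
```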

Why is half-precision complex float arithmetic not supported in Python and CUDA?

拥有回忆 submitted on 2021-01-03 13:52:48
Question: NumPy has complex64, corresponding to two float32s, and it also has float16, but no complex32. How come? I have a signal-processing calculation involving FFTs where I think I'd be fine with complex32, but I don't see how to get there. In particular, I was hoping for a speedup on an NVIDIA GPU with CuPy; however, it seems that float16 is slower on the GPU rather than faster. Why is half precision unsupported and/or overlooked? Also related is why we don't have complex integers, as this may also present…
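There is no complex32 dtype to demonstrate, but one workaround sketch (our suggestion, not an official API) is to store half-precision real and imaginary parts separately and upcast to complex64 only around the transform:

```python
import cupy as cp

# Emulated "complex32" storage: two float16 arrays for real/imag,
# halving memory relative to a complex64 array.
n = 4096
re = cp.random.standard_normal(n).astype(cp.float16)
im = cp.random.standard_normal(n).astype(cp.float16)

# Upcast to complex64 just for the FFT (the standard cupy.fft path
# computes in single precision at minimum).
x = re.astype(cp.float32) + 1j * im.astype(cp.float32)
X = cp.fft.fft(x)

# Downcast the result back to the compact half-precision form.
X_re, X_im = X.real.astype(cp.float16), X.imag.astype(cp.float16)
```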