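Speeding up NumPy matrix operations with GPUs, parallel computing, and compiler optimization (a collection of related resources)

v1 (mainly about speeding up NumPy operations), 2020/12/16

Summary: GPU-based acceleration of NumPy: cupy and minpy

Compilation-based acceleration of NumPy: numba

Parallel-computing-based acceleration of NumPy: Mars

Both parallel and GPU-capable: Mars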

Contents

- cupy
- minpy and MXNet
- mars
- jit and numba
- RAPIDS

Resources for learning NumPy:

https://numpy.net/

https://www.numpy.org.cn/

http://cs231n.stanford.edu/syllabus.html

cupy

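cupy uses the GPU to accelerate NumPy-style array computation.

cupy documentation: https://docs.cupy.dev/en/stable/

If CUDA is already installed, installing cupy takes a single command (make sure pip is upgraded to the latest version first):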

$ pip install cupy

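Alternatively, you can install a prebuilt package:

# Pick the package that matches your installed CUDA version and install it directly; in my tests this was faster.
# (These package names date from 2020; for a newer CUDA release, check the official cupy installation guide for the matching name.)

# Then run the matching install command
# CUDA 8.0
pip install cupy-cuda80

# CUDA 9.0
pip install cupy-cuda90

# CUDA 9.1
pip install cupy-cuda91

# CUDA 9.2
pip install cupy-cuda92

# CUDA 10.0
pip install cupy-cuda100

# CUDA 10.1
pip install cupy-cuda101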

1. Official installation guide: https://docs.cupy.dev/en/stable/install.html#installing-cupy

2. Official GitHub repository: https://github.com/cupy/cupy

The GitHub repository includes installation instructions, can be run in Docker, and provides code examples of using cupy to speed up computation.

3. cupy API reference: https://docs.cupy.dev/en/stable/reference/index.html

4. Blog posts on using cupy and the speedups it gives:

(A short introductory post in English) https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56

https://www.jiqizhixin.com/articles/2019-08-29-8

https://www.jianshu.com/p/b5a6ee8564df

https://blog.csdn.net/ChenVast/article/details/100140494
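To make the "NumPy on the GPU" idea concrete, here is a minimal sketch. It assumes cupy mirrors the NumPy calls used here (asarray, the @ operator, trace). The names gram_trace and xp are made up for illustration. The snippet falls back to plain NumPy when cupy or a usable GPU is missing, so it shows the coding pattern only, not any particular speedup.

```python
import numpy as np

# Use cupy when it is installed and a GPU is visible; otherwise fall back to NumPy.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    xp = np


def gram_trace(n, seed=0):
    """Return trace(A @ A.T) for a random n x n matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a_host = rng.standard_normal((n, n))  # data is created on the host
    a = xp.asarray(a_host)                # copied to the GPU when xp is cupy
    gram = a @ a.T                        # matrix product on the chosen backend
    return float(xp.trace(gram))          # bring the scalar back as a Python float


if __name__ == "__main__":
    print("backend:", xp.__name__)
    print("trace:", gram_trace(512))
```

Writing the computation against an alias such as xp keeps the same code usable on machines with and without a GPU.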

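minpy and MXNet

In both results and usage, minpy appears to be much the same as cupy. Judging by how much material can be found online, though, minpy is less popular than cupy.

minpy's interface is the same as NumPy's,

so changing only the import statement is enough to run NumPy computations with GPU acceleration.

Some interfaces do not support GPU acceleration, however. In that case minpy automatically runs the function on the CPU, as NumPy would (which effectively reduces bugs). I do not know whether cupy can do the same.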

import minpy.numpy as np

Installation:

(MXNet is a deep learning library backed by Amazon, and minpy runs on top of its backend, so MXNet and CUDA must be installed before using minpy.)

Install MXNet (link below; in my tests the install was quick, about 2 minutes)
Install minpy

1. MXNet installation page: https://mxnet.incubator.apache.org/get_started?

2. minpy official documentation: https://minpy.readthedocs.io/en/latest/index.html

3. The minpy write-up on the Chinese NumPy site: https://www.numpy.org.cn/article/other/minpy-the-numpy-interface-upon-mxnets-backend.html

4. minpy on GitHub (write-up plus code): https://github.com/dmlc/minpy

MXNet on GitHub: https://github.com/apache/incubator-mxnet

5. Blog post on using minpy: https://www.sohu.com/a/124626121_465975

Blog post testing minpy's speedup: https://blog.csdn.net/DarrenXf/article/details/86305215

Blog posts introducing the MXNet deep learning framework: https://www.jiqizhixin.com/graph/technologies/c59f79b4-eb36-48f7-842e-aefd4397799a

https://www.jiqizhixin.com/articles/2016-08-10-2

mars

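Mars is a unified tensor-based framework for large-scale data computation, developed by Qin Xuye, a senior software engineer at Alibaba Cloud, and others.

Mars lets libraries such as NumPy, pandas, and scikit-learn run in parallel and in distributed fashion, using multiple cores to shorten a program's running time.

CPU Time: also called process time, it measures the CPU resources a process uses. It is counted in clock ticks and is made up of user CPU time (User) and system CPU time (Sys); the real elapsed time (Real) is the wall time described next.

Wall Time: the total time the process takes to run, that is, the time that passes on the clock from when the process starts until it finishes. It includes time the process spends blocked or waiting, so its value depends on how many other processes are running on the system at the same time.

With Mars you can see Wall Time < CPU Time. CPU time is summed over all the cores in use, so this shows that Mars is taking advantage of parallel execution on a multi-core processor.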
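The two timings defined above can be measured with the Python standard library alone. The sketch below is only an illustration: the helper names measure, busy, and blocked are made up, and the example is single-threaded, so it shows the blocking case (wall time larger than CPU time) rather than the multi-core case described for Mars.

```python
import time


def measure(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def busy(n):
    """CPU-bound work: the process is computing the whole time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def blocked(seconds):
    """Blocked work: the clock advances but almost no CPU time is used."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    for name, func, arg in (("busy", busy, 2_000_000), ("blocked", blocked, 0.5)):
        _, wall, cpu = measure(func, arg)
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

When a computation runs on several cores within the same process, the CPU figure adds up the time spent on every core, which is how it can exceed the wall figure.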

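1. For examples of using Mars to speed up NumPy through parallel execution, see the official Mars documentation: https://docs.mars-project.io/zh_CN/latest/

2. Installing Mars: https://docs.mars-project.io/zh_CN/latest/installation/index.html

3. Deploying Mars on a cluster: https://docs.mars-project.io/zh_CN/latest/installation/deploy.html#deploy

4. Besides using multiple cores, Mars can also use GPUs (single card, multiple cards on one machine, or distributed) to accelerate NumPy. Mars tensor depends on cupy, so cupy must be installed first; then specify gpu=True. For details see: https://docs.mars-project.io/zh_CN/latest/getting_started/gpu.html#gpu

5. Official Mars GitHub repository: https://github.com/mars-project/mars

6. Introductory posts about Mars: https://blog.csdn.net/weixin_42137700/article/details/85274241

https://www.zhihu.com/question/307050812/answer/561528003

There are few online resources for Mars; nearly all the technical information available is in the official documentation. That documentation is well written, though, and is enough to solve most problems.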

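jit and numba

jit and numba go hand in hand: Numba is a JIT compiler for Python.

JIT (just-in-time compilation): a piece of code is compiled right before it is executed, hence the name "just-in-time".

Numba uses LLVM at run time to compile a Python function into optimized machine code when it is first called, which speeds up the computation. Numerical algorithms compiled with Numba can approach the speed of C or FORTRAN. There is no need to replace the Python interpreter; adding from numba import jit is enough to get started.

According to the official documentation, migrating code to numba in practice also requires a few additions to the code, such as the @jit decorator.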

import numpy as np
import numba
from numba import jit

# numba speeds up for loops very noticeably
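As a minimal sketch of the loop case mentioned above, the snippet below puts the jit decorator on a plain nested for loop. The function pairwise_sq_sum is a made-up example, and nopython=True is my choice here rather than something the notes prescribe. If numba is not installed, a no-op stand-in for jit is used so the code still runs, only without the speedup.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Stand-in so the sketch runs without numba; it compiles nothing.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def pairwise_sq_sum(x):
    """Sum of (x[i] - x[j])**2 over all ordered pairs, written as explicit loops."""
    total = 0.0
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            d = x[i] - x[j]
            total += d * d
    return total


if __name__ == "__main__":
    data = np.arange(2000, dtype=np.float64)
    # The first call includes compilation time; later calls reuse the compiled code.
    print(pairwise_sq_sum(data))
```

When timing a numba function, exclude the first call, since that is when the compilation happens.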

1. Blog posts and documentation on JIT: https://blog.csdn.net/shenwansangz/article/details/95601232

https://developer.ibm.com/zh/articles/j-lo-just-in-time/

2. Introductory posts about numba: https://www.cnblogs.com/zhuwjwh/p/11401215.html

https://zhuanlan.zhihu.com/p/68720474 (this one introduces LLVM)

https://zhuanlan.zhihu.com/p/60994299

Blog posts on installing numba: https://blog.csdn.net/marchphy/article/details/52207878

https://www.jianshu.com/p/5341ad607b71

3. numba official documentation: https://numba.readthedocs.io/en/stable/index.html

4. numba official website: https://numba.pydata.org/

5. numba GitHub repository: https://github.com/numba/numba

RAPIDS

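My impression is that RAPIDS is more about using the GPU to accelerate pandas and scikit-learn, and thereby machine learning.

Within RAPIDS, cuDF is roughly a GPU-accelerated pandas and cuML is roughly a GPU-accelerated scikit-learn.

RAPIDS does not seem to offer acceleration for NumPy itself.

1. RAPIDS official website: https://rapids.ai/index.html

2. RAPIDS installation page: https://rapids.ai/start.html#get-rapids

3. RAPIDS documentation: https://docs.rapids.ai/

4. RAPIDS on GitHub: https://github.com/rapidsai

5. Introductory posts about RAPIDS: https://www.sohu.com/a/283723350_100007018

https://www.zhihu.com/question/304042299

https://www.datalearner.com/blog/1051562381920769 (this post is a good one; it also explains how GPU acceleration works)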

Source: oschina

Link: https://my.oschina.net/u/4360870/blog/4816244
