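1. Abstract
This article walks through building a TensorFlow deep learning platform on a JD Cloud GPU cloud host, and shows how to use the TensorFlow benchmark tool to measure the host's baseline performance, so that readers can use the GPU compute resources offered by cloud providers quickly and economically.
2. Overview of JD Cloud GPU cloud hosts
JD Cloud GPU cloud hosts are built on JD Cloud's virtualized cloud hosts. NVIDIA P40 or V100 GPU cards are assigned to the host in passthrough mode, and a large local data disk is included at no charge, which covers workloads such as deep learning and scientific computing. The following instance types are currently available.
Instance type | GPU | vCPU (cores) | Memory (GB) | Local data disk (GB) |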
---|---|---|---|---|
p.n1p40.3xlarge | 1*P40 | 12 | 48 | 1*960 SSD |
p.n1p40.7xlarge | 2*P40 | 28 | 110 | 2*960 SSD |
p.n1p40.14xlarge | 4*P40 | 56 | 220 | 4*960 SSD |
p.n1p40h.3xlarge | 1*P40 | 12 | 48 | 1*1200 HDD |
p.n1p40h.7xlarge | 2*P40 | 28 | 110 | 2*1200 HDD |
p.n1p40h.14xlarge | 4*P40 | 56 | 220 | 4*1200 HDD |
p.n1v100.2xlarge | 1*V100 | 8 | 44 | 1*6000 HDD |
p.n1v100.5xlarge | 2*V100 | 20 | 110 | 2*6000 HDD |
p.n1v100.10xlarge | 4*V100 | 40 | 220 | 4*6000 HDD |
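As a rough aid for reading the table, the sketch below transcribes it into a Python list and picks the smallest instance type that satisfies a GPU request. The `InstanceType` tuple and the `smallest_instance` helper are hypothetical names introduced only for this illustration; they are not part of any JD Cloud API.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    gpu_model: str
    gpu_count: int
    vcpus: int
    memory_gb: int
    local_disk: str  # copied verbatim from the table, e.g. "1*960 SSD"


# Rows transcribed from the table above.
INSTANCE_TYPES = [
    InstanceType("p.n1p40.3xlarge", "P40", 1, 12, 48, "1*960 SSD"),
    InstanceType("p.n1p40.7xlarge", "P40", 2, 28, 110, "2*960 SSD"),
    InstanceType("p.n1p40.14xlarge", "P40", 4, 56, 220, "4*960 SSD"),
    InstanceType("p.n1p40h.3xlarge", "P40", 1, 12, 48, "1*1200 HDD"),
    InstanceType("p.n1p40h.7xlarge", "P40", 2, 28, 110, "2*1200 HDD"),
    InstanceType("p.n1p40h.14xlarge", "P40", 4, 56, 220, "4*1200 HDD"),
    InstanceType("p.n1v100.2xlarge", "V100", 1, 8, 44, "1*6000 HDD"),
    InstanceType("p.n1v100.5xlarge", "V100", 2, 20, 110, "2*6000 HDD"),
    InstanceType("p.n1v100.10xlarge", "V100", 4, 40, 220, "4*6000 HDD"),
]


def smallest_instance(gpu_model: str, min_gpus: int,
                      disk_kind: Optional[str] = None) -> Optional[InstanceType]:
    """Return the matching instance type with the fewest vCPUs, or None."""
    candidates = [
        it for it in INSTANCE_TYPES
        if it.gpu_model == gpu_model
        and it.gpu_count >= min_gpus
        and (disk_kind is None or it.local_disk.endswith(disk_kind))
    ]
    return min(candidates, key=lambda it: it.vcpus, default=None)


print(smallest_instance("V100", 1))
print(smallest_instance("P40", 2, disk_kind="SSD"))
```

Choosing by vCPU count is only one possible ordering; price would be a better key, but the article does not list prices.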
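The system disk of a JD Cloud GPU cloud host can be either a free local SSD or a JD Cloud cloud disk. For data disks, besides standard cloud disks, you can use the free temporary data disk that comes with the instance, which lowers storage cost. Note that data on the temporary data disk is lost when the cloud host is restarted, and the disk must be mounted again.
For billing, JD Cloud GPU cloud hosts support per-second billing, and private images are free. When a GPU cloud host is not needed, you can therefore save it as a private image first and then delete the host. Data that must be kept long term can be stored on JD Cloud cloud disks or in object storage.
3. Setting up the TensorFlow deep learning environment
3.1 Creating the cloud host
Before creating the cloud host, make sure a VPC and a subnet exist in the target region. Then, on the cloud host creation page, select the desired GPU instance type and bind public bandwidth. For experiments, we recommend the "pay by configuration" billing rule with per-second billing, and the "pay by traffic" billing rule for bandwidth.
The GPU instance types offered may differ between JD Cloud regions (North China-Beijing, East China-Suqian, East China-Shanghai, South China-Guangzhou). If you need a specific instance type, submit a ticket through the console to request technical support. This article uses the p.n1v100.2xlarge instance type (one NVIDIA Tesla V100) in the East China-Shanghai region as the example for the whole TensorFlow setup process.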
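After the GPU cloud host is created, you can download Geekbench, a server benchmarking tool. After extracting it, run ./geekbench4 --sysinfo from the extracted directory to obtain the host's CPU and memory information.
# Download the benchmarking tool
wget http://cdn.geekbench.com/Geekbench-4.3.0-Linux.tar.gz
# Extract the archive and enter the extracted directory
tar zxvf Geekbench-4.3.0-Linux.tar.gz
cd Geekbench-4.3.0-Linux
# Print system information
./geekbench4 --sysinfo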
System Information
Operating System Ubuntu 16.04.5 LTS 4.4.0-62-generic x86_64
Model JD JCloud Iaas Jvirt
Motherboard N/A
Memory 43.2 GB
BIOS SeaBIOS 1.10.2-1.el7
Processor Information
Name Intel Xeon E5-2650 v4
Topology 1 Processor, 4 Cores, 8 Threads
Identifier GenuineIntel Family 6 Model 79 Stepping 1
Base Frequency 2.20 GHz
L1 Instruction Cache 32.0 KB x 8
L1 Data Cache 32.0 KB x 8
L2 Cache 4.00 MB x 4
L3 Cache 16.0 MB
In addition, running the lspci command confirms whether an NVIDIA card is attached to the cloud host.
root@jdcoe-gpu-srv001:~/Geekbench-4.3.0-Linux# lspci |grep NVIDIA
00:07.0 3D controller: NVIDIA Corporation Device 1db4 (rev a1)
root@jdcoe-gpu-srv001:~/Geekbench-4.3.0-Linux# lspci -v -s 00:07.0
00:07.0 3D controller: NVIDIA Corporation Device 1db4 (rev a1)
Subsystem: NVIDIA Corporation Device 1214
Physical Slot: 7
Flags: bus master, fast devsel, latency 0, IRQ 10
Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
Memory at c00000000 (64-bit, prefetchable) [size=16G]
Memory at 1000000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_396, nvidia_396_drm
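The same check can be scripted. The following is a minimal sketch, assuming the one-line-per-device format that lspci prints above; the function name `find_nvidia_devices` is hypothetical and the sample line is copied from the transcript.

```python
import re

# Matches lines such as:
# 00:07.0 3D controller: NVIDIA Corporation Device 1db4 (rev a1)
NVIDIA_LINE = re.compile(
    r"^(?P<slot>[0-9a-f:.]+)\s+(?P<cls>[^:]+):\s+NVIDIA Corporation\s+(?P<device>.+)$"
)


def find_nvidia_devices(lspci_output):
    """Return (slot, device class, device description) for each NVIDIA device line."""
    devices = []
    for line in lspci_output.splitlines():
        match = NVIDIA_LINE.match(line.strip())
        if match:
            devices.append(
                (match.group("slot"), match.group("cls"), match.group("device"))
            )
    return devices


SAMPLE = "00:07.0 3D controller: NVIDIA Corporation Device 1db4 (rev a1)"
print(find_nvidia_devices(SAMPLE))
```

On a real host the input would come from running `lspci` (for example through `subprocess.run`); an empty result suggests that no GPU is passed through to the instance.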
Running fdisk -l shows the /dev/vdb block device, which is the temporary data disk that comes with the JD Cloud GPU cloud host. For mounting instructions, see https://docs.jdcloud.com/cn/cloud-disk-service/linux-partition.
Disk /dev/vda: 40 GiB, 42949672960 bytes, 83886080 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcd5caf0d
Device Boot Start End Sectors Size Id Type
/dev/vda1 * 2048 83886079 83884032 40G 83 Linux
Disk /dev/vdb: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: DAF88709-5654-418B-AFDC-3FE113466B7A
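The instance table lists this disk as 6000 GB while fdisk reports 5.5 TiB. The short calculation below, using only the numbers printed by fdisk, shows that the two figures describe the same capacity in decimal and binary units. The helper names are illustrative.

```python
SECTOR_BYTES = 512

# Values copied from the fdisk output for /dev/vdb and /dev/vda.
vdb_sectors = 11721045168
vdb_bytes = 6001175126016
vda_sectors = 83886080
vda_bytes = 42949672960


def to_tib(num_bytes):
    """Binary tebibytes (2**40 bytes), the unit fdisk prints as TiB."""
    return num_bytes / 2 ** 40


def to_gb(num_bytes):
    """Decimal gigabytes (10**9 bytes), the unit used in the instance table."""
    return num_bytes / 10 ** 9


# Sector count times sector size reproduces the byte counts fdisk reports.
assert vdb_sectors * SECTOR_BYTES == vdb_bytes
assert vda_sectors * SECTOR_BYTES == vda_bytes

print("vdb: %.2f TiB, %.0f GB" % (to_tib(vdb_bytes), to_gb(vdb_bytes)))
print("vda: %.0f GiB" % (vda_bytes / 2 ** 30))
```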
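3.2 Installing the NVIDIA driver
The standard Ubuntu image does not include any GPU-related software. First, download and install the NVIDIA driver with the following commands.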
wget http://us.download.nvidia.com/tesla/396.44/nvidia-diag-driver-local-repo-ubuntu1604-396.44_1.0-1_amd64.deb
dpkg -i nvidia-diag-driver-local-repo-ubuntu1604-396.44_1.0-1_amd64.deb
apt-key add /var/nvidia-diag-driver-local-repo-396.44/7fa2af80.pub
apt-get update
apt-get install cuda-drivers
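When the installation completes, restart the cloud host and run nvidia-smi. It prints the information below, which shows that the GPU card of this cloud host is a Tesla V100-PCIE.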
root@jdcoe-gpu-srv001:~/Geekbench-4.3.0-Linux# nvidia-smi
Fri Nov 23 17:31:47 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:07.0 Off | 0 |
| N/A 36C P0 36W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
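For scripted checks, the table above is awkward to parse. nvidia-smi also offers a CSV query mode through its `--query-gpu` and `--format` options. The sketch below wraps that mode; the function names are hypothetical, and the call is skipped when nvidia-smi is not on the PATH.

```python
import shutil
import subprocess

QUERY_FIELDS = "name,driver_version,memory.total"


def parse_gpu_csv(text):
    """Parse 'csv,noheader' output: one GPU per line, fields separated by commas."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver_version": driver, "memory_total": memory})
    return gpus


def query_gpus():
    """Run nvidia-smi in CSV query mode and return the parsed GPU list."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_gpu_csv(result.stdout)


if shutil.which("nvidia-smi"):
    try:
        print(query_gpus())
    except subprocess.CalledProcessError as err:
        print("nvidia-smi failed:", err)
else:
    print("nvidia-smi not found; the driver may not be installed")
```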
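3.3 Installing CUDA
According to the TensorFlow installation guide (https://tensorflow.google.cn/install/gpu), the TensorFlow release used in this article requires CUDA 9.0.
To simplify the download, this article uses a copy of the CUDA installation package stored in JD Cloud object storage. Download and install it with the following commands.
wget http://solution.oss.cn-north-1.jcloudcs.com/machine-learning/nvidia/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64.deb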
apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
apt-get update
apt-get install cuda
Edit .profile and add the CUDA executable directory to the PATH environment variable.
root@jdcoe-gpu-srv01:~# cat .profile
# ~/.profile: executed by Bourne-compatible login shells.
if [ "$BASH" ]; then
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
fi
export PATH=/usr/local/cuda/bin:${PATH}
mesg n || true
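To confirm that the PATH change took effect in a new login shell, a small check such as the one below can help. It is a sketch only: `cuda_on_path` is a hypothetical helper, and `/usr/local/cuda/bin` is the directory used in the .profile above.

```python
import os

CUDA_BIN = "/usr/local/cuda/bin"


def cuda_on_path(path_value, cuda_bin=CUDA_BIN):
    """Return True if cuda_bin is one of the directories listed in path_value."""
    entries = [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]
    return os.path.normpath(cuda_bin) in entries


print(cuda_on_path(os.environ.get("PATH", "")))
```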
Reconnect to the cloud host over ssh and run the following command to obtain the CUDA version.
root@jdcoe-gpu-srv01:/usr/local/cuda/bin# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
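Because TensorFlow is strict about the CUDA release, it can be worth checking the version programmatically. The sketch below extracts the release number from the last line of the nvcc output shown above and compares it with the required 9.0; `parse_nvcc_release` is a hypothetical helper.

```python
import re

REQUIRED_CUDA = (9, 0)


def parse_nvcc_release(output):
    """Return (major, minor) from 'nvcc --version' output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Last line of the nvcc output in the transcript above.
SAMPLE = "Cuda compilation tools, release 9.0, V9.0.176"

release = parse_nvcc_release(SAMPLE)
print("detected:", release, "matches requirement:", release == REQUIRED_CUDA)
```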
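Next, run deviceQuery, one of the samples shipped with CUDA, to obtain information about the GPU card of the cloud host.
# Copy the CUDA samples to the current directory.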
root@jdcoe-gpu-srv01:~# cuda-install-samples-9.0.sh .
Copying samples to ./NVIDIA_CUDA-9.0_Samples now...
Finished copying samples.
# Enter the deviceQuery directory and build the executable.
root@jdcoe-gpu-srv01:~# cd NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
root@jdcoe-gpu-srv01:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery# make
/usr/local/cuda-9.0/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-9.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o deviceQuery deviceQuery.o
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release
# Run deviceQuery to obtain the GPU device information.
root@jdcoe-gpu-srv001:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla V100-PCIE-16GB"
CUDA Driver Version / Runtime Version 9.2 / 9.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1380 MHz (1.38 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 7
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
root@jdcoe-gpu-srv001:~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery#
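Several headline figures for the card follow directly from the deviceQuery output. The calculation below is a rough sketch with two assumptions that are not stated in the article: each CUDA core is counted as 2 floating-point operations per clock (one fused multiply-add), and the HBM2 memory transfers data twice per memory clock.

```python
# Values copied from the deviceQuery output above.
multiprocessors = 80
cores_per_mp = 64
gpu_clock_hz = 1380 * 10 ** 6      # 1380 MHz
memory_clock_hz = 877 * 10 ** 6    # 877 MHz
memory_bus_bits = 4096

# Assumptions (not from the article): FMA = 2 FLOPs per core per clock,
# and double data rate memory = 2 transfers per memory clock.
FLOPS_PER_CORE_PER_CLOCK = 2
TRANSFERS_PER_CLOCK = 2

cuda_cores = multiprocessors * cores_per_mp
peak_fp32_tflops = cuda_cores * FLOPS_PER_CORE_PER_CLOCK * gpu_clock_hz / 10 ** 12
bandwidth_gb_s = memory_clock_hz * TRANSFERS_PER_CLOCK * (memory_bus_bits // 8) / 10 ** 9

print("CUDA cores:", cuda_cores)
print("theoretical peak FP32: %.2f TFLOPS" % peak_fp32_tflops)
print("theoretical memory bandwidth: %.1f GB/s" % bandwidth_gb_s)
```

These are theoretical ceilings derived from clock rates; measured throughput, such as the benchmark later in this article, will be lower.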
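3.4 Installing cuDNN
TensorFlow depends on NVIDIA's CUDA Deep Neural Network library (cuDNN). To skip the registration step required when downloading from the NVIDIA website, run the following commands to download and install it.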
wget http://solution.oss.cn-north-1.jcloudcs.com/machine-learning/nvidia/libcudnn7_7.3.1.20-1%252Bcuda9.0_amd64.deb
wget http://solution.oss.cn-north-1.jcloudcs.com/machine-learning/nvidia/libcudnn7-dev_7.3.1.20-1%252Bcuda9.0_amd64.deb
wget http://solution.oss.cn-north-1.jcloudcs.com/machine-learning/nvidia/libcudnn7-doc_7.3.1.20-1%252Bcuda9.0_amd64.deb
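# Install the three deb packages downloaded above.
dpkg -i libcudnn7*
Next, run the handwritten digit recognition sample shipped with cuDNN to verify the cuDNN environment.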
cp -r /usr/src/cudnn_samples_v7/ $HOME
cd $HOME/cudnn_samples_v7/mnistCUDNN
make clean && make
cd /root/cudnn_samples_v7/mnistCUDNN
./mnistCUDNN
cudnnGetVersion() : 7301 , CUDNN_VERSION from cudnn.h : 7301 (7.3.1)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 80 Capabilities 7.0, SmClock 1380.0 Mhz, MemSize (Mb) 16160, MemClock 877.0 Mhz, Ecc=1, boardGroupID=0
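The integer 7301 printed by cudnnGetVersion() packs the version number. As a sketch, assuming the cuDNN 7 encoding of major * 1000 + minor * 100 + patch level, it can be decoded as follows; `decode_cudnn_version` is a hypothetical helper.

```python
def decode_cudnn_version(value):
    """Split a cuDNN 7 style version integer into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(value, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(7301))
```

The result agrees with the "(7.3.1)" that the sample prints next to CUDNN_VERSION.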
3.5 Installing TensorFlow
To install TensorFlow, follow the installation guide (https://tensorflow.google.cn/install/pip). The commands are as follows:
sudo apt update
# Install pip3
sudo apt install python3-dev python3-pip
# Install the Python virtual environment tool
sudo pip3 install -U virtualenv # system-wide install
# Create a virtual environment
virtualenv --system-site-packages -p python3 ./venv
# Activate the virtual environment
source ./venv/bin/activate # sh, bash, ksh, or zsh
# Install the GPU build of TensorFlow
pip install --upgrade tensorflow-gpu
After the installation completes, you can query the TensorFlow version. The version installed here is 1.12.0.
(venv) root@jdcoe-gpu-srv001:~# pip show tensorflow-gpu
Name: tensorflow-gpu
Version: 1.12.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /root/venv/lib/python3.5/site-packages
Requires: tensorboard, keras-preprocessing, absl-py, keras-applications, numpy, grpcio, six, protobuf, astor, termcolor, wheel, gast
Required-by:
Finally, run the following command to verify that TensorFlow was installed successfully.
(venv) root@jdcoe-gpu-srv001:~# python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
2018-11-23 18:21:06.649847: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-23 18:21:07.232982: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-23 18:21:07.233502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:00:07.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
2018-11-23 18:21:07.233547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-23 18:21:07.636860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-23 18:21:07.636940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-23 18:21:07.636949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-23 18:21:07.637325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14873 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
tf.Tensor(-1229.7545, shape=(), dtype=float32)
(venv) root@jdcoe-gpu-srv001:~#
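The value printed on the last line changes from run to run, because tf.random_normal draws fresh samples each time. A quick sanity check, sketched below in plain Python, is that the sum of 1000 x 1000 independent standard normal samples has mean 0 and standard deviation sqrt(1,000,000) = 1000, so a result of about -1230 is unremarkable.

```python
import math

n_samples = 1000 * 1000
observed_sum = -1229.7545  # value printed in the transcript above

# The sum of n independent N(0, 1) samples is distributed as N(0, n).
expected_std = math.sqrt(n_samples)
z_score = observed_sum / expected_std

# Two-sided probability of a result at least this far from zero.
p_two_sided = math.erfc(abs(z_score) / math.sqrt(2))

print("standard deviation of the sum:", expected_std)
print("z-score of the observed sum: %.3f" % z_score)
print("two-sided probability: %.3f" % p_two_sided)
```

What confirms the installation is the log line reporting that a TensorFlow device was created on GPU:0, not the numeric value itself.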
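4. Running the benchmark
To evaluate the compute capability of the GPU cloud host, you can run the benchmark tool provided by the TensorFlow project.
# First, download the benchmark tool
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks
# List all remote branches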
(venv) root@jdcoe-gpu-srv001:~/benchmarks# git branch -r
origin/HEAD -> origin/master
origin/cnn_tf_v1.10_compatible
origin/cnn_tf_v1.11_compatible
origin/cnn_tf_v1.12_compatible
origin/cnn_tf_v1.5_compatible
origin/cnn_tf_v1.8_compatible
origin/cnn_tf_v1.9_compatible
origin/cpbr-patch
origin/cpbr-patch-1
origin/data-gen
origin/keras-benchmarks
origin/master
origin/mkl_experiment
origin/tf_benchmark_stage
# Check out the matching version. The TensorFlow environment here is 1.12, so check out the cnn_tf_v1.12_compatible branch to get the benchmark code for TensorFlow 1.12.
git checkout cnn_tf_v1.12_compatible
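The branch choice can be derived from the installed TensorFlow version. The sketch below assumes the `cnn_tf_v<major>.<minor>_compatible` naming pattern visible in the branch list above; `benchmark_branch` is a hypothetical helper, and it returns None when no matching branch exists rather than guessing a fallback.

```python
# Remote branches copied from the 'git branch -r' output above.
REMOTE_BRANCHES = [
    "origin/cnn_tf_v1.10_compatible",
    "origin/cnn_tf_v1.11_compatible",
    "origin/cnn_tf_v1.12_compatible",
    "origin/cnn_tf_v1.5_compatible",
    "origin/cnn_tf_v1.8_compatible",
    "origin/cnn_tf_v1.9_compatible",
    "origin/master",
]


def benchmark_branch(tf_version, remote_branches=REMOTE_BRANCHES):
    """Map a TensorFlow version such as '1.12.0' to its compatible branch name."""
    major, minor = tf_version.split(".")[:2]
    wanted = "cnn_tf_v{}.{}_compatible".format(major, minor)
    if "origin/" + wanted in remote_branches:
        return wanted
    return None


print(benchmark_branch("1.12.0"))
```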
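Run the following command to benchmark the GPU cloud host with the resnet50 model. The output shows that, when training on synthetic ImageNet data with a batch size of 32, this GPU cloud host processes 308.71 images per second.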
(venv) root@jdcoe-gpu-srv001:~/benchmarks# python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32
2018-11-23 18:29:06.607030: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-23 18:29:07.186656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-23 18:29:07.187232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:00:07.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
2018-11-23 18:29:07.187292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-23 18:29:07.582838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-23 18:29:07.582888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-23 18:29:07.582897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-23 18:29:07.583285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14873 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
TensorFlow: 1.12
Model: resnet50
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 32 global
32.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
W1123 18:29:11.799468 140439299352320 tf_logging.py:125] From /root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-11-23 18:29:12.382962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-23 18:29:12.383049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-23 18:29:12.383058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-23 18:29:12.383063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-23 18:29:12.383496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14873 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
I1123 18:29:13.106875 140439299352320 tf_logging.py:115] Running local_init_op.
I1123 18:29:13.153315 140439299352320 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 309.3 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 308.9 +/- 0.4 (jitter = 0.7) 7.997
20 images/sec: 309.1 +/- 0.2 (jitter = 0.7) 8.259
30 images/sec: 309.3 +/- 0.2 (jitter = 0.6) 8.338
40 images/sec: 309.3 +/- 0.1 (jitter = 0.6) 8.192
50 images/sec: 309.3 +/- 0.1 (jitter = 0.6) 7.756
60 images/sec: 309.3 +/- 0.1 (jitter = 0.6) 8.066
70 images/sec: 309.2 +/- 0.1 (jitter = 0.6) 8.484
80 images/sec: 309.1 +/- 0.1 (jitter = 0.7) 8.285
90 images/sec: 309.0 +/- 0.1 (jitter = 0.8) 8.009
100 images/sec: 308.9 +/- 0.1 (jitter = 0.8) 7.991
----------------------------------------------------------------
total images/sec: 308.71
----------------------------------------------------------------
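To compare runs across instance types, the throughput lines can be extracted from the log. The following is a minimal sketch that assumes the line layout shown above; `parse_benchmark` is a hypothetical helper, and the sample contains a few lines copied from the transcript.

```python
import re

STEP_RE = re.compile(r"^(\d+)\s+images/sec:\s+([\d.]+)")
TOTAL_RE = re.compile(r"^total images/sec:\s+([\d.]+)")


def parse_benchmark(log_text):
    """Return ([(step, images_per_sec), ...], total_images_per_sec or None)."""
    steps, total = [], None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        step_match = STEP_RE.match(line)
        if step_match:
            steps.append((int(step_match.group(1)), float(step_match.group(2))))
            continue
        total_match = TOTAL_RE.match(line)
        if total_match:
            total = float(total_match.group(1))
    return steps, total


SAMPLE = """\
1 images/sec: 309.3 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 308.9 +/- 0.4 (jitter = 0.7) 7.997
100 images/sec: 308.9 +/- 0.1 (jitter = 0.8) 7.991
----------------------------------------------------------------
total images/sec: 308.71
"""

steps, total = parse_benchmark(SAMPLE)
batch_size = 32  # from --batch_size 32 in the command above
print("steps parsed:", len(steps))
print("total images/sec:", total)
print("seconds per batch: %.4f" % (batch_size / total))
```

At 308.71 images per second, one batch of 32 images takes roughly 0.1 seconds.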
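5. Deleting the GPU cloud host
When the GPU cloud host is no longer needed, stop it, create an image from it, and then delete the host. The next time GPU resources are required, a new host can be created quickly from that private image, which avoids repeating the installation and configuration steps above.
6. References
Source: CSDN
Author: jdcloud_architecture
Link: https://blog.csdn.net/jdcloud_architecture/article/details/84402112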