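1. Specifying GPU IDs when training a model
- Make only GPU 0 visible; its device name is "/gpu:0" ("cuda:0" in PyTorch):
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
- Make GPUs 0 and 1 visible; their names are "/gpu:0" and "/gpu:1", in that order:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
The order sets the priority: device 0 is used first, then device 1.
- The same variable can also be set outside the training script:
CUDA_VISIBLE_DEVICES=0,1 python train.py
Note that if you use cards 6 and 7 of an 8-card machine,
CUDA_VISIBLE_DEVICES=6,7 python train.py
the visible cards are renumbered from 0, so model parallelism still refers to them as 0 and 1:
model = nn.DataParallel(model, device_ids=[0, 1])
Also note that the statement selecting the GPUs must come first, before any operation on the network model.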
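The renumbering caused by CUDA_VISIBLE_DEVICES is easy to get wrong, so here is a minimal sketch of the mapping. The helper name `logical_to_physical` is hypothetical (it is not part of PyTorch or CUDA) and it assumes the variable holds a comma-separated list of integer GPU ids, as in the examples above.

```python
def logical_to_physical(visible_devices):
    """Map the index a program sees (0, 1, ...) to the physical GPU id.

    Hypothetical helper for illustration only: it assumes a comma-separated
    list of integer ids, e.g. the value "6,7".
    """
    ids = [part.strip() for part in visible_devices.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}


mapping = logical_to_physical("6,7")
print(mapping)  # {0: 6, 1: 7}
```

With "6,7" the program sees devices 0 and 1, which is why `device_ids=[0, 1]` is still the right argument for `nn.DataParallel`.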
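2. Viewing the input and output details of each layer
- 1. Install torchsummary or torchsummaryX (pip install torchsummary);
- 2. Usage example:
import torch
from torchvision import models
vgg16 = models.vgg16()
vgg16 = vgg16.cuda()
# 1. Using torchsummary
from torchsummary import summary
summary(vgg16, (3, 224, 224))  # (3, 224, 224) is the network's input size
# 2. Using torchsummaryX
from torchsummaryX import summary as summaryX
inputx = torch.randn(1, 3, 224, 224).cuda()  # must be on the same device as the model
summaryX(vgg16, inputx)
The output (shown as a figure in the original post) lists the output shape of each layer and the model's computational cost.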
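3. Gradient clipping: guarding against exploding gradients during optimization
import torch
import torch.nn as nn
...
outputx = model(inputx)
loss = criterion(outputx, target)  # criterion and target are defined in the elided part
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()
Arguments of nn.utils.clip_grad_norm_:
- parameters: an iterable of tensors whose gradients will be normalized;
- max_norm: the maximum norm of the gradients;
- norm_type: the type of norm to use, L2 by default;
- Note that on some tasks gradient clipping costs a lot of extra computation time.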
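To see what the clipping call actually does, the following sketch runs one training step on a toy model. The model, the data and the artificially scaled loss are assumptions made for illustration; only `nn.utils.clip_grad_norm_` comes from the text above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputx = torch.randn(8, 4)
target = torch.randn(8, 2)

optimizer.zero_grad()
# Scale the loss up so the raw gradient norm is well above the threshold.
loss = 1000.0 * nn.functional.mse_loss(model(inputx), target)
loss.backward()

max_norm = 1.0
# The return value is the total gradient norm measured before clipping.
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm, norm_type=2)
# Recompute the total L2 norm from the gradients after clipping.
norm_after = torch.norm(torch.stack([p.grad.norm(2) for p in model.parameters()]), 2)
optimizer.step()

print(float(norm_before), float(norm_after))
```

The second printed value should not exceed `max_norm`, while the first shows how large the gradients were before they were rescaled.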
4. Adding a batch dimension to a single image
During training the input has shape (batch_size, c, h, w), but at test time a single image has shape (c, h, w), so a dimension has to be added.
import cv2
import torch
import numpy as np
####### NumPy-based method #########
# Method 1
image = cv2.imread(imgpath)  # imgpath: path to an image file; cv2 returns (h, w, c)
print(image.shape)
image = image[np.newaxis, :, :, :]
print(image.shape)
####### PyTorch-based methods #########
# Method 2
image = cv2.imread(imgpath)
image = torch.tensor(image)
print(image.shape)
image = image.view(1, *image.shape)
print(image.shape)
# Method 3
image = cv2.imread(imgpath)
image = torch.tensor(image)
print(image.shape)
image = image.unsqueeze(dim=0)
print(image.shape)
tensor.unsqueeze(dim) inserts a dimension of size 1 at position dim. tensor.squeeze(dim) removes dimension dim if its size is 1; if its size is greater than 1, squeeze() has no effect. When dim is omitted, all dimensions of size 1 are removed.
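The squeeze and unsqueeze rules can be checked on dummy tensors without reading an image. This is a small sketch; the zero tensors stand in for real image data, and the `permute` step is an addition that converts the (h, w, c) layout returned by cv2 into (c, h, w).

```python
import torch

image = torch.zeros(224, 224, 3)   # (h, w, c), the layout cv2.imread returns
image = image.permute(2, 0, 1)     # reorder to (c, h, w)
batch = image.unsqueeze(0)         # add the batch dimension
print(batch.shape)                 # torch.Size([1, 3, 224, 224])

x = torch.zeros(1, 3, 1, 5)
print(x.squeeze().shape)           # torch.Size([3, 5])
print(x.squeeze(1).shape)          # torch.Size([1, 3, 1, 5]), dim 1 has size 3
print(x.squeeze(2).shape)          # torch.Size([1, 3, 5])
```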
5. One-hot encoding
PyTorch's cross-entropy loss takes class-index labels directly, so no manual one-hot conversion is needed there; with MSE the labels must be converted to one-hot by hand. A conversion example:
import torch
class_num = 8
batch_size = 4
def one_hot(label):
    """
    Convert the labels of one batch to one-hot
    Argument:
        label: (type, tensor), the gt label, shape: (batch_size,)
    Return:
        one_hot_out: (type, tensor), the one-hot label, shape: (batch_size, class_num)
    """
    label = label.view(batch_size, 1)
    m_zeros = torch.zeros(batch_size, class_num)
    one_hot_out = m_zeros.scatter_(1, label, 1)  # (dim, index, value)
    return one_hot_out
label = torch.randint(0, class_num, (batch_size,))  # random integer labels in [0, class_num)
print(one_hot(label))
Since PyTorch 1.1, torch.nn.functional.one_hot can be called directly:
import torch
import torch.nn.functional as F
tensor = torch.arange(0, 5) % 3
one_hot = F.one_hot(tensor)
# F.one_hot infers the number of classes as the largest label + 1; it can also be set explicitly
one_hot = F.one_hot(tensor, num_classes=10)
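As a quick cross-check, the manual scatter_ approach and F.one_hot should agree. The fixed label values below are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

class_num = 8
label = torch.tensor([0, 3, 7, 3])

# Manual version: write a 1 at each row's label index.
manual = torch.zeros(label.size(0), class_num).scatter_(1, label.view(-1, 1), 1)
# Built-in version: returns an int64 tensor, so cast before comparing.
builtin = F.one_hot(label, num_classes=class_num).float()

print(torch.equal(manual, builtin))            # True
print(F.one_hot(torch.tensor([0, 2])).shape)   # torch.Size([2, 3])
```

The last line shows the inference rule: with labels 0 and 2 and no `num_classes`, the result has three columns.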
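6. Preventing out-of-memory errors during validation
Validation needs no differentiation, i.e. no gradient computation. Disabling autograd speeds things up and saves memory; leaving it on may exhaust GPU memory:
model.eval()
with torch.no_grad():
    outputx = model(inputx)  # run the validation forward passes inside this block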
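A minimal validation sketch, assuming a toy model with a Dropout layer, shows both effects: outputs carry no autograd history, and evaluation mode makes repeated predictions identical.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
inputx = torch.randn(16, 4)

model.eval()                      # disable Dropout, use BatchNorm running statistics
with torch.no_grad():             # do not record operations for autograd
    out_a = model(inputx)
    out_b = model(inputx)

print(out_a.requires_grad)        # False
print(torch.equal(out_a, out_b))  # True
model.train()                     # switch back before resuming training
```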
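7. Learning-rate decay strategy
Adjust the learning rate dynamically during training to avoid getting stuck in a local optimum.
import torch
import torch.optim as optim
from torch.optim import lr_scheduler
# init optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, 10, 0.1)  # every 10 epochs, multiply the learning rate by 0.1
# train process
for epoch in range(n_epoch):
    ...  # train for one epoch, calling optimizer.step() on every batch
    scheduler.step()  # since PyTorch 1.1, call this after the optimizer updates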
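The effect of StepLR is easiest to see by recording the learning rate per epoch. The sketch below uses a single dummy parameter and a short schedule (step_size=2), both chosen only for illustration.

```python
import torch
from torch import optim
from torch.optim import lr_scheduler

param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])  # rate used during this epoch
    optimizer.zero_grad()
    (param ** 2).sum().backward()
    optimizer.step()    # update the weights first
    scheduler.step()    # then advance the schedule once per epoch

print(lrs)  # approximately [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]
```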
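8. Freezing the parameters of some layers during training
When loading a pretrained model, or fine-tuning a classification model in transfer learning, the first few layers often need to be frozen so that their features stay fixed during training.
import torch.optim as optim
from torchvision import models
net = models.vgg16()
for name, value in net.named_parameters():
    print('name: {0}, \t grad: {1}'.format(name, value.requires_grad))
# use the names printed above; these two belong to the first conv layer of torchvision's vgg16
no_grad = ['features.0.weight',
           'features.0.bias'
           ]
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True
# define the optimizer
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)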
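To confirm that freezing works, one can compare weights before and after an optimizer step. This sketch swaps vgg16 for a tiny nn.Sequential (an assumption made to keep it fast); in that container the first layer's parameters are named "0.weight" and "0.bias".

```python
import torch
import torch.nn as nn
from torch import optim

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer only.
for name, value in net.named_parameters():
    value.requires_grad = not name.startswith("0.")

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

frozen_before = net[0].weight.clone()
head_before = net[2].weight.clone()

loss = net(torch.randn(16, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(net[0].weight, frozen_before))  # True: frozen layer unchanged
print(torch.equal(net[2].weight, head_before))    # False: trainable layer updated
```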
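9. Setting different learning rates for different layers during training
Depending on how optimization goes, different layers may need different learning rates. The code is as follows:
import torch.optim as optim
from torchvision import models
net = models.vgg16()
for name, value in net.named_parameters():
    print('name: {}'.format(name))
# split the layers by keyword:
# feature layers: fine-tune; classifier layers: train from scratch
conv_params = []
fc_params = []
for name, params in net.named_parameters():
    if name.startswith('features'):  # the conv layers of torchvision's vgg16 live under 'features'
        conv_params += [params]
    else:
        fc_params += [params]
# define the optimizer
optimizer = optim.Adam([
    {'params': conv_params, 'lr': 1e-4},
    {'params': fc_params, 'lr': 1e-2}], weight_decay=1e-3)
The model's layers are split into two parts stored in a list; each part corresponds to one dictionary above, and each dictionary sets its own learning rate. When both parts share some other option, put it outside the list as a global option, like 'weight_decay' above. A global learning rate can also be set outside the list: a part whose dictionary sets a local learning rate uses that one, and any other part falls back to the global learning rate:
optimizer = optim.Adam([{'params': conv_params, 'lr': 1e-4}, {'params': fc_params}], lr=1e-2, weight_decay=1e-3)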
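The fallback rule can be verified by inspecting `optimizer.param_groups`. The `TinyNet` class below is a hypothetical stand-in for vgg16, used only so the example stays small.

```python
import torch.nn as nn
from torch import optim


class TinyNet(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        x = self.conv(x).mean(dim=(2, 3))  # global average pooling
        return self.fc(x)


net = TinyNet()
conv_params = [p for n, p in net.named_parameters() if n.startswith("conv")]
fc_params = [p for n, p in net.named_parameters() if not n.startswith("conv")]

optimizer = optim.Adam(
    [{"params": conv_params, "lr": 1e-4},  # local learning rate
     {"params": fc_params}],               # falls back to the global lr below
    lr=1e-2, weight_decay=1e-3)

for group in optimizer.param_groups:
    print(group["lr"], group["weight_decay"])
```

The first group keeps its local rate of 1e-4, the second picks up the global 1e-2, and both share the global weight decay.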
10. Saving and loading models
During training the model needs to be saved, and a trained model has to be loaded before it is used. PyTorch offers two main approaches: 1. save and load the whole model; 2. save and load only the model parameters.
1. Basic usage
- Save and load the whole model (network structure plus parameters; slower):
# save model
torch.save(model, 'net.pkl')
# load model
model = torch.load('net.pkl')  # the model class must be defined or importable
- Save and load only the model parameters (faster, smaller, recommended):
# save model parameters
torch.save(model.state_dict(), 'net_params.pkl')
# load model parameters: build the model first, then load the parameters
model = Net()
state_dict = torch.load('net_params.pkl')
model.load_state_dict(state_dict)
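2. Saving and loading a custom checkpoint
A saved checkpoint file is essentially a dictionary that usually contains:
a. Network structure: input size, output size and hidden-layer information, so the model can be rebuilt at load time;
b. Model weights: the learnable parameters of every layer after training, obtained by calling state_dict() on the model instance, e.g. the model.state_dict() used when saving only the weights;
c. Optimizer state: to resume training after saving, the optimizer's state and hyperparameters must be saved too, obtained by calling state_dict() on the optimizer instance;
d. Other information: e.g. hyperparameters such as epoch and batch_size.
This lets you choose exactly what to save, as shown below.
# saving a checkpoint, assuming the network class is named Net
checkpoint = {
    'model': Net(),
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'epoch': epoch
}
torch.save(checkpoint, 'checkpoint.pkl')
# load the model info
def load_checkpoint(filepath):
    # on PyTorch >= 2.6, pass weights_only=False to unpickle the 'model' object
    checkpoint = torch.load(filepath)
    model = checkpoint['model']  # network structure
    model.load_state_dict(checkpoint['model_state_dict'])  # load the model parameters
    optimizer = optim.SGD(model.parameters(), lr=0.01)  # placeholder lr, replaced by the loaded state
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])  # load the optimizer state
    for params in model.parameters():
        params.requires_grad = False
    model.eval()
    return model
model = load_checkpoint('checkpoint.pkl')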
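If the model is loaded for testing, set requires_grad of every layer to False to fix the parameters, and call model.eval() to put the model in evaluation mode. This mainly fixes the behavior of Dropout and BatchNormalization; otherwise the predictions differ from run to run. To continue training, call model.train() to make sure the network is in training mode.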
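For resuming training, a checkpoint that holds only state dictionaries and plain values is enough. The following sketch round-trips one through an in-memory buffer; `build_model` is a hypothetical stand-in for the Net class, and the buffer stands in for a file such as 'checkpoint.pkl'.

```python
import io
import torch
import torch.nn as nn
from torch import optim


def build_model():
    # Hypothetical stand-in for the Net class used in the text.
    return nn.Linear(4, 2)


model = build_model()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

buffer = io.BytesIO()  # stands in for a checkpoint file on disk
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": 5}, buffer)

buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu")
restored = build_model()
restored.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer = optim.SGD(restored.parameters(), lr=0.1)  # lr is replaced by the loaded state
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1

restored.train()  # resume training; call restored.eval() for inference instead
print(start_epoch, restored_optimizer.param_groups[0]["lr"])  # 6 0.01
```

Rebuilding the network in code rather than pickling the module object keeps the file loadable even when torch.load only accepts tensors and plain containers.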
3. Saving and loading across devices
- Trained on GPU, loaded on CPU (Save on GPU, Load on CPU):
device = torch.device('cpu')
model = Net()
# load all tensors onto the CPU device
model.load_state_dict(torch.load('net_params.pkl', map_location=device))
# <===> model.load_state_dict(torch.load('net_params.pkl', map_location='cpu'))
- Trained on GPU, loaded on GPU (Save on GPU, Load on GPU):
device = torch.device('cuda')
model = Net()
model.load_state_dict(torch.load('net_params.pkl'))
model.to(device)
The map_location argument has no effect on the model here; call model.to(torch.device("cuda")) to convert the model's parameters to CUDA tensors.
The data fed to the model must also be moved with data = data.to(device), i.e. from CPU to GPU. Note that my_tensor.to(device) returns a copy of my_tensor on the GPU and does not overwrite my_tensor, so reassign it manually: my_tensor = my_tensor.to(device)
- Trained on CPU, loaded on GPU (Save on CPU, Load on GPU):
device = torch.device('cuda')
model = Net()
model.load_state_dict(torch.load('net_params.pkl', map_location='cuda:0'))
model.to(device)
11. A few GPU-related functions
# check whether CUDA is available
print(torch.cuda.is_available())
# get the number of GPUs
print(torch.cuda.device_count())
# get the GPU name
print(torch.cuda.get_device_name(0))
# get the index of the current GPU device (starts from 0)
print(torch.cuda.current_device())
# move the model and data from CPU to GPU
use_cuda = torch.cuda.is_available()
# Method 1
if use_cuda:
    data = data.cuda()
    model.cuda()
# Method 2
device = torch.device('cuda' if use_cuda else 'cpu')
data = data.to(device)
model.to(device)
12. Visualizing the model's feature maps during inference
- Wrap the model (collect the feature maps while running the forward pass layer by layer):
import os
import cv2
import numpy as np
import torch
import torchvision.models as models
class FeatureVisualization:
    input_size = 256
    def __init__(self, imgpath='', layers_idx=(1, 2), save_features_dir='/'):
        self.imgpath = imgpath
        self.layers_idx = layers_idx
        self.save_features_dir = save_features_dir
        self.net = models.vgg16()
    @staticmethod
    def preprocess_image(imgpath):
        assert os.path.isfile(imgpath), "The image {%s} must exist!" % imgpath
        img = cv2.imread(imgpath)
        # resize
        size = FeatureVisualization.input_size
        img = cv2.resize(img, (size, size))
        # normalize to [0, 1]
        img = (img / 255.).astype('float32').transpose((2, 0, 1))[np.newaxis, :, :, :]  # (1, 3, 256, 256)
        # <===>
        # img = (img / 255.).astype('float32').swapaxes(1, 2).swapaxes(0, 1)
        # img = np.expand_dims(img, axis=0)
        img = torch.from_numpy(img)
        return img
    def get_features(self):
        """Extract features"""
        features = {}
        inputx = self.preprocess_image(self.imgpath)
        print('inputx shape', inputx.shape)
        model = self.net.eval()
        if torch.cuda.is_available():
            inputx = inputx.cuda()
            model = model.cuda()
        x = inputx
        # run the convolutional part layer by layer and keep the requested outputs
        with torch.no_grad():
            for index, module in enumerate(model.features):
                x = module(x)
                if index in self.layers_idx:
                    features[str(index)] = x
        return features
    def save_features(self):
        """Save features"""
        features = self.get_features()
        for name, feature in features.items():
            feature = self.process_feature(feature)
            cv2.imwrite(os.path.join(self.save_features_dir, name + '.jpg'), feature)
    @staticmethod
    def process_feature(feature):
        """
        Normalize the feature
        Arguments:
            feature: (type, tensor(b, c, h, w)), normalize to (0, 255)
        """
        feature = feature.cpu().detach().numpy()
        # use sigmoid to map to [0, 1]
        feature = 1.0 / (1 + np.exp(-1 * feature))
        # average over the channels of the first sample so the result is a 2-D image
        feature = np.round(feature[0].mean(axis=0) * 255).astype('uint8')
        return feature
if __name__ == '__main__':
    # placeholder paths; replace them with a real image and output directory
    featurevisualization = FeatureVisualization(imgpath='image.jpg', save_features_dir='.')
    featurevisualization.save_features()
- Use hooks: with PyTorch hooks you can read or modify the values and gradients of intermediate layers without changing the network structure between input and output. (The order of the different hooks relative to forward and backward is easiest to see in the __call__ function of nn.Module.) A hook registered with register_forward_hook runs after forward has been called.
import os
import cv2
import numpy as np
import torch
import torchvision.models as models
class FeatureVisualization:
    input_size = 256
    def __init__(self, imgpath='', layers_idx=(1, 2), save_features_dir='/'):
        self.imgpath = imgpath
        self.layers_idx = layers_idx
        self.save_features_dir = save_features_dir
        self.net = models.vgg16()
    @staticmethod
    def preprocess_image(imgpath):
        assert os.path.isfile(imgpath), "The image {%s} must exist!" % imgpath
        img = cv2.imread(imgpath)
        # resize
        size = FeatureVisualization.input_size
        img = cv2.resize(img, (size, size))
        # normalize to [0, 1]
        img = (img / 255.).astype('float32').transpose((2, 0, 1))[np.newaxis, :, :, :]  # (1, 3, 256, 256)
        # <===>
        # img = (img / 255.).astype('float32').swapaxes(1, 2).swapaxes(0, 1)
        # img = np.expand_dims(img, axis=0)
        img = torch.from_numpy(img)
        return img
    def get_features(self):
        """Extract features"""
        features = {}
        inputx = self.preprocess_image(self.imgpath)
        print('inputx shape', inputx.shape)
        model = self.net.eval()
        if torch.cuda.is_available():
            inputx = inputx.cuda()
            model = model.cuda()
        # closure
        def get_activation(name):
            def hook(module, input, output):
                features[name] = output.detach()
            return hook
        # register one hook per requested layer of model.features and keep every handle
        handles = []
        for layer_idx in self.layers_idx:
            handles.append(model.features[layer_idx].register_forward_hook(get_activation(str(layer_idx))))
        with torch.no_grad():
            outputx = model(inputx)
        for handle in handles:
            handle.remove()
        return features
    def save_features(self):
        """Save features"""
        features = self.get_features()
        for name, feature in features.items():
            feature = self.process_feature(feature)
            cv2.imwrite(os.path.join(self.save_features_dir, name + '.jpg'), feature)
    @staticmethod
    def process_feature(feature):
        """
        Normalize the feature
        Arguments:
            feature: (type, tensor(b, c, h, w)), normalize to (0, 255)
        """
        feature = feature.cpu().detach().numpy()
        # use sigmoid to map to [0, 1]
        feature = 1.0 / (1 + np.exp(-1 * feature))
        # average over the channels of the first sample so the result is a 2-D image
        feature = np.round(feature[0].mean(axis=0) * 255).astype('uint8')
        return feature
if __name__ == '__main__':
    # placeholder paths; replace them with a real image and output directory
    featurevisualization = FeatureVisualization(imgpath='image.jpg', save_features_dir='.')
    featurevisualization.save_features()
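The hook mechanism itself can be tried without an image or torchvision. Below is a minimal sketch on a small nn.Sequential (an assumption for brevity) that registers forward hooks, runs one forward pass, and removes every handle afterwards.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
features = {}


def get_activation(name):
    def hook(module, inputs, output):
        features[name] = output.detach()  # store the output, cut off from autograd
    return hook


# Register one hook per layer of interest and keep every handle.
handles = [model[idx].register_forward_hook(get_activation(str(idx))) for idx in (0, 1)]
outputx = model(torch.randn(3, 4))
for handle in handles:
    handle.remove()

print({name: tuple(f.shape) for name, f in features.items()})  # {'0': (3, 8), '1': (3, 8)}
model(torch.randn(5, 4))        # hooks are removed, so stored features are not overwritten
print(features["0"].shape[0])   # 3
```

Keeping all handles matters: removing only the last one would leave the other hooks attached to the model.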
13. Converting between Tensor types (three ways)
- Use the dedicated conversion methods:
import torch
x = torch.randn(3, 5)
print(x)
# convert x to long
x_long = x.long()
# convert x to half
x_half = x.half()
# convert x to int
x_int = x.int()
# convert x to double
x_double = x.double()
# convert x to float
x_float = x.float()
# convert x to char
x_char = x.char()
# convert x to byte
x_byte = x.byte()
# convert x to short
x_short = x.short()
- Use the Tensor.type() method:
import torch
x = torch.randn(3, 5)
x_int = x.type(torch.IntTensor)
print(x_int)
- Use type_as(ano_tensor) to convert a tensor to the type of a given tensor:
import torch
x = torch.FloatTensor(5)
y = torch.IntTensor([10, 20])
x_int = x.type_as(y)
assert x_int.dtype == torch.int32
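The resulting dtypes can be checked directly. In this sketch the input values are arbitrary, and the `.to(dtype)` call is an addition not mentioned in the list above; it is another standard way to convert.

```python
import torch

x = torch.tensor([1.7, -2.3, 0.5])

x_long = x.long()                    # float -> int64 truncates toward zero
x_int = x.type(torch.IntTensor)      # legacy type-object form, CPU int32
x_half = x.to(torch.float16)         # .to() with a dtype also converts
y = torch.zeros(2, dtype=torch.float64)
x_like_y = x.type_as(y)              # take the dtype from another tensor

print(x_long.tolist())               # [1, -2, 0]
print(x_int.dtype, x_half.dtype, x_like_y.dtype)  # torch.int32 torch.float16 torch.float64
```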
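This article collects some small tricks I have picked up while using PyTorch and will be updated over time. If you find any mistakes, corrections are welcome!
Source: oschina
Link: https://my.oschina.net/u/4416282/blog/4941172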