This is split into two main steps:
- 1. Set up the basic development environment.
- 2. Download and configure the dataset.
- Numpy
- Pillow 1.0
- tf Slim (which is included in the “tensorflow/models/research/” checkout)
- Jupyter notebook
- Matplotlib
- Tensorflow
To install TensorFlow, the typical commands are as follows; see the official instructions for details:
# For CPU
pip install tensorflow
# For GPU
pip install tensorflow-gpu
The other toolkits:
sudo apt-get install python-pil python-numpy
sudo pip install jupyter
sudo pip install matplotlib
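Once these are installed, a quick throwaway sketch to confirm that the Python-side dependencies at least resolve (it only checks importability, not versions):

```python
import importlib.util

# Report whether each dependency listed above can be found on this machine.
for name in ("numpy", "PIL", "matplotlib", "tensorflow"):
    found = importlib.util.find_spec(name) is not None
    print(name, "OK" if found else "MISSING")
```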
In the tensorflow/models/research/ directory:
# From tensorflow/models/research/
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
Note: after running this, close all open terminals and restart them.
As a quick test, run model_test.py:
# From tensorflow/models/research/
python deeplab/model_test.py
First download the dataset from the Cityscapes website. This requires registering an account (preferably with an edu email address). Download the image archive leftImg8bit_trainvaltest.zip (11GB) and the corresponding annotation archive gtFine_trainvaltest.zip (241MB).
After downloading, unzip the archives (this takes a while). Note that my dataset directory is /root/dataset/cityscapesScripts.
Under /root/dataset/cityscapesScripts, clone the Cityscapes scripts repository:
git clone https://github.com/mcordts/cityscapesScripts.git
My project directory now looks like this:
+ /root/dataset/cityscapesScripts
  + cityscapesScripts
  + leftImg8bit
  + gtFine
  + tfrecord
  + exp               # this directory and its subdirectories must be created manually
    + train_on_train_set
      + train
      + eval
      + vis
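The exp subtree marked "created manually" above can also be created with a few lines of Python; a minimal sketch (make_exp_dirs is just a hypothetical helper, and the root path is the one used in this walkthrough):

```python
import os

def make_exp_dirs(root):
    """Create exp/train_on_train_set/{train,eval,vis} under the given root."""
    for sub in ("train", "eval", "vis"):
        os.makedirs(os.path.join(root, "exp", "train_on_train_set", sub),
                    exist_ok=True)

# For the layout above:
# make_exp_dirs("/root/dataset/cityscapesScripts")
```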
Modify the code in /root/dataset/cityscapesScripts/cityscapesScripts/preparation/createTrainIdLabelImgs.py. For convenience, I hard-code cityscapesPath directly:

cityscapesPath = '/root/dataset/cityscapesScripts'

The modified code is as follows (note that this matches the project directory above):
# The main method
def main():
    ...
    cityscapesPath = '/root/dataset/cityscapesScripts'
    searchFine   = os.path.join(cityscapesPath, "gtFine", "*", "*", "*_gt*_polygons.json")
    searchCoarse = os.path.join(cityscapesPath, "gtCoarse", "*", "*", "*_gt*_polygons.json")
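Those two search patterns simply glob for the polygon annotation files laid out as <root>/gtFine/<split>/<city>/<file>. As a standalone sketch of what they match (find_annotations is a hypothetical helper, not part of the script):

```python
import glob
import os

def find_annotations(cityscapes_path, gt_dir="gtFine"):
    """Glob for *_gt*_polygons.json files, mirroring the searchFine pattern."""
    pattern = os.path.join(cityscapes_path, gt_dir, "*", "*",
                           "*_gt*_polygons.json")
    return sorted(glob.glob(pattern))
```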
With the environment configured and the dataset downloaded, let's look at how to convert the dataset into the tfrecord files used for training.
git clone https://github.com/tensorflow/models
Note that this repository is over 400 MB; if your GitHub download speed is slow, you can fetch it with a download manager such as Xunlei first and then unzip it.
In the models-master/research/deeplab/datasets directory, open convert_cityscapes.sh and set CITYSCAPES_ROOT to your cityscapes root directory:
CITYSCAPES_ROOT="/root/dataset/cityscapesScripts"
The modified convert_cityscapes.sh looks like this:
# Exit immediately if a command exits with a non-zero status.
set -e

CURRENT_DIR=$(pwd)
WORK_DIR="."

# Root directory of the Cityscapes dataset; adjust to your setup.
CITYSCAPES_ROOT="/root/dataset/cityscapesScripts"

# Run the cityscapesScripts code to create the label images.
python "${CITYSCAPES_ROOT}/cityscapesscripts/preparation/createTrainIdLabelImgs.py"

# Create the TFRecord files.
# 1. Create the directory that will hold the TFRecords.
OUTPUT_DIR="${CITYSCAPES_ROOT}/tfrecord"
mkdir -p "${OUTPUT_DIR}"

BUILD_SCRIPT="${CURRENT_DIR}/build_cityscapes_data.py"

# 2. Run the TFRecord conversion code.
echo "Converting Cityscapes dataset..."
python "${BUILD_SCRIPT}" \
  --cityscapes_root="${CITYSCAPES_ROOT}" \
  --output_dir="${OUTPUT_DIR}"
Run the modified convert_cityscapes.sh to generate the tfrecord files:
sh convert_cityscapes.sh
The program output:
Processing 5000 annotation files
Progress: 100.0 %
Converting Cityscapes dataset...
>> Converting image 298/2975 shard 0
>> Converting image 596/2975 shard 1
>> Converting image 894/2975 shard 2
>> Converting image 1192/2975 shard 3
>> Converting image 1490/2975 shard 4
>> Converting image 1788/2975 shard 5
>> Converting image 2086/2975 shard 6
>> Converting image 2384/2975 shard 7
>> Converting image 2682/2975 shard 8
>> Converting image 2975/2975 shard 9
>> Converting image 50/500 shard 0
>> Converting image 100/500 shard 1
>> Converting image 150/500 shard 2
>> Converting image 200/500 shard 3
>> Converting image 250/500 shard 4
>> Converting image 300/500 shard 5
>> Converting image 350/500 shard 6
>> Converting image 400/500 shard 7
>> Converting image 450/500 shard 8
>> Converting image 500/500 shard 9
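As a sanity check on this log: build_cityscapes_data.py writes each split into 10 shards, so the 2975 training images come out to ceil(2975 / 10) = 298 images per shard, matching the "image 298/2975 shard 0" lines, and the 500 val images to 50 per shard. In sketch form (images_per_shard is a hypothetical helper):

```python
import math

def images_per_shard(num_images, num_shards=10):
    """Images written to each shard (the last shard may hold slightly fewer)."""
    return math.ceil(num_images / num_shards)

print(images_per_shard(2975))  # train split: 298
print(images_per_shard(500))   # val split: 50
```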
The generated files are under cityscapes/tfrecord.
Download a pretrained model from the model zoo. The pretrained weights used here are xception_cityscapes_trainfine (the download is 439M).
After downloading, extract it:
tar -zxvf deeplabv3_cityscapes_train_2018_02_06.tar.gz
Note the directory of the extracted files: /root/newP/official_tf/models-master/research/deeplab/backbone/deeplabv3_cityscapes_train.
The command format given in the official docs:
python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=90000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=769 \
    --train_crop_size=769 \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
    --train_logdir=${PATH_TO_TRAIN_DIR} \
    --dataset_dir=${PATH_TO_DATASET}
For debugging, refer to the commands in local_test.sh; several key parameters are set as follows:
- training_number_of_steps: the number of training iterations; since this is just a sanity check, I set it to a small value of 1000.
- train_crop_size: the crop size of the training images; my GPU has only 8GB of memory, so I set this to 513.
- train_batch_size: the training batch size; also because of hardware limits, I kept this at 1. The FAQ suggests setting it to 8 if you want to reproduce the paper's results.
- tf_initial_checkpoint: the pretrained initial checkpoint, i.e. the one downloaded above: /root/newP/official_tf/models-master/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt
- train_logdir: the directory for saving training weights; note that it was created when setting up the project directory. Set to '/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train'
- dataset_dir: the dataset path, i.e. the TFRecords directory created earlier. Set to '/root/dataset/cityscapesScripts/tfrecord'
python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=1000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --tf_initial_checkpoint='/root/newP/official_tf/models-master/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt' \
    --train_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train' \
    --dataset_dir='/root/dataset/cityscapesScripts/tfrecord'
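One detail worth knowing before tweaking these numbers: DeepLab's crop sizes take the form output_stride * k + 1, which is why 513 (16 * 32 + 1) works here, as do the 769, 1025 and 2049 used elsewhere in this post. A small sketch of that rule (valid_crop_size is a hypothetical helper):

```python
def valid_crop_size(size, output_stride=16):
    """True if size has the DeepLab-friendly form output_stride * k + 1."""
    return (size - 1) % output_stride == 0

for s in (513, 769, 1025, 2049):
    print(s, valid_crop_size(s))  # all True for output_stride=16
```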
An excerpt of the training output:
...
INFO:tensorflow:global step 830: loss = 0.8061 (0.309 sec/step)
INFO:tensorflow:global step 840: loss = 0.4264 (0.313 sec/step)
INFO:tensorflow:global step 850: loss = 0.3628 (0.320 sec/step)
INFO:tensorflow:global step 860: loss = 0.3926 (0.317 sec/step)
INFO:tensorflow:global step 870: loss = 4.0512 (0.310 sec/step)
INFO:tensorflow:global step 880: loss = 3.7159 (0.312 sec/step)
INFO:tensorflow:global step 890: loss = 1.1838 (0.310 sec/step)
INFO:tensorflow:global step 900: loss = 0.3242 (0.318 sec/step)
INFO:tensorflow:global step 910: loss = 0.6457 (0.322 sec/step)
INFO:tensorflow:global step 920: loss = 0.3715 (0.317 sec/step)
INFO:tensorflow:global step 930: loss = 0.4298 (0.308 sec/step)
INFO:tensorflow:global step 940: loss = 0.8075 (0.317 sec/step)
INFO:tensorflow:global step 950: loss = 2.2744 (0.322 sec/step)
INFO:tensorflow:global step 960: loss = 1.0538 (0.337 sec/step)
INFO:tensorflow:global step 970: loss = 0.4115 (0.338 sec/step)
INFO:tensorflow:global step 980: loss = 0.3840 (0.311 sec/step)
INFO:tensorflow:global step 990: loss = 0.3733 (0.381 sec/step)
INFO:tensorflow:global step 1000: loss = 0.7128 (0.341 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
ERROR1
Error message:
ModuleNotFoundError: No module named 'nets'
Solution:
export PYTHONPATH="$PYTHONPATH:/root/tensorflow/models/slim"
Add slim to PYTHONPATH and restart all terminals.
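To check up front whether the fix took effect, a throwaway sketch (can_import is a hypothetical helper; the slim path is the one from the export above):

```python
import importlib.util
import os
import sys

def can_import(module_name):
    """True if module_name can currently be found on sys.path."""
    return importlib.util.find_spec(module_name) is not None

# 'nets' lives under models/slim; if it is still missing, append that
# directory (adjust to your checkout) for the current process:
slim_dir = "/root/tensorflow/models/slim"
if not can_import("nets") and os.path.isdir(slim_dir):
    sys.path.append(slim_dir)
```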
ERROR2
Error message:
InvalidArgumentError (see above for traceback): Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2
[[Node: image_pooling/BatchNorm/moving_variance_2 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_2/tag, image_pooling/BatchNorm/moving_variance/read)]]
Out of GPU memory; consider reducing the batch size or the crop size, e.g. train_crop_size=513.
ERROR3
W tensorflow/core/framework/allocator.cc:101] Allocation of 239808000 exceeds 10% of system memory.
Also a memory issue; consider reducing the batch size or the crop size, e.g. train_crop_size=513.
The official code provides two sets of scripts here: eval and vis (visualization).
eval
We trained a model above; now let's evaluate it.
The official evaluation command format:
# From tensorflow/models/research/
python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size=1025 \
    --eval_crop_size=2049 \
    --dataset="cityscapes" \
    --checkpoint_dir=${PATH_TO_CHECKPOINT} \
    --eval_logdir=${PATH_TO_EVAL_DIR} \
    --dataset_dir=${PATH_TO_DATASET}
For debugging, refer to the commands in local_test.sh; several key parameters are set as follows:
- eval_crop_size: the crop size of the evaluation images.
- checkpoint_dir: the checkpoint to evaluate, i.e. the directory where the training weights were saved: /root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train
- eval_logdir: the directory for saving evaluation results; note that it was created when setting up the project directory. Set to '/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/eval'
- dataset_dir: the dataset path, i.e. the TFRecords directory created earlier. Set to '/root/dataset/cityscapesScripts/tfrecord'
# From tensorflow/models/research/
python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size=1025 \
    --eval_crop_size=2049 \
    --dataset="cityscapes" \
    --checkpoint_dir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train' \
    --eval_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/eval' \
    --dataset_dir='/root/dataset/cityscapesScripts/tfrecord'
The corresponding output:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-06-04-12:56:30
INFO:tensorflow:Evaluation [50/500]
INFO:tensorflow:Evaluation [100/500]
INFO:tensorflow:Evaluation [150/500]
INFO:tensorflow:Evaluation [200/500]
INFO:tensorflow:Evaluation [250/500]
INFO:tensorflow:Evaluation [300/500]
INFO:tensorflow:Evaluation [350/500]
INFO:tensorflow:Evaluation [400/500]
INFO:tensorflow:Evaluation [450/500]
INFO:tensorflow:Evaluation [500/500]
INFO:tensorflow:Finished evaluation at 2018-06-04-13:00:21
miou_1.0[0.72425]
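The last line is the metric that matters: miou_1.0[0.72425] means about 72.4% mean IoU on the val split after only 1000 fine-tuning steps. If you want to pull that number out of a log programmatically, a small sketch (parse_miou and its regex are assumptions based on the log line above):

```python
import re

def parse_miou(log_line):
    """Extract the mIoU value from a line like 'miou_1.0[0.72425]'."""
    match = re.search(r"miou_1\.0\[([0-9.]+)\]", log_line)
    return float(match.group(1)) if match else None

print(parse_miou("miou_1.0[0.72425]"))  # prints 0.72425
```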
If you use the official checkpoint directly, the results should be much better.

vis

The official visualization command format:
# From tensorflow/models/research/
python deeplab/vis.py \
    --logtostderr \
    --vis_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --vis_crop_size=1025 \
    --vis_crop_size=2049 \
    --dataset="cityscapes" \
    --colormap_type="cityscapes" \
    --checkpoint_dir=${PATH_TO_CHECKPOINT} \
    --vis_logdir=${PATH_TO_VIS_DIR} \
    --dataset_dir=${PATH_TO_DATASET}
For debugging, refer to the commands in local_test.sh; several key parameters are set as follows:
- vis_crop_size: the crop size of the images.
- checkpoint_dir: the checkpoint to visualize, i.e. the directory where the training weights were saved: /root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train
- vis_logdir: the directory for saving visualization results; note that it was created when setting up the project directory. Set to '/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/vis'
- dataset_dir: the dataset path, i.e. the TFRecords directory created earlier. Set to '/root/dataset/cityscapesScripts/tfrecord'
# From tensorflow/models/research/
python deeplab/vis.py \
    --logtostderr \
    --vis_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --vis_crop_size=1025 \
    --vis_crop_size=2049 \
    --dataset="cityscapes" \
    --colormap_type="cityscapes" \
    --checkpoint_dir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train' \
    --vis_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/vis' \
    --dataset_dir='/root/dataset/cityscapesScripts/tfrecord'
The program output:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Restoring parameters from /root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train/model.ckpt-1000
INFO:tensorflow:Visualizing batch 1 / 500
INFO:tensorflow:Visualizing batch 2 / 500
INFO:tensorflow:Visualizing batch 3 / 500
INFO:tensorflow:Visualizing batch 4 / 500
INFO:tensorflow:Visualizing batch 5 / 500
INFO:tensorflow:Visualizing batch 6 / 500
INFO:tensorflow:Visualizing batch 7 / 500
INFO:tensorflow:Visualizing batch 8 / 500
INFO:tensorflow:Visualizing batch 9 / 500
INFO:tensorflow:Visualizing batch 10 / 500
...
Some of the resulting visualizations:
Since the official code is written in a heavily engineering-oriented style, when I have time I will follow up with an analysis and implementation of a non-official DeepLabv3+ codebase, which should be easier to walk through.