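Background: TensorFlowOnYARN was open-sourced early on and is no longer maintained by its author; TonY is the system generally recommended in its place;
Environment: CentOS 7.0 or later, physical machine, Python 2.7.5 virtual environment (default), tensorflow-1.13.1;
Download: https://github.com/linkedin/TonY, git clone https://github.com/linkedin/TonY.git;
Tools: apt-get update, apt-get install wget, apt-get install vim, apt-get install git (apt-get is the Debian/Ubuntu package manager; on CentOS use yum instead), then upload the JDK and edit .bashrc to configure the Java environment;
Build: run ./gradlew build or ./gradlew build -x test; the former builds and runs the tests, the latter builds without running them; the output is written to ./tony-cli/build/libs/;
Build output: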
root@b9683a1b9302:~/TonY/tony-cli/build/libs# ll -h
total 29M
drwxr-xr-x 2 root root 4.0K Nov 4 03:17 ./
drwxr-xr-x 9 39040 staff 4.0K Nov 4 03:01 ../
-rw-r--r-- 1 root root 29M Nov 4 03:17 tony-cli-0.3.23-all.jar
-rw-r--r-- 1 root root 12K Nov 4 03:01 tony-cli-0.3.23.jar
Installing Python 3.7.0 (optional):
- wget https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz, downloads the source package;
- tar -xvf Python-3.7.0.tgz, unpacks it; cd Python-3.7.0 enters the source root;
- ./configure --enable-optimizations, generates the Makefile;
- make altinstall, compiles and installs; the python3.7 binary ends up in /usr/local/bin;
Building the Python virtual environment:
- wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz, downloads virtualenv;
- tar -xvf virtualenv-16.0.0.tar.gz, unpacks it;
- python virtualenv-16.0.0/virtualenv.py venv, creates the virtual environment;
- . venv/bin/activate, activates the virtual environment;
- pip install tensorflow==1.13.1, installs TensorFlow inside the virtual environment;
- zip -r venv.zip venv, compresses the virtual environment; the resulting archive is about 268MB;
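Because the job is later submitted with --python_binary_path=venv/bin/python, the archive has to contain that path relative to its root. The following is a minimal sketch of such a check, assuming the archive layout shown in this post; the helper name `has_python_binary` is hypothetical, and the demo uses a tiny in-memory archive as a stand-in for the real venv.zip.

```python
import io
import zipfile


def has_python_binary(zip_file, binary="venv/bin/python"):
    """Return True if the archive contains the given interpreter path."""
    with zipfile.ZipFile(zip_file) as archive:
        return binary in archive.namelist()


# Demo on a small in-memory archive standing in for venv.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("venv/bin/python", b"")
buffer.seek(0)

print(has_python_binary(buffer))  # True
```

For the real archive, pass its path instead of the buffer, for example `has_python_binary("venv.zip")`.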
Packages downloaded by pip install tensorflow==1.13.1:
- Downloading https://files.pythonhosted.org/packages/d2/ea/ab2c8c0e81bd051cc1180b104c75a865ab0fc66c89be992c4b20bbf6d624/tensorflow-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl (92.5MB)
- Downloading https://files.pythonhosted.org/packages/3b/72/e6e483e2db953c11efa44ee21c5fdb6505c4dffa447b4263ca8af6676b62/absl-py-0.8.1.tar.gz (103kB)
- Downloading https://files.pythonhosted.org/packages/88/ec/f598b633c3d5ffe267aaada57d961c94fdfa183c5c3ebda2b6d151943db6/backports.weakref-1.0.post1-py2.py3-none-any.whl
- Downloading https://files.pythonhosted.org/packages/89/ac/48dd71c2bdc8d31e367f9b72f25ccb3b89bc6b9d664fee21f9a8efa5714d/tensorboard-1.13.1-py2-none-any.whl (3.2MB)
- Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
- Downloading https://files.pythonhosted.org/packages/d7/b1/3367ea1f372957f97a6752ec725b87886e12af1415216feec9067e31df70/numpy-1.16.5-cp27-cp27mu-manylinux1_x86_64.whl (17.0MB)
- Downloading https://files.pythonhosted.org/packages/05/d2/f94e68be6b17f46d2c353564da56e6fb89ef09faeeff3313a046cb810ca9/mock-3.0.5-py2.py3-none-any.whl
- Downloading https://files.pythonhosted.org/packages/21/56/4bcec5a8d9503a87e58e814c4e32ac2b32c37c685672c30bc8c54c6e478a/Keras_Applications-1.0.8.tar.gz (289kB)
- Downloading https://files.pythonhosted.org/packages/bb/48/13f49fc3fa0fdf916aa1419013bb8f2ad09674c275b4046d5ee669a46873/tensorflow_estimator-1.13.0-py2.py3-none-any.whl (367kB)
- Downloading https://files.pythonhosted.org/packages/59/54/4441f0b3c44e38b1377d31c137cdaa6dfad225f5ee79612ed87131427baf/grpcio-1.24.3-cp27-cp27mu-manylinux2010_x86_64.whl (2.2MB)
- Downloading https://files.pythonhosted.org/packages/d1/4f/950dfae467b384fc96bc6469de25d832534f6b4441033c39f914efd13418/astor-0.8.0-py2.py3-none-any.whl
- Downloading https://files.pythonhosted.org/packages/28/6a/8c1f62c37212d9fc441a7e26736df51ce6f0e38455816445471f10da4f0a/Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41kB)
- Downloading https://files.pythonhosted.org/packages/1f/04/4e36c33f8eb5c5b6c622a1f4859352a6acca7ab387257d4b3c191d23ec1d/gast-0.3.2.tar.gz
- Downloading https://files.pythonhosted.org/packages/c5/db/e56e6b4bbac7c4a06de1c50de6fe1ef3810018ae11732a50f15f62c7d050/enum34-1.1.6-py2-none-any.whl
- Downloading https://files.pythonhosted.org/packages/c5/49/ffa7ab9c52ec56b535cffec3bc844254c073888e6d4aeee464671ac97480/protobuf-3.10.0-cp27-cp27mu-manylinux1_x86_64.whl (1.3MB)
- Downloading https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
- Downloading https://files.pythonhosted.org/packages/ce/42/3aeda98f96e85fd26180534d36570e4d18108d62ae36f87694b476b83d6f/Werkzeug-0.16.0-py2.py3-none-any.whl (327kB)
- Downloading https://files.pythonhosted.org/packages/d8/a6/f46ae3f1da0cd4361c344888f59ec2f5785e69c872e175a748ef6071cdb5/futures-3.3.0-py2-none-any.whl
- Downloading https://files.pythonhosted.org/packages/c0/4e/fd492e91abdc2d2fcb70ef453064d980688762079397f779758e055f6575/Markdown-3.1.1-py2.py3-none-any.whl (87kB)
- Downloading https://files.pythonhosted.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl
- Downloading https://files.pythonhosted.org/packages/12/90/3216b8f6d69905a320352a9ca6802a8e39fdb1cd93133c3d4163db8d5f19/h5py-2.10.0-cp27-cp27mu-manylinux1_x86_64.whl (2.8MB)
Hadoop setup: set up a Hadoop cluster (HDFS and YARN) before submitting jobs;
Project layout:
MyJob/
  myjob.sh (launch script)
  src/ (project code)
    models/
      mnist_distributed.py
  tony.xml (job configuration)
  tony-cli-0.3.23-all.jar (TonY jar)
  venv.zip (Python virtual environment)
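myjob.sh script:
#!/bin/sh
# Submit the job to YARN through TonY's ClusterSubmitter.
# Comments cannot follow a trailing backslash, so the flags are described here:
#   --python_venv         path to the zipped Python virtual environment
#   --src_dir             directory containing the project code
#   --executes            entry-point file of the project
#   --task_params         arguments passed to the entry-point file
#   --conf_file           path to the tony.xml configuration file
#   --python_binary_path  path of the Python binary inside the virtual environment
java -cp `hadoop classpath`:/home/homework/MyJob/tony-cli-0.3.23-all.jar com.linkedin.tony.cli.ClusterSubmitter \
  --python_venv=/home/homework/MyJob/venv.zip \
  --src_dir=/home/homework/MyJob/src/models \
  --executes=mnist_distributed.py \
  --task_params="--steps 1000 --data_dir /tmp/data --working_dir /tmp/model" \
  --conf_file=/home/homework/MyJob/tony.xml \
  --python_binary_path=venv/bin/python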
tony.xml configuration:
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
    <description>Number of workers</description>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
    <description>Memory per worker</description>
  </property>
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
    <description>Number of parameter servers</description>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
    <description>Memory per parameter server</description>
  </property>
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
    <description>Whether to fetch security tokens from the cluster and between the client and the AM</description>
  </property>
</configuration>
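To see how much memory this configuration asks YARN for, the file can be read with the standard library. The sketch below embeds a trimmed copy of the configuration above and sums instances times memory for the worker and ps tasks only; the ApplicationMaster container is not counted, the helper names are hypothetical, and only values of the form "4g" are handled.

```python
import re
import xml.etree.ElementTree as ET

TONY_XML = """
<configuration>
  <property><name>tony.worker.instances</name><value>2</value></property>
  <property><name>tony.worker.memory</name><value>4g</value></property>
  <property><name>tony.ps.instances</name><value>1</value></property>
  <property><name>tony.ps.memory</name><value>3g</value></property>
</configuration>
"""


def load_properties(xml_text):
    """Map each <name> to its <value> in a Hadoop-style configuration."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def total_task_memory_gb(props, job_types=("worker", "ps")):
    """Sum instances * memory over the given job types (values like '4g' only)."""
    total = 0
    for job in job_types:
        instances = int(props["tony.%s.instances" % job])
        memory = props["tony.%s.memory" % job]
        match = re.match(r"^(\d+)g$", memory)
        if match is None:
            raise ValueError("unsupported memory value: %r" % memory)
        total += instances * int(match.group(1))
    return total


props = load_properties(TONY_XML)
print(total_task_memory_gb(props))  # 11
```

With 2 workers at 4g and 1 parameter server at 3g, the task containers request 11 GB in total, which is useful to compare against the capacity of the YARN queue before submitting.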
venv.zip contents (25,967 files):
[homework@localhost]$ unzip -Z1 venv.zip | head -n 10
venv/
venv/lib/
venv/lib/python2.7/
venv/lib/python2.7/sre_compile.pyc
venv/lib/python2.7/no-global-site-packages.txt
venv/lib/python2.7/_abcoll.pyc
venv/lib/python2.7/copy_reg.py
venv/lib/python2.7/distutils/
venv/lib/python2.7/distutils/__init__.py
venv/lib/python2.7/distutils/__init__.pyc
mnist_distributed.py code (excerpt):
import json
import os

import tensorflow as tf

cluster_spec_str = os.environ["CLUSTER_SPEC"]  # read the environment variable
cluster_spec = json.loads(cluster_spec_str)  # parse the JSON cluster description
ps_hosts = cluster_spec['ps']  # parameter server addresses
worker_hosts = cluster_spec['worker']  # worker addresses
# Create the cluster from the parameter server and worker hosts
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
# Create and start the server for the local task
job_name = os.environ["JOB_NAME"]
task_index = int(os.environ["TASK_INDEX"])
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
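The excerpt depends on three environment variables being present in each task container. The sketch below shows what the parsing step sees, without requiring TensorFlow. It assumes CLUSTER_SPEC is a JSON object mapping job names to lists of host:port strings (the shape that tf.train.ClusterSpec accepts); the host names, ports, and the helper `parse_task_env` are made up for illustration.

```python
import json

# Hypothetical values; in a real run they are provided to each task container.
example_env = {
    "CLUSTER_SPEC": json.dumps({
        "ps": ["node1:2222"],
        "worker": ["node2:2222", "node3:2222"],
    }),
    "JOB_NAME": "worker",
    "TASK_INDEX": "1",
}


def parse_task_env(env):
    """Return the cluster description and this task's own identity and address."""
    cluster_spec = json.loads(env["CLUSTER_SPEC"])
    job_name = env["JOB_NAME"]
    task_index = int(env["TASK_INDEX"])
    address = cluster_spec[job_name][task_index]
    return cluster_spec, job_name, task_index, address


cluster_spec, job_name, task_index, address = parse_task_env(example_env)
print("%s %d %s" % (job_name, task_index, address))  # worker 1 node3:2222
```

Passing `os.environ` instead of `example_env` gives the same result inside a running task.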
Running the job:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/homework/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/homework/MyJob/tony-cli-0.3.23-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-11-08 10:46:51,237 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
2019-11-08 10:46:51,303 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, yarn-default.xml, yarn-site.xml, resource-types.xml, null/core-site.xml, null/hdfs-site.xml
2019-11-08 10:46:51,368 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-08 10:46:52,271 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:46:56,196 INFO cli.ClusterSubmitter: Copying /mnt/homework/MyJob/tony-cli-0.3.23-all.jar to: hdfs://localhost:9000/user/homework/.tony/5077ae03-8d28-4bc3-8195-d0daea3e3018
2019-11-08 10:46:56,231 INFO tony.TonyClient: TonY heartbeat interval [1000]
2019-11-08 10:46:56,231 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
2019-11-08 10:46:56,254 INFO tony.TonyClient: Starting client..
2019-11-08 10:46:56,258 INFO client.RMProxy: Connecting to ResourceManager at localhost/192.168.0.100:8032
2019-11-08 10:46:56,489 INFO conf.Configuration: resource-types.xml not found
2019-11-08 10:46:56,489 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-11-08 10:46:56,531 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:46:56,545 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:06,235 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,497 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,723 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,736 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,748 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,981 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:14,021 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.ApplicationMaster 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
2019-11-08 10:47:14,023 INFO tony.TonyClient: Submitting YARN application
2019-11-08 10:47:14,071 INFO impl.YarnClientImpl: Submitted application application_1573174209638_0003
2019-11-08 10:47:14,072 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://localhost:8088/proxy/application_1573174209638_0003/
2019-11-08 10:47:14,072 INFO tony.TonyClient: ResourceManager web address for application: http://localhost:8088/cluster/app/application_1573174209638_0003
2019-11-08 10:47:24,111 INFO tony.TonyClient: Driver (application master) log url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000001/homework
2019-11-08 10:47:24,111 INFO tony.TonyClient: AM host: localhost
2019-11-08 10:47:24,111 INFO tony.TonyClient: AM RPC port: 14894
2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: ps, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework status: RUNNING
2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000003/homework status: RUNNING
2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework status: RUNNING
2019-11-08 10:47:27,293 INFO tony.TonyClient: Logs for ps 0 at: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework
2019-11-08 10:47:27,293 INFO tony.TonyClient: Logs for worker 0 at: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000003/homework
2019-11-08 10:47:27,293 INFO tony.TonyClient: Logs for worker 1 at: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework
2019-11-08 10:48:42,503 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework status: SUCCEEDED
2019-11-08 10:48:43,506 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000003/homework status: SUCCEEDED
2019-11-08 10:48:44,509 INFO tony.TonyClient: Task status updated: [TaskInfo] name: ps, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework status: FINISHED
2019-11-08 10:48:45,512 INFO tony.TonyClient: Application 3 finished with YarnState=FINISHED, DSFinalStatus=SUCCEEDED, breaking monitoring loop.
2019-11-08 10:48:45,512 INFO tony.TonyClient: Link for application_1573174209638_0003's events/metrics: https://localhost:19886/jobs/application_1573174209638_0003
2019-11-08 10:48:45,518 INFO tony.TonyClient: Sent message to AM to stop.
2019-11-08 10:48:45,518 INFO tony.TonyClient: Application completed successfully
2019-11-08 10:48:45,535 INFO impl.YarnClientImpl: Killed application application_1573174209638_0003
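The "Task status updated" lines in this log are regular enough to be summarised by a script. The following sketch extracts the last status reported for each task; the regular expression is written against the log lines shown above and may need adjusting for other TonY versions, and `final_task_states` is a hypothetical helper.

```python
import re

TASK_STATUS = re.compile(
    r"Task status updated: \[TaskInfo\] name: (\w+), index: (\d+), "
    r"url: \S+ status: (\w+)"
)


def final_task_states(log_lines):
    """Return {(job_name, index): last status seen} from TonyClient log lines."""
    states = {}
    for line in log_lines:
        match = TASK_STATUS.search(line)
        if match:
            name, index, status = match.groups()
            states[(name, int(index))] = status
    return states


# Three lines copied from the log above.
sample = [
    "2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: ps, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework status: RUNNING",
    "2019-11-08 10:48:42,503 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework status: SUCCEEDED",
    "2019-11-08 10:48:44,509 INFO tony.TonyClient: Task status updated: [TaskInfo] name: ps, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework status: FINISHED",
]

states = final_task_states(sample)
print(states[("ps", 0)])      # FINISHED
print(states[("worker", 1)])  # SUCCEEDED
```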
Output:
[homework@localhost]$ ll
total 76764
-rw-rw-r-- 1 homework homework 39295624 Nov 8 11:11 model.ckpt-0.data-00000-of-00001
-rw-rw-r-- 1 homework homework 994 Nov 8 11:11 model.ckpt-0.index
-rw-rw-r-- 1 homework homework 39295624 Nov 8 11:11 model.ckpt-1002.data-00000-of-00001
-rw-rw-r-- 1 homework homework 994 Nov 8 11:11 model.ckpt-1002.index
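Job submission: (screenshot not preserved)
Job running: (screenshot not preserved)
Job completion: (screenshot not preserved)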
Terminating a job:
yarn application -kill application_1573094688604_0003 (the application ID), kills the job;
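The application ID needed for the kill command is printed by the client when the job is submitted. As a small sketch, it can be pulled out of the "Submitted application" log line and turned into the argument list for the command; `kill_command` is a hypothetical helper, and it only builds the command without running it.

```python
import re

APP_ID = re.compile(r"Submitted application (application_\d+_\d+)")


def kill_command(log_line):
    """Build the yarn kill command from a 'Submitted application' log line."""
    match = APP_ID.search(log_line)
    if match is None:
        return None
    return ["yarn", "application", "-kill", match.group(1)]


line = ("2019-11-08 10:47:14,071 INFO impl.YarnClientImpl: "
        "Submitted application application_1573174209638_0003")
print(" ".join(kill_command(line)))
# yarn application -kill application_1573174209638_0003
```

The returned list can be handed to `subprocess.call` on a machine where the yarn command is available.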
Failure test case:
- On line 190 of mnist_distributed.py, change CLUSTER_SPEC to CLUSTER_SPEC1;
- Resubmit the job; it fails and prints an error message;
- Check the logs for the detailed error message;
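The reason this change breaks the job is that os.environ behaves like a dictionary, so looking up a variable that was never set raises KeyError. The sketch below reproduces that with a plain dictionary standing in for os.environ (the exact traceback from the failed run is not reproduced here) and shows one hedged way to fail with a clearer message; `read_cluster_spec` is a hypothetical helper, not part of TonY.

```python
import json


def read_cluster_spec(env, name="CLUSTER_SPEC"):
    """Parse the cluster description, failing with a readable message if absent."""
    raw = env.get(name)
    if raw is None:
        raise RuntimeError("environment variable %s is not set" % name)
    return json.loads(raw)


# Stand-in for os.environ inside a task container (made-up addresses).
env = {"CLUSTER_SPEC": '{"ps": ["node1:2222"], "worker": ["node2:2222"]}'}

try:
    env["CLUSTER_SPEC1"]  # the misspelled name from the failure case
except KeyError as error:
    print("KeyError: %s" % error)  # KeyError: 'CLUSTER_SPEC1'

print(read_cluster_spec(env)["worker"][0])  # node2:2222
```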
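What happens during submission:
- INFO cli.ClusterSubmitter: Copying /mnt/homework/MyJob/tony-cli-0.3.23-all.jar to: hdfs://localhost:9000/user/homework/.tony/1bc58531-bc24-453c-869f-b3530f44277e, the TonY jar is uploaded to HDFS;
- INFO impl.YarnClientImpl: Submitted application application_1573183872429_0002, by this point the configuration, code, and dependencies the project needs have been copied to HDFS and the application has been submitted to YARN;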
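Notes:
- When a job terminates normally, its data on HDFS is cleaned up automatically; when a job is terminated abnormally, intermediate data is left behind on HDFS and has to be cleaned up periodically;
- If the Hadoop web console cannot be reached from a Windows machine, add the cluster machine's hostname to C:\Windows\System32\drivers\etc\hosts;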
Source: oschina
Link: https://my.oschina.net/u/1376494/blog/3238169